Matthew-Mosior / text-compression

https://hackage.haskell.org/package/text-compression-0.1.0.25
BSD 3-Clause "New" or "Revised" License
3 stars 1 forks source link

Any strategies for rendering the compression result? #35

Open ddssff opened 1 year ago

ddssff commented 1 year ago

I have looked over this package briefly, and I am wondering how you could take the result of, say, textToBWTToRLET :: Text -> RLET and convert it to a string in such a way that it can be uniquely converted back to an RLET, so it could then be decompressed? I read over some general RLE papers and they seem to always use examples that have no digits.

Matthew-Mosior commented 1 year ago

Hi @ddssff, thanks for taking a look at this package!

Just to make sure I am understanding correctly, you would like to be able to convert the result of textToBWTToRLET (a RLET) and convert it to a String, and then be able to take said String and convert it back to the RLET?

One of the advantages to this implementation of RLE (or for that matter MTF or FM-Index) on an arbitrary ByteString or Text input is that the virtual EOF character (many academic papers going over RLE, MTF or FM-Index use the $ character as the virtual EOF character) is replaced with a Nothing (hopefully keeping this implementation as Haskelly as possible). To look into some initial discussion I had when I was first authoring this package, please see this reddit thread: https://www.reddit.com/r/haskell/comments/yjc4qv/new_hackage_library_textcompression

There isn't a function in this package that turns the Nothing into a virtual EOF character of choice, but this is a cool idea! I suppose there would have to be the inverse functionality as well, which would be a function that takes the String/Text as well as the virtual EOF character that was defined by the user.

I'll look into adding this functionality.

ddssff commented 1 year ago

Yes, that's exactly what I'm looking for. I love the representation you are using, but some helper functions at the end for converting that representation to and from something string-like. In my particular case I'd like to use it as an html property, so that has its own constraints. This might turn into a whole thing of its own about escaping and un-escaping.

ddssff commented 1 year ago

Actually this might best be integrated with https://hackage.haskell.org/package/zenc - it looks like it only outputs alphanumeric characters, so one could use _ or $ for the eol.

Matthew-Mosior commented 1 year ago

Great! Yeah this shouldn't be hard to do at all.

Based on the intention of this package (compression algorithms surrounding textual data), I think it may be best to keep this functionality within this package? It appears that the zenc (https://hackage.haskell.org/package/zenc) has a different focus (String -> C name) than that of pure textual compression.

I'm thinking for this functionality, it could choose from an ordered list of "preferred" characters to use (to replace the Nothing with). That way, if the RLET contains, say $ or _, then it should still be able to produce a String. Maybe the ordering should be something like: $ -> _ -> @ -> etc.

To invert back to the RLET from the String, the user would have to specific the appropriate character that was used for the conversion (to know how to convert it back).

@ddssff what do you think?

ddssff commented 1 year ago

Do we need to worry about distinguishing the RLE counts from digits in the string?

Matthew-Mosior commented 1 year ago

@ddssff great catch, we certainly do.

Maybe a custom parser can ensure appropriate positioning of counts vs. data and then these can allow us to differentiate when deciding upon the preferred virtual EOF character.

ddssff commented 1 year ago

In the case of zenc, there may be enough left over ASCII characters to create an alternative set of digits to be used for the run lengths. So more generally, a set of characters which represent themselves (the "preferred" characters above) and a second list of characters to be treated as digits? And an EOF character.

Matthew-Mosior commented 1 year ago

Would the implicit structure of the RLE help here?

For any ‘RLET’, all even (starting with zero) characters would be the run length and the odd (starting with one) would be the digit/character of the original string.

Maybe it’s possible to figure out the set of characters in the run lengths and associated characters and then pick one that doesn’t exist in either as the EOF character? My apologies if I’m misunderstanding you.

ddssff commented 1 year ago

Whatever works and makes a short string!

Matthew-Mosior commented 1 year ago

Sounds good! Let me work on this, and then I’ll test and release a new version to hackage.

ddssff commented 1 year ago

Poking around in the code, it looks to me like a better representation for the Seq elements than Maybe a would be Maybe (Either Int a). But that could be a bit of work to convert. I tried doing a little:

diff --git a/src/Data/RLE/Internal.hs b/src/Data/RLE/Internal.hs
index 5f637f9..227081e 100644
--- a/src/Data/RLE/Internal.hs
+++ b/src/Data/RLE/Internal.hs
@@ -3,7 +3,7 @@
 {-# LANGUAGE Strict           #-}
 {-# LANGUAGE DeriveGeneric    #-}
 {-# LANGUAGE TypeApplications #-}
-
+{-# LANGUAGE CPP              #-}

 -- |
 -- Module      :  Data.RLE.Internal
@@ -76,7 +76,8 @@ module Data.RLE.Internal ( -- * Base RLE types
                            FSTRLESeqT,
                            pushFSTRLESeqT,
                            emptyFSTRLESeqT,
-                           seqFromRLET
+                           seqFromRLET,
+                           Item, left, left', fromLeft, right
                          ) where

 import Control.Monad as CM
@@ -97,12 +98,18 @@ import Prelude as P

 {-Base level types.-}

+type Item a = Maybe (Either Int a)
+left = Left
+left' = Left
+fromLeft = (\(Left a) -> a)
+right = Right
+
 -- | Basic RLE ('ByteString') data type.
-newtype RLEB = RLEB (Seq (Maybe ByteString))
+newtype RLEB = RLEB (Seq (Item ByteString))
   deriving (Eq,Ord,Show,Read,Generic)

 -- | Basic RLE ('Text') data type.
-newtype RLET = RLET (Seq (Maybe Text))
+newtype RLET = RLET (Seq (Item Text))
   deriving (Eq,Ord,Show,Read,Generic)

 {-------------------}
@@ -111,14 +118,14 @@ newtype RLET = RLET (Seq (Maybe Text))
 {-toRLE (ByteString) functions.-}

 -- | Abstract 'RLESeqB' type utilizing a 'Seq'.
-type RLESeqB = Seq (Maybe ByteString)
+type RLESeqB = Seq (Item ByteString)

 -- | Abstract data type representing a 'RLESeqB' in the (strict) ST monad.
 type STRLESeqB s a = STRef s RLESeqB

 -- | State function to push 'RLESeqB' data into stack.
-pushSTRLESeqB :: STRLESeqB s (Maybe ByteString)
-              -> Maybe ByteString
+pushSTRLESeqB :: STRLESeqB s (Item ByteString)
+              -> Item ByteString
               -> ST s ()
 pushSTRLESeqB s Nothing  = do
   s2 <- readSTRef s
@@ -132,18 +139,18 @@ emptySTRLESeqB :: ST s (STRLESeqB s a)
 emptySTRLESeqB = newSTRef DS.empty

 -- | Abstract 'STRLETempB' and associated state type.
-type STRLETempB s a = STRef s (Maybe ByteString)
+type STRLETempB s a = STRef s (Item ByteString)

 -- | State function to update 'STRLETempB'.
-updateSTRLETempB :: STRLETempB s (Maybe ByteString)
-                 -> Maybe ByteString
+updateSTRLETempB :: STRLETempB s (Item ByteString)
+                 -> Item ByteString
                  -> ST s ()
 updateSTRLETempB s Nothing  = writeSTRef s Nothing
 updateSTRLETempB s (Just e) = writeSTRef s (Just e)

 -- | State function to create empty 'STRLETempB' type.
 emptySTRLETempB :: ST s (STRLETempB s a)
-emptySTRLETempB = newSTRef (Just BS.empty)
+emptySTRLETempB = newSTRef (Just (right BS.empty))

 -- | Abstract 'STRLECounterB' state type.
 type STRLECounterB s a = STRef s Int
@@ -185,8 +192,7 @@ seqToRLEB (x DS.:<| xs) = do
         cbrlets <- readSTRef brlets
         pushSTRLESeqB brless
                       (Just      $
-                       BSC8.pack $
-                       show cbrlecs)
+                       left cbrlecs)
         pushSTRLESeqB brless
                       cbrlets
         pure ()
@@ -196,14 +202,12 @@ seqToRLEB (x DS.:<| xs) = do
         if | isNothing y
            -> do pushSTRLESeqB brless
                                (Just      $
-                                BSC8.pack $
-                                show cbrlecs)
+                                left cbrlecs)
                  pushSTRLESeqB brless
                                cbrlets
                  pushSTRLESeqB brless
                                (Just      $
-                                BSC8.pack $
-                                show (1 :: Int))
+                                left (1 :: Int))
                  pushSTRLESeqB brless
                                Nothing
                  updateSTRLETempB brlets
@@ -231,8 +235,7 @@ seqToRLEB (x DS.:<| xs) = do
            | otherwise
            -> do pushSTRLESeqB brless
                                (Just      $
-                                BSC8.pack $
-                                show cbrlecs)
+                                left cbrlecs)
                  pushSTRLESeqB brless
                                cbrlets
                  updateSTRLECounterB brlecs
@@ -250,14 +253,14 @@ seqToRLEB (x DS.:<| xs) = do
 {-toRLE (Text) functions.-}

 -- | Abstract 'RLESeqT' type utilizing a 'Seq'.
-type RLESeqT = Seq (Maybe Text)
+type RLESeqT = Seq (Item Text)

 -- | Abstract data type representing a 'RLESeqT' in the (strict) ST monad.
 type STRLESeqT s a = STRef s RLESeqT

 -- | State function to push 'RLESeqT' data into stack.
-pushSTRLESeqT :: STRLESeqT s (Maybe Text)
-              -> (Maybe Text)
+pushSTRLESeqT :: STRLESeqT s (Item Text)
+              -> (Item Text)
               -> ST s ()
 pushSTRLESeqT s Nothing  = do
   s2 <- readSTRef s
@@ -271,18 +274,18 @@ emptySTRLESeqT :: ST s (STRLESeqT s a)
 emptySTRLESeqT = newSTRef DS.empty

 -- | Abstract 'STRLETempT' state type.
-type STRLETempT s a = STRef s (Maybe Text)
+type STRLETempT s a = STRef s (Item Text)

 -- | State function to update 'STRLETempT'.
-updateSTRLETempT :: STRLETempT s (Maybe Text)
-                 -> (Maybe Text)
+updateSTRLETempT :: STRLETempT s (Item Text)
+                 -> (Item Text)
                  -> ST s ()
 updateSTRLETempT s Nothing  = writeSTRef s Nothing
 updateSTRLETempT s (Just e) = writeSTRef s (Just e)

 -- | State function to create empty 'STRLETempT' type.
 emptySTRLETempT :: ST s (STRLETempT s a)
-emptySTRLETempT = newSTRef (Just DText.empty)
+emptySTRLETempT = newSTRef (Just (right DText.empty))

 -- | Abstract 'STRLECounterT' and associated state type.
 type STRLECounterT s a = STRef s Int
@@ -324,8 +327,7 @@ seqToRLET (x DS.:<| xs) = do
         ctrlets <- readSTRef trlets
         pushSTRLESeqT trless
                       (Just       $
-                       DText.pack $
-                       show ctrlecs)
+                       left' ctrlecs)
         pushSTRLESeqT trless
                       ctrlets
         pure ()
@@ -335,14 +337,12 @@ seqToRLET (x DS.:<| xs) = do
         if | isNothing y
            -> do pushSTRLESeqT trless
                                (Just       $
-                                DText.pack $
-                                show ctrlecs)
+                                left' ctrlecs)
                  pushSTRLESeqT trless
                                ctrlets
                  pushSTRLESeqT trless
                                (Just       $
-                                DText.pack $
-                                show (1 :: Int))
+                                left' (1 :: Int))
                  pushSTRLESeqT trless
                                Nothing
                  updateSTRLETempT trlets
@@ -370,8 +370,7 @@ seqToRLET (x DS.:<| xs) = do
            | otherwise
            -> do pushSTRLESeqT trless
                                (Just       $
-                                DText.pack $
-                                show ctrlecs)
+                                left' ctrlecs)
                  pushSTRLESeqT trless
                                ctrlets
                  updateSTRLECounterT trlecs
@@ -389,14 +388,14 @@ seqToRLET (x DS.:<| xs) = do
 {-fromRLE (ByteString) functions.-}

 -- | Abstract 'FRLESeqB' type utilizing a 'Seq'.
-type FRLESeqB = Seq (Maybe ByteString)
+type FRLESeqB = Seq (Item ByteString)

 -- | Abstract data type representing a 'FRLESeqB' in the (strict) ST monad.
 type FSTRLESeqB s a = STRef s FRLESeqB

 -- | State function to push 'FRLESeqB' data into stack.
-pushFSTRLESeqB :: FSTRLESeqB s (Maybe ByteString)
-               -> (Maybe ByteString)
+pushFSTRLESeqB :: FSTRLESeqB s (Item ByteString)
+               -> (Item ByteString)
                -> ST s ()
 pushFSTRLESeqB s Nothing  = do
   s2 <- readSTRef s
@@ -431,9 +430,8 @@ seqFromRLEB xs              = do
                                 Nothing
                  pure ()
            | otherwise
-           -> do let y1' = read        $
-                           BSC8.unpack $
-                           fromJust y1 :: Int
+           -> do let y1' :: Int
+                     y1' = (\(Left a) -> a) (fromJust y1)
                  let y2' = fromJust y2
                  CM.replicateM_ y1'
                                 (pushFSTRLESeqB fbrless
@@ -447,8 +445,7 @@ seqFromRLEB xs              = do
                  iFRLEB ys
                         fbrless
            | otherwise
-           -> do let y1' = read        $
-                           BSC8.unpack $
+           -> do let y1' = fromLeft $
                            fromJust y1 :: Int
                  let y2' = fromJust y2
                  CM.replicateM_ y1'
@@ -466,14 +463,14 @@ seqFromRLEB xs              = do
 {-fromRLE (Text) functions.-}

 -- | Abstract 'FRLESeqT' type utilizing a 'Seq'.
-type FRLESeqT = Seq (Maybe Text)
+type FRLESeqT = Seq (Item Text)

 -- | Abstract data type representing a 'FRLESeqT' in the (strict) ST monad.
 type FSTRLESeqT s a = STRef s FRLESeqT

 -- | State function to push 'FSTRLESeqT' data into stack.
-pushFSTRLESeqT :: FSTRLESeqT s (Maybe Text)
-               -> (Maybe Text)
+pushFSTRLESeqT :: FSTRLESeqT s (Item Text)
+               -> (Item Text)
                -> ST s ()
 pushFSTRLESeqT s Nothing  = do
   s2 <- readSTRef s
@@ -508,8 +505,7 @@ seqFromRLET xs              = do
                                 Nothing
                  pure ()
            | otherwise
-           -> do let y1' = read         $
-                           DText.unpack $
+           -> do let y1' = fromLeft $
                            fromJust y1 :: Int
                  let y2' = fromJust y2
                  CM.replicateM_ y1'
@@ -524,8 +520,7 @@ seqFromRLET xs              = do
                  iFRLET ys
                         ftrless
            | otherwise
-           -> do let y1' = read         $
-                           DText.unpack $
+           -> do let y1' = fromLeft $
                            fromJust y1 :: Int
                  let y2' = fromJust y2
                  CM.replicateM_ y1'

A little test suite would help here.

Matthew-Mosior commented 1 year ago

This is great, thank you! Let me take a crack at switching over to that representation. And totally agree, a test suite is in need indeed.

I’m on vacation now, so I probably won’t be able to work on this until next week.

ddssff commented 1 year ago

Actually, even better than Maybe (Either Int a) would be Maybe (Run Int a), where each run included a count and a character.