AndrasKovacs / dawg

Generation and traversal of highly compressed directed acyclic word graphs.
Other
7 stars 3 forks source link

Use ByteString's instead? #1

Open unhammer opened 8 years ago

unhammer commented 8 years ago

This seems to work for storing full UTF-8 stuff in a packed-dawg:

import           Data.Text.Internal.Unsafe.Char (unsafeChr8)
import qualified Data.ByteString                as BS
import qualified Data.DAWG.Packed               as DP
import           Data.Text                         (Text)
import           Data.Text.Encoding       (encodeUtf8)

bsAsString :: BS.ByteString -> [Char]
bsAsString t = map unsafeChr8 $ BS.unpack t
textAsString :: Text -> [Char]
textAsString = bsAsString .  encodeUtf8

test =
   let d = DP.fromList $ map textAsString $ ["ł"] in
   DP.member (textAsString "ł") d && not (DP.member (textAsString "B") d)

The test there would fail if we just truncated the bytes (e.g. utf8-string has this Data.ByteString.UTF8.toString function that does that, and if that's used as the definition of bsAsString above, the test fails).

Would it make sense to simply store ByteStrings/Word8's in the first place in packed-dawg? I know there is bytestring-trie, which is a great package, but your packed-dawg uses even less memory (due to only keys, no values I guess?), while not being that much slower.

AndrasKovacs commented 8 years ago

Yeah, it makes sense to use Word8, since we already only store Word8 (and truncate away anything beyond the first byte). I don't really like ByteString though, because its memory model is plainly bad for most purposes except FFI (fragmentation because of pinned memory). Unfortunately there isn't another String-like type for plain non-UTF bytestrings, maybe Vector Word8 or Vector Char8 could be satisfactory.

unhammer commented 8 years ago

Huh, had never heard of those downsides before, but that makes sense then.

Would http://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString-Short.html be an alternative?

AndrasKovacs commented 8 years ago

That would be a better representation, but the obscurity and super-minimalistic API is a drawback.

There's a general lack of good minimum-overhead data structures in Haskell. From the top of my head:

But going back to this issue, I'm not sure when I'll get to refresh this package. There are a number of significant possible optimizations I can think of for construction & traversal, when I'll have time I'll do them and probably switch the API to SmallByteString, and leave this issue open until then.