Open unhammer opened 8 years ago
Yeah, it makes sense to use `Word8`, since we already only store `Word8` (and truncate away anything beyond the first byte). I don't really like `ByteString` though, because its memory model is plainly bad for most purposes except FFI (fragmentation because of pinned memory). Unfortunately there isn't another `String`-like type for plain non-UTF bytestrings, maybe `Vector Word8` or `Vector Char8` could be satisfactory.
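To make the truncation concrete, here is a minimal sketch using only `base`. The helper names `truncateChar` and `widenByte` are hypothetical and are not taken from the package; they only assume that a `Char` is reduced to its lowest byte when stored.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: keep only the lowest byte of a code point.
-- fromIntegral wraps modulo 256 when narrowing Int to Word8.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Hypothetical helper: widen a stored byte back into a Char.
widenByte :: Word8 -> Char
widenByte = chr . fromIntegral

main :: IO ()
main = do
  -- ASCII characters fit in one byte and survive unchanged.
  print (map truncateChar "abc")                       -- [97,98,99]
  -- U+03BB (Greek small lambda) is 955; only 0xBB (187) is kept.
  print (truncateChar '\x3BB')                         -- 187
  -- The round trip therefore yields a different character.
  print (widenByte (truncateChar '\x3BB') == '\x3BB')  -- False
```

Anything outside the range 0..255 comes back as a different character, which is why a byte-oriented key type seems like the more honest interface.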
Huh, had never heard of those downsides before, but that makes sense then.
Would http://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString-Short.html be an alternative?
That would be a better representation, but its obscurity and super-minimalistic API are drawbacks.
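For reference, a small sketch of what the `Data.ByteString.Short` API offers. It uses only functions that module exports (`pack`, `unpack`, `length`, `toShort`, `fromShort`); the value `key` is just an illustrative example, not something from the package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Short as S

-- Illustrative key: the two UTF-8 bytes of U+03BB.
key :: S.ShortByteString
key = S.pack [0xCE, 0xBB]

main :: IO ()
main = do
  print (S.length key)   -- 2
  print (S.unpack key)   -- [206,187]
  -- Converting to a regular ByteString and back copies the bytes.
  let bs = S.fromShort key :: B.ByteString
  print (S.toShort bs == key)  -- True
```

Most richer operations require converting to a regular `ByteString` first, which is the API drawback mentioned above.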
There's a general lack of good minimum-overhead data structures in Haskell. Off the top of my head:
- `Text` is not a plain Unicode string, but actually a slice, so the empty string costs 6 (!) machine words.
- `ByteString` is actually a slice, moreover a slice over pinned memory.
- `Vector` is also a slice. I've seen 30% overhead in bytecode interpretation from `Vector` slice offsets. Moreover, most operations are mediated through stream bundle fusion, which sometimes causes bad slowdowns when fusion fails to trigger properly.
- `Array` is also a slice (dammit slices!).
- `primitive` has honest slice-less arrays and bytearrays, but the API is barely usable and almost non-existent. Moreover, `primitive` doesn't even support `SmallArray#`, which is often very useful and has been available since GHC 7.10.

But going back to this issue, I'm not sure when I'll get to refresh this package. There are a number of significant possible optimizations I can think of for construction & traversal. When I have time I'll do them and probably switch the API to `ShortByteString`, and leave this issue open until then.
This seems to work for storing full UTF-8 stuff in a packed-dawg:
The test there would fail if we just truncated the bytes (e.g. `utf8-string` has this `Data.ByteString.UTF8.toString` function that does that, and if that's used as the definition of `bsAsString` above, the test fails).

Would it make sense to simply store `ByteString`s/`Word8`s in the first place in `packed-dawg`? I know there is `bytestring-trie`, which is a great package, but your `packed-dawg` uses even less memory (due to only keys, no values I guess?), while not being that much slower.
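As a rough illustration of the byte-per-`Char` workaround discussed above, here is a minimal sketch. The names `bytesToChars` and `charsToBytes` are hypothetical and this is not the snippet from the earlier comment; it only assumes that every `Char` handed to the structure stays in the range 0..255, so nothing is lost to truncation. `Data.ByteString.Char8.unpack` performs the same byte-to-`Char` mapping.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- Hypothetical helper: expose each raw byte as a Char in 0..255.
bytesToChars :: B.ByteString -> String
bytesToChars = map (chr . fromIntegral) . B.unpack

-- Hypothetical inverse: valid only for Chars in 0..255.
charsToBytes :: String -> B.ByteString
charsToBytes = B.pack . map (fromIntegral . ord)

-- The UTF-8 encoding of U+03BB, as raw bytes.
utf8Lambda :: B.ByteString
utf8Lambda = B.pack [0xCE, 0xBB]

main :: IO ()
main = do
  -- Two Chars, one per byte, rather than one decoded code point.
  print (length (bytesToChars utf8Lambda))                      -- 2
  -- The round trip preserves the bytes exactly.
  print (charsToBytes (bytesToChars utf8Lambda) == utf8Lambda)  -- True
```

Storing `Word8` keys directly would make this wrapping unnecessary.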