Closed raehik closed 1 year ago
I'm getting a test failure on my local GHC 8.8.4 for big-endian Word16
and Word32
parse tests. For Word16
:
32768
is selected0x80 0x00
(big endian)Similar for Word32
. It sounds like the corresponding signed type IntX
is poking its head in (where 32768 overflows to -32768). I have no idea why this might happen, it doesn't for any of my other local GHCs or on CI. Just a heads up, I suppose...
Re above, I'm relatively sure it's a weird local issue due to using a newer LLVM (works fine when I turn off -fllvm
). Surprising, but GHC does complain when compiling, so I suppose there was a reason.
This is ready, but could be tweaked: commentary is rough and I kind of threw the ZigZag stuff in there arbitrarily (it's only exported from an internal module).
This feels a bit out of place, since this looke like it's specifically for parsing protobuf bytestrings, or this is a widely used encoding outside of protobuf?
There are variations used in lots of file/message encodings. This one should also match LEB128 (which I haven't heard of before, but apparently deserves a Wikipedia page). Everyone does something slightly different though, switching how continuations are done (on 0 or on 1), altering representation to remove redundancy (Git's varints).
I wrote my own parsers for these, but handling them in flatparse would be faster. They feel like useful, if niche, parsers. In a similar vein I'd like to provide an octal ASCII digit integer parser -- seems really weird, but tar
uses them everywhere, and they aren't much different to hex/decimal ASCII digit integer parsers.
I suppose they could go directly into packages that handle the specific format/encoding. But the integer encoding is a general building block that might show up elsewhere. It would be convenient to provide efficient parsers for the most common ones.
Variable length integer encodings (varints) are a pain, usually necessitating byte-by-byte parsing. So at the very least doing it in Flatparse would speed things up tremendously.
There are a few common varint encodings. protobuf's encoding is pretty common so this is the one introduced here. I've got some slow parsers for other ones too, namely Git's varints (which remove encoding redundancy), BPS (a binary patch format) and VLQ (yet another standard).