Add parser for variable-length integers

AndrasKovacs / flatparse

Fast parsing from bytestrings

MIT License

146 stars 12 forks source link

Add parser for variable-length integers #31

Closed raehik closed 1 year ago

raehik commented 1 year ago

Variable length integer encodings (varints) are a pain, usually necessitating byte-by-byte parsing. So at the very least doing it in Flatparse would speed things up tremendously.

There are a few common varint encodings. protobuf's encoding is pretty common so this is the one introduced here. I've got some slow parsers for other ones too, namely Git's varints (which remove encoding redundancy), BPS (a binary patch format) and VLQ (yet another standard).

raehik commented 1 year ago

I'm getting a test failure on my local GHC 8.8.4 for big-endian Word16 and Word32 parse tests. For Word16:

the value 32768 is selected
it is correctly serialized to 0x80 0x00 (big endian)
the test fails with "expected 32768, got -32768"

Similar for Word32. It sounds like the corresponding signed type IntX is poking its head in (where 32768 overflows to -32768). I have no idea why this might happen, it doesn't for any of my other local GHCs or on CI. Just a heads up, I suppose...

raehik commented 1 year ago

Re above, I'm relatively sure it's a weird local issue due to using a newer LLVM (works fine when I turn off -fllvm). Surprising, but GHC does complain when compiling, so I suppose there was a reason.

raehik commented 1 year ago

This is ready, but could be tweaked: commentary is rough and I kind of threw the ZigZag stuff in there arbitrarily (it's only exported from an internal module).

dminuoso commented 1 year ago

This feels a bit out of place, since this looke like it's specifically for parsing protobuf bytestrings, or this is a widely used encoding outside of protobuf?

raehik commented 1 year ago

There are variations used in lots of file/message encodings. This one should also match LEB128 (which I haven't heard of before, but apparently deserves a Wikipedia page). Everyone does something slightly different though, switching how continuations are done (on 0 or on 1), altering representation to remove redundancy (Git's varints).

I wrote my own parsers for these, but handling them in flatparse would be faster. They feel like useful, if niche, parsers. In a similar vein I'd like to provide an octal ASCII digit integer parser -- seems really weird, but tar uses them everywhere, and they aren't much different to hex/decimal ASCII digit integer parsers.

I suppose they could go directly into packages that handle the specific format/encoding. But the integer encoding is a general building block that might show up elsewhere. It would be convenient to provide efficient parsers for the most common ones.