bincode-org / bincode

A binary encoder / decoder implementation in Rust.
MIT License
2.68k stars 269 forks source link

Holes in the spec #517

Open Evrey opened 2 years ago

Evrey commented 2 years ago

Greetings!

This is a close look at the bincode2 wire format specification, as of 04-03-12022 HE. It shall be noted that i look at the spec from the hypothetical perspective of someone who’s main programming language is not Rust, yet wishes to implement a compatible library for, say, C++.

From that perspective, an ideal long-term situation would be to have a separate spec document, say, a Markdown file, that is largely language agnostic. Being Markdown, the spec can then be handily included into crate docs via #![doc = include_str!()].

Warning: This all will read super-pedantic and nit-picky. But that is how specs work. You cannot have holes in a spec, which requires specifying the »obvious«.

Definitions

Endian

By default bincode will serialize values in little endian encoding.

What values? You mean values of select »basic types«. You don’t really mean all values. If you were to look at values of entire structs, that’d involve reördering fields, viewing the entirety of the struct as a single to-be-byteswapped datum. I suggest just enumerating the basic types that have a specific byte order. Or if it happens to be all basic types, just saying »values of basic types« will suffice.

Basic Types

Whilst deserializing every other value [>1 for bool] will throw an error.

Adds a lot of error branches that could be avoided by demanding that true is serialised as 1, but that during deserialisation e.g. x > 0 is true. This would be a non-breaking change for the v1 wire format.

All basic numeric types will be encoded based on the configured IntEncoding.

All floating point types will take up exactly 4 (for f32) or 8 (for f64) bytes.

Clarify that by »floating point« you mean:

All tuples have no additional bytes, and are encoded in their specified order, e.g.

»Specified order« implies there being a separate spec on how to serialise tuples. What you want to say is: »Tuple fields are written first-to-last in declaration order, with no additional meta data.« Redundant to a new section mentioned later.

Btw., Unit, aka. () in Rust, is also a basic type worth mentioning.

Even more fun if you consider Never/Bottom, aka. ! or Infallible.

Also lacks a mention of how to encode char and what it represents and how to handle illegal chars. And similarly to bool, one could opt for having illegal Unicode Scalar Values at serialisation time being impossible and making deserialising illegal Unicode Scalar Values deserialise as the Unicode replacement character. And yeah, state that a char’s Unicode Scalar Value is encoded as u32. So there is no room for confusion with languages that have 8-bit chars (C, C++), 16-bit chars (Java, C#), or at-least-21-bit chars (the only sane ones).

Variant Encoding

followed by a [type] with value u

»Followed by the FixintEncoding for the value u of type [type].«

As it stands there, recursively applying VarintEncoding in an endless loop is spec-conforming.

usize is being encoded/decoded as a u64

Just an FYI that on RISC-V 128 usize is 128 bits in Rust. Ideally make the representation of usize configurable and have it be u64 by default for backwards compat.

Then, for signed integers, we first

Scratch the »then«.

Then, for signed integers, we first convert to unsigned using the zigzag algorithm, and then encode them as we do for unsigned integers generally.

Or replace the whole thing with: »Signed integers are first converted to unsigned integers of the same size using the zigzag algorithm. The unsigned result is then serialised using the VarintEncoding.

This zigzag algorithm is why it’s important to state that signed integers are encoded as 2’s complement. Further more, the zigzag algorithm as shown requires wrapping arithmetic, which is not stated anywhere.

This is the documentation for »VarintEncoding«, after all. »Generally« implies following the global config, which can be FixintEncoding.

the obvious algorithm, where we encode the cast values, gives a very large encoding for all negative values.

For some definition of obvious. Bounds-checking the negative size is a valid technique as well, cutting signed integers just as well in size, making deserialisation just a matter of sign-extension, but serialisation a wee bit more expensive. (Unless leading zero trickery is used.) The real benefit of zigzag is that the math is simple, it has few branches, and that it distributes computational overhead evenly across serialisation and deserialisation.

FixintEncoding

Fixed size integers are encoded directly

For some definition of directly. This is where you can move your endianness section to. State that you split integers up into arrays of bytes, in which order you split them up, and in which order these bytes are then written to the wire.

Enum discriminants are encoded as u32

As this reads, discriminants (change of terminology in the middle, btw.) will be »encoded directly«, for we are in the FixintEncoding section. State that: »Enum variants [or ›discriminants‹ or ›tags‹, pick one] are serialised as u32 integers using the IntEncoding.«

This now means that depending on the config they will use VarintEncoding or FixintEncoding, which is what i assume that sentence meant.

Better move this into an EnumEncoding section.

Lengths and usize are encoded as u64

Redundant. When lengths occur, just state they’re serialised as usize.

Enums are encoded with their variant first, followed by optionally the variant fields.

As this reads, a spec-conforming impl is allowed to ignore all variant fields and only write the discriminator. Better, with interchangable terminology:

Both named and unnamed fields are serialized with their values only, and therefor encode to the same value.

This can be scratched with the new TupleEncoding and StructEncoding sections.

TupleEncoding

State that tuple fields are serialised in first-to-last declaration order. Also note how field alignment and padding are handled.

StructEncoding

State that struct fields are serialised in first-to-last declaration order, with no meta data representing field names. Also note how field alignment and padding are handled.

Collections

Unspecified order in which collection entries should be iterated-over. Also unclear how e.g. a 2D augmented interval tree is supposed to be serialised. Or even just a binary tree. Both of which are special kinds of collections.

From these derive a guide on how collections »should« be implemented and where to look for how they are implemented. Write a formal spec for how bincode2 implements the standard collections.

Suggestion for a guide:

Biggest thing to spec:

All in all this section is very difficult to spec, given that it is very dependent on standard library and language feature quirks. But an implementer should know how to translate between e.g. a Rust Vec and a C++ std::vector, or between a Rust HashMap or a C++ std::unordered_map, etc.

Side note:

String and &str

No mention of UTF-8 and how to handle it. No mention of whether or not there is a terminating U+0000, no explanation for what »treated as« means.

»Text is serialised as a collection of bytes, using the UTF-8 encoding. There is no U+0000 NUL terminator, U+0000 is free to appear within the text. There is also no Byte Order Mark written, U+FEFF ZERO WIDTH NO-BREAK SPACE is free to appear within the text. During deserialisation, [whatever the UTF-8 error handling strategy is].«

Further more, being a data interchange format, specify how bincode2 handles Unicode Noncharacters.


There may still be a few holes, but brain’s exhausted. So there’s that, hope it helps.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

VictorKoenders commented 2 years ago

No, bad bot