Holes in the spec - Githubissues

Greetings!

This is a close look at the bincode2 wire format specification, as of 04-03-12022 HE. It shall be noted that i look at the spec from the hypothetical perspective of someone who’s main programming language is not Rust, yet wishes to implement a compatible library for, say, C++.

From that perspective, an ideal long-term situation would be to have a separate spec document, say, a Markdown file, that is largely language agnostic. Being Markdown, the spec can then be handily included into crate docs via #![doc = include_str!()].

Warning: This all will read super-pedantic and nit-picky. But that is how specs work. You cannot have holes in a spec, which requires specifying the »obvious«.

Definitions

List machine types for the wire format, aka. basic types, and what their representation outside the wire format is. Such as:

• i16: A 16-bit signed 2’s complement binary integer value.
Enumerate jargon, such as what you mean by enum »variants« and »variant values«. Many programming languages call the »variant« a »discriminator« or »tag«. And many programming languages calling them »variant« don’t have »variant values«.
It’s also important to mention how the format treats data alignment and padding. This especially influences how to read the term »follows« in the spec. For depending on the alignment and padding strategy, »follows« can mean very different things.
It’s also nowhere stated in which order and in what amounts data is written or read. The implication is that you have a stream of bytes, FIFO.
- What happens if you write more data out than the target has capacity, i.e. what if you overflow the buffer all your data will eventually end up in?
- Is resuming write operations on a new stream legal, i.e. are you allowed to split a message up into chunks?
- What happens if you read to the end of your stream, but expected there to be more?
- Is continuing deserialisation on another stream valid, i.e. are you allowed to split a message up into chunks?
- What happens if you are done reading but the stream has more data to be read? What is that beyond-the-end data allowed to be?

Endian

By default bincode will serialize values in little endian encoding.

What values? You mean values of select »basic types«. You don’t really mean all values. If you were to look at values of entire structs, that’d involve reördering fields, viewing the entirety of the struct as a single to-be-byteswapped datum. I suggest just enumerating the basic types that have a specific byte order. Or if it happens to be all basic types, just saying »values of basic types« will suffice.

Basic Types

Whilst deserializing every other value [>1 for bool] will throw an error.

Adds a lot of error branches that could be avoided by demanding that true is serialised as 1, but that during deserialisation e.g. x > 0 is true. This would be a non-breaking change for the v1 wire format.

All basic numeric types will be encoded based on the configured IntEncoding.

There was no previous enumeration of what these »basic types« are, and thus there is also no previous mention of which non-numeric basic types exist.
f32 is a numeric type. According to this, f32 shall use the IntEncoding.
Is a char numeric? The Unicode Scalar Value may be a number, but what it represents is in most cases not.

All floating point types will take up exactly 4 (for f32) or 8 (for f64) bytes.

Clarify that by »floating point« you mean:

f32: IEEE 754-2008 binary32 floating-point number
f64: IEEE 754-2008 binary64 floating-point number
It is unclear how to handle NaN or subnormals. The assumption is keeping them as-is. However, there’s quiet and signalling NaNs, and one can attack software by sending signalling NaNs to crash it. So: Explicitly state that subnormals, NaNs, etc. are kept as-is, or state how these should be handled when serialising or deserialising.
See f32::from_bits

All tuples have no additional bytes, and are encoded in their specified order, e.g.

»Specified order« implies there being a separate spec on how to serialise tuples. What you want to say is: »Tuple fields are written first-to-last in declaration order, with no additional meta data.« Redundant to a new section mentioned later.

Btw., Unit, aka. () in Rust, is also a basic type worth mentioning.

Even more fun if you consider Never/Bottom, aka. ! or Infallible.

Also lacks a mention of how to encode char and what it represents and how to handle illegal chars. And similarly to bool, one could opt for having illegal Unicode Scalar Values at serialisation time being impossible and making deserialising illegal Unicode Scalar Values deserialise as the Unicode replacement character. And yeah, state that a char’s Unicode Scalar Value is encoded as u32. So there is no room for confusion with languages that have 8-bit chars (C, C++), 16-bit chars (Java, C#), or at-least-21-bit chars (the only sane ones).

Variant Encoding

Prefer using ≤ over <=.
Prefer using ¹²³⁴⁵⁶⁷⁸⁹⁰ over **.
Both, <= and ** have very different meanings from programming language to programming language.

followed by a [type] with value u

»Followed by the FixintEncoding for the value u of type [type].«

As it stands there, recursively applying VarintEncoding in an endless loop is spec-conforming.

usize is being encoded/decoded as a u64

Just an FYI that on RISC-V 128 usize is 128 bits in Rust. Ideally make the representation of usize configurable and have it be u64 by default for backwards compat.

Then, for signed integers, we first

Scratch the »then«.

Then, for signed integers, we first convert to unsigned using the zigzag algorithm, and then encode them as we do for unsigned integers generally.

Or replace the whole thing with: »Signed integers are first converted to unsigned integers of the same size using the zigzag algorithm. The unsigned result is then serialised using the VarintEncoding.

This zigzag algorithm is why it’s important to state that signed integers are encoded as 2’s complement. Further more, the zigzag algorithm as shown requires wrapping arithmetic, which is not stated anywhere.

This is the documentation for »VarintEncoding«, after all. »Generally« implies following the global config, which can be FixintEncoding.

the obvious algorithm, where we encode the cast values, gives a very large encoding for all negative values.

For some definition of obvious. Bounds-checking the negative size is a valid technique as well, cutting signed integers just as well in size, making deserialisation just a matter of sign-extension, but serialisation a wee bit more expensive. (Unless leading zero trickery is used.) The real benefit of zigzag is that the math is simple, it has few branches, and that it distributes computational overhead evenly across serialisation and deserialisation.

FixintEncoding

Fixed size integers are encoded directly

For some definition of directly. This is where you can move your endianness section to. State that you split integers up into arrays of bytes, in which order you split them up, and in which order these bytes are then written to the wire.

Enum discriminants are encoded as u32

As this reads, discriminants (change of terminology in the middle, btw.) will be »encoded directly«, for we are in the FixintEncoding section. State that: »Enum variants [or ›discriminants‹ or ›tags‹, pick one] are serialised as u32 integers using the IntEncoding.«

This now means that depending on the config they will use VarintEncoding or FixintEncoding, which is what i assume that sentence meant.

Better move this into an EnumEncoding section.

Lengths and usize are encoded as u64

Redundant. When lengths occur, just state they’re serialised as usize.

Enums are encoded with their variant first, followed by optionally the variant fields.

As this reads, a spec-conforming impl is allowed to ignore all variant fields and only write the discriminator. Better, with interchangable terminology:

Enums are encoded as their discriminants followed by any variant payload, if available.
The enum discriminant is serialised as a u32 using the IntEncoding.
- By the way, how are discriminants allocated/assigned? I.e. what about C style enums or enums with implicit discriminators?
Then serialise the associated data of that enum variant, if it exists:
- Tuple variant payloads (explain in Definitions section) use the TupleEncoding (make a section for this).
- Struct variant payloads (explain in Definitions section) use the StructEncoding (make a section for this).
- Unit variants have no payload, so only their discriminators are written.

Both named and unnamed fields are serialized with their values only, and therefor encode to the same value.

This can be scratched with the new TupleEncoding and StructEncoding sections.

TupleEncoding

State that tuple fields are serialised in first-to-last declaration order. Also note how field alignment and padding are handled.

StructEncoding

State that struct fields are serialised in first-to-last declaration order, with no meta data representing field names. Also note how field alignment and padding are handled.

Collections

Unspecified order in which collection entries should be iterated-over. Also unclear how e.g. a 2D augmented interval tree is supposed to be serialised. Or even just a binary tree. Both of which are special kinds of collections.

One major distinction is what the key type of a collection is:
- Numeric monotonically incrementing, e.g. for Vecs, linked lists, or flattened binary trees.
- Associated data, e.g. keys for HashMaps, or an axis-aligned bounding box for a 2D augmented interval tree. But also numeric indices for sparse lists.
Another is whether the collection has a sense of ordering.
- Unordered: HashMap
- Ordered: Vec, linked list,…
- Not strictly ordered on the wire, but better deserialisation performance if ordered: Flattened balanced trees, sparse lists,…

From these derive a guide on how collections »should« be implemented and where to look for how they are implemented. Write a formal spec for how bincode2 implements the standard collections.

Suggestion for a guide:

Linearly numerically indexed collections, such as arrays or linked lists, are serialised by iterating over all items in order from lowest to highest index and then serialising each item. This can be merged with the »Array« section. The only difference being whether or not to prefix the data with a length.
Other collections iterate over pairs of keys and associated values (Definitions section). Each of these pairs is then serialised by first serialising the key, then the value.
- For ordered data structures, such as sparse lists or flattened binary trees, the collections are iterated over in order from first to last key.
- For unordered data structures the iteration order is implementation-defined.

Biggest thing to spec:

»How must duplicate keys be handled?«
- Suggestion: Duplicate keys are an error, the message is invalid. This is the only not-surprising option that has no CVE potential.

All in all this section is very difficult to spec, given that it is very dependent on standard library and language feature quirks. But an implementer should know how to translate between e.g. a Rust Vec and a C++ std::vector, or between a Rust HashMap or a C++ std::unordered_map, etc.

Side note:

How would one canonically serialise a graph?

String and &str

No mention of UTF-8 and how to handle it. No mention of whether or not there is a terminating U+0000, no explanation for what »treated as« means.

»Text is serialised as a collection of bytes, using the UTF-8 encoding. There is no U+0000 NUL terminator, U+0000 is free to appear within the text. There is also no Byte Order Mark written, U+FEFF ZERO WIDTH NO-BREAK SPACE is free to appear within the text. During deserialisation, [whatever the UTF-8 error handling strategy is].«

Further more, being a data interchange format, specify how bincode2 handles Unicode Noncharacters.

There may still be a few holes, but brain’s exhausted. So there’s that, hope it helps.

bincode-org / bincode

Holes in the spec #517