Canonical implementation limits?

adzialocha commented 2 years ago

I remember a long walk in the park where you mentioned the limits of serde and the std library to implement canonical format support in this crate 🙂

What part of the canonical specification can not be realized and why?

AljoschaMeyer commented 2 years ago

Huh, it has been a while, let's see what I can reconstruct...

Suppose you want deserialization of canonic array. The specification requires that "arrays that can be encoded as strings must be encoded as strings". So our deserializer needs to track with every array item that reads whether that the array so far could have been encoded as a string instead. After the last item has been read, if every single item was a u8, it should emit an error.

In the code, this would happen here. We read a value and assign to the variable inner. How can you find out whether that value could have been (or was) encoded as a u8? The generics work against us here, and the serde data model doesn't have a trait to help us with that question.

Additionally, we have the constraint that "maps that can be encoded as set must be encoded as sets", which has similar problems (although it might actually be possible to detect nil in some hacky way); also serde has no conception of sets in the first place, which may have also contributed problems.

Checking whether the entries of a map are fed to the serializer in ascending order may also have been problematic, but that might've been solvable.

A straightforward solution (for your particular use case and in general) would be to define a "simple canonic encoding" that forbids encoding data as strings or sets rather than requiring it. That will lead to larger codes (in particular, every byte above 27 in a stream suddenly needs two bytes to encode), but the implementation become significantly easier.

I think the specification should definitely offer that simplified canonic format, and leave the choice of format to the user. But I want to mull this over a bit before adding this (I have declared the specification as stable, so I could never remove it again).

adzialocha commented 2 years ago

Thank you for this summary!

A straightforward solution (for your particular use case and in general) would be to define a "simple canonic encoding" that forbids encoding data as strings or sets rather than requiring it. That will lead to larger codes (in particular, every byte above 27 in a stream suddenly needs two bytes to encode), but the implementation become significantly easier.

This sounds interesting. So far I can't think of any use case in p2panda where these rules would come into use (we will use maps and our data can not be expressed as strings), having larger codes is not the largest issue I think.

AljoschaMeyer commented 2 years ago

we will use maps

But some human might produce a map that happens to map all keys to nil, and your decoder then must reject if they encoded as a map rather than using the shorter set encoding. Remember that sets do not exist on the logical data level, "set" is simply a shorthand for "map that maps everything to nil", and the encodings just happen to have special support for that kind of map.

Same argument with strings, any array that happens to only consist of integers between zero (inclusively) and 256 (exclusively) can use a special encoding, and must do so in the (efficient) canonic format.

adzialocha commented 2 years ago

But some human might produce a map that happens to map all keys to nil, and your decoder then must reject if they encoded as a map rather than using the shorter set encoding. Remember that sets do not exist on the logical data level, "set" is simply a shorthand for "map that maps everything to nil", and the encodings just happen to have special support for that kind of map.

Yes, though, currently we don't support nil values in p2panda.

Same argument with strings, any array that happens to only consist of integers between zero (inclusively) and 256 (exclusively) can use a special encoding, and must do so in the (efficient) canonic format.

We have arrays of strings, but then I don't know how that would be reflected in vv with a separator between the array fields?

adzialocha commented 2 years ago

FYI: This is a nice CDDL overview of p2panda operations: https://github.com/p2panda/p2panda/blob/main/p2panda-rs/src/cddl/definitions.rs

AljoschaMeyer commented 2 years ago

We have arrays of strings, but then I don't know how that would be reflected in vv with a separator between the array fields?

It doesn't matter what "you have". Some user-defined schema might involve an array of integers. If I supply a value for that, and my array includes the number 1234, I will encode it as an array and all is good. Now suppose my array happened to be [0, 16] instead, and I encoded that as an array. A parser for the efficient canonic encoding has to reject that, because it would have to be encoded as the string consisting of the byte 0 followed by the byte 16 rather than as an array.

I just realized another big problem with canonic serde-based serialization: serde does not guarantee that it serializes struct fields in lexicographic key order, so many Serialize would generate non-canonic codes.

adzialocha commented 2 years ago

It doesn't matter what "you have". Some user-defined schema might involve an array of integers.

Thats still the case for user-defined schemas, as we don't have a array type (as for now at least, maybe later something like an array CRDT). The only arrays we have are holding strings to represent something we call "relations", "pinned relations" or lists of them.

AljoschaMeyer / valuable-value-rs

Canonical implementation limits? #1