P3KI / bendy

A rust library for encoding and decoding bencode with enforced cannonicalization rules.
BSD 3-Clause "New" or "Revised" License
70 stars 13 forks source link

Serde Derive Support #28

Closed casey closed 4 years ago

casey commented 4 years ago

I've been using serde-bencode, but I ran into some issues, so I'm reconsidering my choice of library.

Bendy looks great, but I'm making heavy use of serde derived serialization and deserialization, which I think Bendy doesn't do.

Is this a feature which bendy might eventually support?

thequux commented 4 years ago

Thanks for your interest!

While serde is not a high priority for our own usage, it should be fairly straightforward to implement the necessary traits, and we'd happily work with you to add serde support. I predict that it would be well under a week of work, probably only a day or two.

The main reason that it doesn't exist yet is that bencode requires that dictionaries be stored sorted by key, which serde doesn't guarantee. This means that any time a dictionary is output, it would need to be buffered, and our use case for bendy needs to work on extremely memory-constrained devices which precludes our own internal use of serde. However, because we do occasionally need the functionality, there's already support for emitting unsorted dictionaries with the necessary buffering.

casey commented 4 years ago

Thank you very much for the information!

That all sounds good, I think it's likely that I'll take a crack at it. When I inevitably run into problems with the implementation, I'll let you know :)

casey commented 4 years ago

I just started implementing Serde serialization, and there's a bit of an impedance mismatch between Serde's trait-based serialization and the bendy's callback-based encoding for compound objects, for example sequences/lists.

When serializing a sequence, serde first calls Serializer::serialize_seq, which should return an object that implements serde::ser::SerializeSeq. SerializeSeq::serialize_element(&mut self, value: &impl Serialize) is then called on this object with each value to serialize, and then finally SerializeSeq::end(self).

This is somewhat at odds with bendy's Encoder::emit_list. Since emit_list requires a callback which, when called, must pass all values to be serialized in sequence to an encoder, all values must be ready before emit_list is called. However, Serializer::serialize_seq does not provide the values up front.

I don't have any particularly elegant ideas for how to resolve this. A decidedly non-elegant possibility is to add something like bendy::encoding::Encoder::begin_list() -> ListEncoder, which returns a new type ListEncoder, upon which emit(value) can be called to emit values, in addition to end(), which consumes the ListEncoder and emits the closing e. Between Encoder::begin_list and ListEncoder::end, doing anything with the original Encoder would be verboten and a runtime error. However, this doesn't have the nice static guarantees of the original callback based-system, and somewhat complicates the interface.

What are your thoughts?

thequux commented 4 years ago

Yeah, those two interfaces do seem to be incompatible :-(

My first reaction is to have a parallel set of encoding structs for Serde that use Encoder::emit_token internally. This will still give you Bendy's structure checking, but it will give you free rein to export an interface compatible with Serde without marring the safe bendy-native interface.

That said, it looks like you're working in a separate crate, so you don't have access to the emit_token function. The best option that I can see is to instead add your serde support to bendy itself, enabled by a feature flag.

If you are committed to working in a separate crate, then the next best option is to add a type called TokenEncoder to Bendy that exposes the emit_token function while enforcing that it is used to emit a valid structure. This would leave you responsible for sorting dictionaries, but you can use the code in UnsortedDictEncoder as an example for that; the code is fairly straightforward. If this is something you'd like, I should be able to have it for you by Monday evening.

If at all possible, though, I don't want to expose a ListEncoder like you described from the statically safe interface that already exists. Aside from adding an attractive nuisance (begin_list seems like the obvious method to write a list, but it would be the one that's easier to misuse), it would also add a new way to misuse the API that is totally avoidable. (Right now, there are only three ways to get an error out of bendy's encoder: return it yourself, nest the structure too deep, or do nothing with a SingleItemEncoder)

casey commented 4 years ago

Thanks for the feedback!

Using emit_token sounds like a good option. I don't have any particular reason to work in a separate crate, so I'll switch to adding serde support directly to bendy under a feature flag.

casey commented 4 years ago

I ran into an interesting issue replacing serde_bencode in a personal project with my in-progress fork of bendy with serde suport.

serde_bencodes serializes None as the empty string, and Some(x) as the bencode for x. This is leveraged in struct serialization and deserialization to skip optional fields with None values. So Struct {field: None} serializes as de, and Struct{field: Some(1)} serializes as d5:fieldi1ee.

This is a bit odd, but it does mean that a user can derive Serialize and Deserialize on a struct with optional fields, and the derived implementations will be compatible with bittorrent metainfo, a common use for bencode, which skips optional fields:

#[derive(Serialize, Deserialize)]
struct Metainfo {
  announce: String, // non optional announce URL
  comment: Option<String>, // optional comment field
  // more fields...
}

I happen to be using serde_bencode to serialize and deserialize bittorrent metadata, so this made serde_bencode very convenient.

However, it's a bit weird. Some(()) and None both serialize to the empty string. (Unit is also represented with the empty string.). A Vec<Option<T>> can round-trip to a value with fewer values, since the None values will be skipped when serializing and so won't be present when deserializing. I think this also causes some of the other issues I've had with serde_bencode.

All this made me reconsider how best to map bencode to and from Serde's data model. I can think of a few ways to move forward:

  1. Replicate serde_bencode's behavior
  2. Only support native bencode types
  3. Support all serde data model types

Replicate serde_bencode's behavior

Pros:

Cons:

Only support native bencode types

This would make Serialize and Deserialize panic on types without native bencode encodings, e.g. Option, (), f32, et al.

Pros:

Cons:

Support all serde data model types

Use a reasonable encoding for all serde data model types, using the empty list as a nil value. Some examples:

Pros:

Cons:

Summary

The first option is convenient, but I think producing invalid bencode for some types is a non starter, and I don't like the idea that certain values don't round-trip.

Between the last two options, I'm personally leaning towards the latter, coming up with lossless mappings for all serde data model types.

What do you think?

casey commented 4 years ago

I just opened #31 with an initial implementation with support only for bencode-native types. I figured that it would be good to keep the initial PR manageable.

Stuff for future PR(s):

casey commented 4 years ago

I just opened PR #32, which extends the initial serde support to allow serializing and deserializing from all types in the serde data model.

casey commented 4 years ago

@thequux / @0ndorio, thank you very much for your help and reviews!

The existing encoder and decoder caught potential bugs multiple times by refusing to emit and consume invalid bencode, which made implementing this much easier and more fun than it would have been otherwise.