enarx / ciborium

CBOR utilities
Other
256 stars 58 forks source link

Exposing Serializer in the public API #18

Open trepetti opened 3 years ago

trepetti commented 3 years ago

Hi all,

I saw that Nathaniel mentioned the possibility of exposing the Serializer in the public API in https://github.com/pyfisch/cbor/issues/179#issuecomment-738930258. I was wondering what steps or contributions you would be looking for getting it into a state where that would be a possibility. Not sure how many other people share my use-case, but I have been using Serializer::serialize_seq() for outputting logs to disk (with some intervening compression) with serde_json and was curious about moving to CBOR. I know there is more involved than just dropping pub in front of https://github.com/enarx/ciborium/blob/06cafa7130be35b4c8b56d170d11b73b9b858ce5/ciborium/src/ser/mod.rs#L16 I can help by coming up with doc strings for rustdoc to put it on par with serde_json's treatment of Serializer if that would help.

Let me know what you think and thanks for all your work on this crate.

-Tom

npmccallum commented 3 years ago

@trepetti Basically, I find my current approach to the different types to be not something I want to stabilize yet. I see two approaches:

  1. We can clean up the Serializer and Deserializer types and expose them as stable.

  2. We can add a feature flag like unstableapis which exposes them. This would allows us to get the types public ASAP but without committing to any stability guarantee.

trepetti commented 3 years ago

@npmccallum Thank you for laying out these options. For the first option, would making an API that focuses on analogy to serde_json be a direction you would be interested in going in? I implemented a minimally modified, proof of concept public Serializer that still keeps the encoding provided courtesy of ciborium-ll inaccessible from the top-level Serializer again by analogy to serde_json. My other goal was to make sure that all the signatures in serde trait implementations match serde 1:1 as written (even if it doesn't affect the implementation in practice): https://github.com/trepetti/ciborium/commit/1c01b99647be3ea4f112458eee537d5ff301ee06

If this isn't the direction you wanted to go in towards a stable API, I completely understand. And regardless, I think putting it initially behind a feature flag would be prudent (have not done that in this POC, though). It looks like there is no way to have the CollectionSerializer of ciborium as pub(crate), but it appears that serde_json had this same issue with their corresponding Compound type and just opted for #[doc(hidden)] as well.

Let me know what you think. I would implement the corresponding Deserializer API changes and open a draft PR based on your feedback.

quininer commented 3 years ago

erased_serde also depend on Serializer and Deserializer, exposing it would be useful.

trepetti commented 3 years ago

In looking at the Deserializer side of things, I definitely get why Enarx would not want to make assumptions about how/where scratch is allocated for embedded scenarios, although it does present a challenge for implementing Deserializer::{new, from_bytes, from_str, from_reader} in exactly the same way as serde_json. I see a few ways to approach this:

  1. We could maintain the flexibility in the choice of allocation strategy by modifying these functions with an extra argument for scratch, which I think would be the cleanest option.

  2. Alternatively, we could define

    enum Scratch {
    Preallocated(&'b mut [u8]),
    Dynamic(Box<[u8]>),
    }

    and modify everywhere that references scratch accordingly to match on the enum.

  3. We could convert scratch to a Vec as serde_json does since de/mod.rs is already pulling in std::alloc::Box and std::alloc::String anyway. We could default to a 4 kiB with_capacity starting out to reduce runtime allocation overhead.

Let me know what you think. Since Serializer doesn't need any additional memory beyond a few conversions to String, we can keep it 1:1 with serde_json as in the draft above, but for Deserializer there is that large amount of scratch memory needed, so it is worth thinking through what you'd want it to end up looking like in terms of the allocation strategy.

npmccallum commented 3 years ago

@trepetti Apologies for the lack of response. I was on vacation. :)

I'm fine with the first approach too. The scratch argument definitely needs to be carefully documented.

trepetti commented 3 years ago

@npmccallum Hope you had a good vacation! One last pair of questions before I put together a PR: if we expose scratch, is it also worth exposing the recursion limit, too? For either or both we could have pub consts DEFAULT_SCRATCH_SIZE and DEFAULT_RECURSION_LIMIT exported, if you think that makes sense, based on the values used in de::from_reader(). That way in addition to explaining the purposes of both arguments in the documentation, users could stick with sensible defaults without having to think about it too much if they don't want to.

npmccallum commented 3 years ago

@trepetti We should probably just work through these details in the PR review.

jonasbb commented 3 years ago

A couple of serde helper libraries also require the Serializer/Deserializer types to be exposed:

  1. We can clean up the Serializer and Deserializer types and expose them as stable.

What would be required to clean them up? I did not see that yet in the thread.

npmccallum commented 3 years ago
  1. We can clean up the Serializer and Deserializer types and expose them as stable.

What would be required to clean them up? I did not see that yet in the thread.

They currently leverage some internal generic types which are intended to promote code sharing, but probably aren't a good public interface. These would have to be reworked to something more reasonable.

jonasbb commented 3 years ago

Thanks for the info. The internal generic types could still remain private and only the Serializer and Deserializer types made public. I imagine something like this: https://github.com/jonasbb/ciborium/commit/4324dc3791fa638ada4c6d059b42e7d98aca34e1

For the Serializer the only constraint is to also expose CollectionSerializer.

For the Deserializer the scratch space "leaks" because of the lifetime. You can abstract it away and provide a constructor for a Deserializer<'static, R>, without requiring the user to know anything about the scratch space. I made a Scratch struct, which abstracts over borrowed or owned scratch space. This is like the 2. option mentioned by @trepetti. The 1. option can always be added by providing new construction functions for Deserializer.

trepetti commented 3 years ago

I apologize for dropping the ball on this.

I have been using this initial attempt at exposing Serializer and have been using SerializeSeq for streaming serialization in one of my own projects successfully, but it needed more synthetic tests done in-tree before opening it as a PR: https://github.com/trepetti/ciborium/commits/public-deserializer-and-serializer. I can take a break from some prose writing tonight to add those tests tonight and possibly break it into a separate PR for the simpler Serializer case, if that would be better. I modified some of the naming conventions in the existing unexposed traits to make side-by-side comparison with serde and serde_json easier but can also revert those if you want a cleaner commit history.

On the deserialization side, this branch is a reasonably well fleshed-out attempt at what going with option 1 for Deserializer might look like: forcing any scratch space to be preallocated and forcing a recursion limit to be known ahead-of-time in the constructor, which are possibly nice to be explicit about in the resource constrained use case.I expose some sensible defaults for these as constants in the top-level API as de::DEFAULT_SCRATCH_SIZE and de::DEFAULT_RECURSION_LIMIT. Again, I will take a stab at finishing some more tests tonight.

In going further to flesh out the equivalent of StreamDeserializer to allow feature parity with serde_json, I ran into the fact that there would need to be some changes in ciborium-ll in order to support the equivalent of serde_json's Deserializer::end(). We would have to expose part of the decoder in the ciborium-ll to report the impending end of an I/O stream. This is where it gets tricky because I am a security novice and wasn't sure how much would/could/should be exposed in a potential interface to the inner workings of ciborium-ll or if this is even a valid concern in the first place. The larger project is sensitive about this, so I thought before moving forward with StreamDeserializer and modifying ciborium-ll, it would be best to submit a PR omitting Deserializer::end() and consequently Deserializer::into_iter() and StreamDeserializer to flesh it out further at a later time if that is something there is also interest in: https://github.com/trepetti/ciborium/blob/dffbe59a2afc0ae67715be45b12dd52a9ce5779e/ciborium/src/de/mod.rs#L653-L659

jonasbb commented 3 years ago

@trepetti I'm not sure where the best place to post feedback is, so just place it here to keep the discussion in one place. Your changes already look quite complete. I didn't spot any obvious problems in them. I do have two comments:

  1. Given that ciborium::de::from_reader exists without taking any scratch or recursion arguments, I think it might be worthwhile to have a ciborium::de::Deserializer::??? which also does not require those. To achieve that 2. might need to be implemented.
  2. Types which do not borrow are often more convenient to use, for example when returning them from a function or storing somewhere. This would require dropping the scratch requirement or having the option of an owned type like enum Scratch above.
pickfire commented 2 years ago

I am also looking into using the public Serializer and Deserializer for https://github.com/pickfire/babelfish which uses serde-transcode given that serde_cbor is now unmaintained, can't really do a switch from serde_cbor since it does not have Deserializer::from_reader.

dominikwerder commented 1 year ago

erased_serde depends on having access to the Serializer