divergentdave opened 7 months ago
Some thoughts on implementation: we could change encoded_len() to be a required method, returning usize or u64, in order to avoid having to buffer entire vectors before writing their length prefix. On the decoding end, we could use std::io::Read::take() and pass the Take struct to the next decoder. This would eliminate CodecError::LengthPrefixTooBig, and as a result we'd use CodecError::Io with std::io::ErrorKind::UnexpectedEof uniformly for all short reads.
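For illustration, here is a minimal sketch (not the existing codec API) of how a length-prefixed vector could be decoded from a Read using take(), assuming a 4-byte length prefix and a caller-supplied decode_item closure:

```rust
use std::io::{self, Read};

// Sketch only: decode a length-prefixed vector from a `Read`, bounding the
// inner decoder with `Read::take()`. The u32 prefix and `decode_item`
// callback are assumptions for illustration, not the crate's API.
fn decode_length_prefixed<R: Read, T>(
    reader: &mut R,
    decode_item: impl Fn(&mut dyn Read) -> io::Result<T>,
) -> io::Result<Vec<T>> {
    // Read the length prefix (here, a 4-byte big-endian byte count).
    let mut prefix = [0u8; 4];
    reader.read_exact(&mut prefix)?;
    let len = u64::from(u32::from_be_bytes(prefix));

    // Bound the reader so inner decoders cannot read past the prefixed length;
    // any short read inside `decode_item` surfaces as ErrorKind::UnexpectedEof.
    let mut bounded = reader.take(len);
    let mut items = Vec::new();
    while bounded.limit() > 0 {
        items.push(decode_item(&mut bounded)?);
    }
    Ok(items)
}
```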
(I don't know if this is the right place to pile on, so feel free to redirect me.) While working on new features for the codec traits, I suggest we consider what would be needed, if anything, to make an implementation of DAP zero-copy. Concretely, the AggregationJobInitReq is potentially large (MBs, say), and it's wasteful to copy each ReportInit into its own buffer before processing it.
Depending on how you rules-lawyer "zero-copy", I don't think we can truly do zero-copy decoding, because we always transform field elements into Montgomery representation before working with them, and thus couldn't use a &'a [u8] slice from a message as part of an input to arithmetic operations. Streamed decoding is also at odds with zero-copy decoding, because things like the zerocopy crate expect the entire message to be slurped into one big buffer.
With all that said, I think we currently achieve "no unnecessary copies" pretty well, and the same could be done using Read. Once the entire message is read into a buffer, the existing Decode helpers split up the slice, construct multiple Cursors over subslices, and the same buffer gets used all the way down to leaf Decode impls. At that point, the implementation copies or decodes from the input buffer to the decoded type. If we switched to using Read instead of Cursor for the input, that would eliminate the first big buffer. Read impls would write directly to stack-allocated buffers in leaf Decode impls. However, library consumers may want to replace the message-wide buffer with a fixed-size buffer in std::io::BufReader if that improves performance. I think that question comes down to the Read impl they use, and how expensive many small reads are.
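As a rough sketch of that direction, a leaf decoder could read straight from a Read into a stack-allocated array. The Timestamp type here is hypothetical, and this is not the library's actual Decode trait:

```rust
use std::io::{self, Read};

// Hypothetical leaf type used only for illustration.
struct Timestamp(u64);

impl Timestamp {
    // Reads directly into a stack-allocated buffer, so no message-wide
    // buffer is required; a short read surfaces as ErrorKind::UnexpectedEof.
    fn decode<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut buf = [0u8; 8];
        reader.read_exact(&mut buf)?;
        Ok(Timestamp(u64::from_be_bytes(buf)))
    }
}
```

A consumer whose Read impl makes small reads expensive (a socket, say) could wrap it in std::io::BufReader::new(...) before handing it to the decoders.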
You're right that we can't make the system truly zero-copy, but there are probably some heavy hitters we can address. Take HPKE decryption of report shares: a naive implementation would copy each report share into its own buffer, decrypt, and dump the plaintext in another buffer. Ideally we would have a ciphertext of type &mut [u8] and decrypt in place (i.e., the plaintext ends up in a subslice).
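To make the shape of that concrete, here is a sketch that uses AES-GCM via the aead crate's in-place trait purely as a stand-in; the real code would go through whatever in-place API the HPKE library offers, and none of this is this crate's API:

```rust
use aes_gcm::{
    aead::{AeadInPlace, Error},
    Aes128Gcm, Nonce, Tag,
};

// Illustrative only: decrypt a report share in place, so the plaintext ends
// up in the same &mut [u8] buffer with no extra copy.
fn open_report_share(
    cipher: &Aes128Gcm,
    nonce: &[u8; 12],
    tag: &[u8; 16],
    share: &mut [u8], // ciphertext on input, plaintext in the same buffer on success
) -> Result<(), Error> {
    cipher.decrypt_in_place_detached(
        Nonce::from_slice(nonce),
        b"", // associated data
        share,
        Tag::from_slice(tag),
    )
}
```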
Going all the way back to #166, here was the argument for not doing Read/Write impls at the time:
We might have defined these traits generically in terms of std::io::{Read, Write}, but then those generic parameters would also have to appear all the way up in traits vdaf::{Vdaf, Aggregator}. It also makes more sense given our use case: the TLS wire encoding isn't well suited to streaming encoded objects into a W: Write because variable-length objects are encoded as length || object. That means you need to know the length of an object before you can write it somewhere, making it more natural to decode from byte slices and encode into byte vectors. We expect that messages will never be terribly big, meaning it shouldn't be unreasonable to encode them into a buffer on the heap before writing them to a network connection or other byte sink. If it works for a production quality TLS implementation, it should work for us. Finally, if we only ever encode into in-memory buffers, then we can make Encode::encode infallible.
I think that's still true. However, I do like the ergonomics of using std::io::{Read, Write}, so I'm happy to see how this turns out.
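For readers less familiar with the TLS wire format, a tiny sketch of the length || object constraint the quoted argument refers to (the function name is illustrative, not the crate's Encode API):

```rust
// TLS-style `opaque object<0..2^16-1>`: a 2-byte length prefix followed by
// the object's bytes, so the length must be known before anything is written.
fn encode_opaque_u16(out: &mut Vec<u8>, object: &[u8]) {
    let len = u16::try_from(object.len()).expect("object too large for a u16 prefix");
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(object);
}
```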
I think a zero-copy implementation without Read would need the underlying memory layout of the type to match what we get off the wire. Where it doesn't match, we need to transform at point of use (...assuming that doesn't count as a copy...), and maybe apply some tricks for performance like caching the result of the transform.
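One way "transform at point of use, and cache the result" could look, sketched with hypothetical names (LazyFieldElem and to_montgomery are stand-ins, not real libprio-rs types):

```rust
use std::cell::OnceCell;

// Sketch: keep the wire bytes as-is and lazily compute (and cache) the
// Montgomery-form value the first time arithmetic needs it.
struct LazyFieldElem {
    wire: [u8; 16],
    cached: OnceCell<u128>,
}

impl LazyFieldElem {
    fn value(&self) -> u128 {
        *self.cached.get_or_init(|| to_montgomery(&self.wire))
    }
}

// Placeholder for the real conversion into Montgomery representation.
fn to_montgomery(wire: &[u8; 16]) -> u128 {
    u128::from_le_bytes(*wire)
}
```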
If we use Read, we read into a relatively small, fixed-size buffer while parsing, but once we parse the bytes on the wire into the in-memory structures, we won't need to copy them again; we can move them or pass references as needed. I am not sure if that counts as zero-copy. :)
BTW, what traits are we considering using, concretely? I think std::io::{Read, Write} won't work because they are not async. There are a number of crates offering "async read" traits (e.g. tokio's AsyncRead, futures' AsyncRead), but they might tie us too tightly to one particular implementation library -- I suppose the higher-level application would have to use the same traits, or provide adapters.
edit: see https://github.com/nrc/portable-interoperable/issues/5 for the current state of standard "async read/write" traits -- tl;dr is that it's not done yet
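As a rough idea of what a tokio-flavored leaf decoder might look like (assuming tokio's AsyncRead/AsyncReadExt; purely illustrative, not a proposal for this crate's trait):

```rust
use tokio::io::{AsyncRead, AsyncReadExt};

// Sketch only: an async leaf decoder over tokio's AsyncRead. Any higher-level
// application would have to use the same I/O traits or provide adapters.
async fn decode_u64<R: AsyncRead + Unpin>(reader: &mut R) -> std::io::Result<u64> {
    let mut buf = [0u8; 8];
    reader.read_exact(&mut buf).await?;
    Ok(u64::from_be_bytes(buf))
}
```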
My opinion is that we shouldn't try to design the Decode trait to accommodate in-place decryption. This would require using &mut [u8] slices throughout, though only a couple of message types (defined outside this library) would perform in-place modifications. Plus, it would block decoding other types of messages in a streaming fashion, because in-place decryption requires us to provide messages in contiguous buffers (at least, assuming we're using the hpke crate).
Instead, I think it would make sense for DAP aggregator implementations to use a separate trait or separate open-coded decoding routines for top-level messages (relying on split_at_mut()), perform the in-place decryption on the HpkeCiphertext payload, and only then start calling into Decode impls.
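A sketch of that layering, with hypothetical helper names (none of these functions exist in libprio-rs; decrypt_in_place stands in for whatever the HPKE library provides):

```rust
// Split the raw top-level message ourselves, decrypt the HpkeCiphertext
// payload in place, and only then hand sub-slices to Decode-style parsing.
fn handle_report_share(buf: &mut [u8], header_len: usize) -> Result<(), Box<dyn std::error::Error>> {
    let (header, payload) = buf.split_at_mut(header_len);
    decode_header(header)?;     // ordinary parsing of the unencrypted fields
    decrypt_in_place(payload)?; // hypothetical in-place HPKE open
    decode_plaintext(payload)?; // Decode impls over the decrypted bytes
    Ok(())
}

// Hypothetical stand-ins so the sketch is self-contained.
fn decode_header(_bytes: &[u8]) -> Result<(), Box<dyn std::error::Error>> { Ok(()) }
fn decrypt_in_place(_bytes: &mut [u8]) -> Result<(), Box<dyn std::error::Error>> { Ok(()) }
fn decode_plaintext(_bytes: &[u8]) -> Result<(), Box<dyn std::error::Error>> { Ok(()) }
```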
Originally posted by @branlwyd in https://github.com/divviup/libprio-rs/issues/675#issuecomment-1682664520