divergentdave opened 7 months ago
Some thoughts on implementation: we could change encoded_len() to be a required method, returning usize or u64, in order to avoid having to buffer entire vectors before writing their length prefix. On the decoding end, we could use std::io::Read::take() and pass the Take struct to the next decoder. This would eliminate CodecError::LengthPrefixTooBig, and as a result we'd use CodecError::Io with std::io::ErrorKind::UnexpectedEof uniformly for all short reads.
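For illustration, here is a minimal sketch (not the existing codec API) of how a length-prefixed vector could be decoded from a Read using take(), assuming a 4-byte length prefix and a caller-supplied decode_item closure:

```rust
use std::io::{self, Read};

// Sketch only: decode a length-prefixed vector from a `Read`, bounding the
// inner decoder with `Read::take()`. The u32 prefix and `decode_item`
// callback are assumptions for illustration, not the crate's API.
fn decode_length_prefixed<R: Read, T>(
    reader: &mut R,
    decode_item: impl Fn(&mut dyn Read) -> io::Result<T>,
) -> io::Result<Vec<T>> {
    // Read the length prefix (here, a 4-byte big-endian byte count).
    let mut prefix = [0u8; 4];
    reader.read_exact(&mut prefix)?;
    let len = u64::from(u32::from_be_bytes(prefix));

    // Bound the reader so inner decoders cannot read past the prefixed length;
    // any short read inside `decode_item` surfaces as ErrorKind::UnexpectedEof.
    let mut bounded = reader.take(len);
    let mut items = Vec::new();
    while bounded.limit() > 0 {
        items.push(decode_item(&mut bounded)?);
    }
    Ok(items)
}
```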
(I don't know if this is the right place to pile on, so feel free to redirect me.) While working on new features for the codec traits, I suggest we consider what would be needed, if anything, to make an implementation of DAP zero-copy. Concretely, the AggregationJobInitReq is potentially large (MBs, say), and it's wasteful to copy each ReportInit into its own buffer before processing it.
Depending on how you rules-lawyer "zero-copy", I don't think we can truly do zero-copy decoding, because we always transform field elements into Montgomery representation before working with them, and thus couldn't use a &'a [u8] slice from a message as part of an input to arithmetic operations. Streamed decoding is also at odds with zero-copy decoding, because things like the zerocopy crate expect the entire message to be slurped into one big buffer.
With all that said, I think we currently achieve "no unnecessary copies" pretty well, and the same could be done using Read. Once the entire message is read into a buffer, the existing Decode helpers split up the slice, construct multiple Cursors over subslices, and the same buffer gets used all the way down to leaf Decode impls. At that point, the implementation copies or decodes from the input buffer to the decoded type. If we switched to using Read instead of Cursor for the input, that would eliminate the first big buffer. Read impls would write directly to stack-allocated buffers in leaf Decode impls. However, library consumers may want to replace the message-wide buffer with a fixed-size buffer in std::io::BufReader if that improves performance. I think that question comes down to the Read impl they use, and how expensive many small reads are.
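As a rough sketch of that direction, a leaf decoder could read straight from a Read into a stack-allocated array. The Timestamp type here is hypothetical, and this is not the library's actual Decode trait:

```rust
use std::io::{self, Read};

// Hypothetical leaf type used only for illustration.
struct Timestamp(u64);

impl Timestamp {
    // Reads directly into a stack-allocated buffer, so no message-wide
    // buffer is required; a short read surfaces as ErrorKind::UnexpectedEof.
    fn decode<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut buf = [0u8; 8];
        reader.read_exact(&mut buf)?;
        Ok(Timestamp(u64::from_be_bytes(buf)))
    }
}
```

A consumer whose Read impl makes small reads expensive (a socket, say) could wrap it in std::io::BufReader::new(...) before handing it to the decoders.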
You're right that we can't make the system truly zero-copy, but there are probably some heavy hitters we can address. Take HPKE decryption of report shares: a naive implementation would copy each report share into its own buffer, decrypt, and dump the plaintext in another buffer. Ideally we would have a ciphertext of type &mut [u8] and decrypt in place (i.e., the plaintext ends up in a subslice).
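To make the shape of that concrete, here is a sketch that uses AES-GCM via the aead crate's in-place trait purely as a stand-in; the real code would go through whatever in-place API the HPKE library offers, and none of this is this crate's API:

```rust
use aes_gcm::{
    aead::{AeadInPlace, Error},
    Aes128Gcm, Nonce, Tag,
};

// Illustrative only: decrypt a report share in place, so the plaintext ends
// up in the same &mut [u8] buffer with no extra copy.
fn open_report_share(
    cipher: &Aes128Gcm,
    nonce: &[u8; 12],
    tag: &[u8; 16],
    share: &mut [u8], // ciphertext on input, plaintext in the same buffer on success
) -> Result<(), Error> {
    cipher.decrypt_in_place_detached(
        Nonce::from_slice(nonce),
        b"", // associated data
        share,
        Tag::from_slice(tag),
    )
}
```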
Going all the way back to #166, here was the argument for not doing Read/Write impls at the time:
We might have defined these traits generically in terms of std::io::{Read, Write}, but then those generic parameters would also have to appear all the way up in traits vdaf::{Vdaf, Aggregator}. It also makes more sense given our use case: the TLS wire encoding isn't well suited to streaming encoded objects into a W: Write because variable-length objects are encoded as length || object. That means you need to know the length of an object before you can write it somewhere, making it more natural to decode from byte slices and encode into byte vectors. We expect that messages will never be terribly big, meaning it shouldn't be unreasonable to encode them into a buffer on the heap before writing them to a network connection or other byte sink. If it works for a production quality TLS implementation, it should work for us. Finally, if we only ever encode into in-memory buffers, then we can make Encode::encode infallible.
I think that's still true. However, I do like the ergonomics of using std::io::{Read, Write}, so I'm happy to see how this turns out.
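For readers less familiar with the TLS wire format, a tiny sketch of the length || object constraint the quoted argument refers to (the function name is illustrative, not the crate's Encode API):

```rust
// TLS-style `opaque object<0..2^16-1>`: a 2-byte length prefix followed by
// the object's bytes, so the length must be known before anything is written.
fn encode_opaque_u16(out: &mut Vec<u8>, object: &[u8]) {
    let len = u16::try_from(object.len()).expect("object too large for a u16 prefix");
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(object);
}
```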
I think a zero-copy implementation without Read would need the underlying memory layout of the type to match what we get off the wire. Where it doesn't match, we need to transform at point of use (...assuming that doesn't count as a copy...), and maybe apply some tricks for performance like caching the result of the transform.
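One way "transform at point of use, and cache the result" could look, sketched with hypothetical names (LazyFieldElem and to_montgomery are stand-ins, not real libprio-rs types):

```rust
use std::cell::OnceCell;

// Sketch: keep the wire bytes as-is and lazily compute (and cache) the
// Montgomery-form value the first time arithmetic needs it.
struct LazyFieldElem {
    wire: [u8; 16],
    cached: OnceCell<u128>,
}

impl LazyFieldElem {
    fn value(&self) -> u128 {
        *self.cached.get_or_init(|| to_montgomery(&self.wire))
    }
}

// Placeholder for the real conversion into Montgomery representation.
fn to_montgomery(wire: &[u8; 16]) -> u128 {
    u128::from_le_bytes(*wire)
}
```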
If we use Read, we read into a relatively small, fixed-size buffer while parsing, but once we parse the bytes on the wire into the in-memory structures, we won't need to copy them again; we can move them or pass references as needed. I am not sure if that counts as zero-copy. :)
BTW, what traits are we considering using, concretely? I think std::io::{Read, Write} won't work because they are not async. There are a number of crates offering "async read" traits (e.g. tokio's AsyncRead, futures' AsyncRead), but they might tie us too tightly to one particular implementation library -- I suppose the higher-level application would have to use the same traits, or provide adapters.
edit: see https://github.com/nrc/portable-interoperable/issues/5 for the current state of standard "async read/write" traits -- tl;dr is that it's not done yet
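As a rough idea of what a tokio-flavored leaf decoder might look like (assuming tokio's AsyncRead/AsyncReadExt; purely illustrative, not a proposal for this crate's trait):

```rust
use tokio::io::{AsyncRead, AsyncReadExt};

// Sketch only: an async leaf decoder over tokio's AsyncRead. Any higher-level
// application would have to use the same I/O traits or provide adapters.
async fn decode_u64<R: AsyncRead + Unpin>(reader: &mut R) -> std::io::Result<u64> {
    let mut buf = [0u8; 8];
    reader.read_exact(&mut buf).await?;
    Ok(u64::from_be_bytes(buf))
}
```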
My opinion is that we shouldn't try to design the Decode trait to accommodate in-place decryption. This would require using &mut [u8] slices throughout, though only a couple of message types (defined outside this library) would perform in-place modifications. Plus, it would block decoding other types of messages in a streaming fashion, because in-place decryption requires us to provide messages in contiguous buffers (at least, assuming we're using the hpke crate).
Instead, I think it would make sense for DAP aggregator implementations to use a separate trait or separate open-coded decoding routines for top-level messages (relying on split_at_mut()), perform the in-place decryption on the HpkeCiphertext payload, and only then start calling into Decode impls.
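A sketch of that layering, with hypothetical helper names (none of these functions exist in libprio-rs; decrypt_in_place stands in for whatever the HPKE library provides):

```rust
// Split the raw top-level message ourselves, decrypt the HpkeCiphertext
// payload in place, and only then hand sub-slices to Decode-style parsing.
fn handle_report_share(buf: &mut [u8], header_len: usize) -> Result<(), Box<dyn std::error::Error>> {
    let (header, payload) = buf.split_at_mut(header_len);
    decode_header(header)?;     // ordinary parsing of the unencrypted fields
    decrypt_in_place(payload)?; // hypothetical in-place HPKE open
    decode_plaintext(payload)?; // Decode impls over the decrypted bytes
    Ok(())
}

// Hypothetical stand-ins so the sketch is self-contained.
fn decode_header(_bytes: &[u8]) -> Result<(), Box<dyn std::error::Error>> { Ok(()) }
fn decrypt_in_place(_bytes: &mut [u8]) -> Result<(), Box<dyn std::error::Error>> { Ok(()) }
fn decode_plaintext(_bytes: &[u8]) -> Result<(), Box<dyn std::error::Error>> { Ok(()) }
```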
Originally posted by @branlwyd in https://github.com/divviup/libprio-rs/issues/675#issuecomment-1682664520