bincode-org / bincode

A binary encoder / decoder implementation in Rust.
MIT License

`get_byte_buffer` equivalent in bincode 2? #679

Closed ruihe774 closed 10 months ago

ruihe774 commented 10 months ago

Hi. I am interested in bincode 2 and am trying to migrate from v1. One difficulty is that I cannot find an equivalent of BincodeRead::get_byte_buffer() in v2. I use it for zero-copy deserialization.

You might think that with get_byte_buffer we still need to allocate a Vec<u8> and read data into it, so there is no difference from Reader::read() in v2, which fills a caller-provided &mut [u8]. However, this is not the case when we memory-map a file and use a custom global allocator that can 1) ignore deallocations in those heap regions and free them later in a batch, or 2) do partial deallocation. In my use case, I construct a Vec<u8> directly (unsafely) from slices within the ignored heap regions and pass its ownership to the deserializer in get_byte_buffer. The Vec<u8> then ends up in a serde_bytes::ByteBuf or String in my data structures. Nothing is copied during deserialization.
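To make the difference concrete, here is a minimal std-only sketch contrasting a reader that copies into a caller-provided buffer with one that hands back a slice of the input. The traits `CopyRead` and `BorrowRead` and the type `SliceReader` are hypothetical stand-ins written for this illustration; they are not bincode's actual API.

```rust
/// Hypothetical stand-in for a reader that fills a caller-provided buffer.
pub trait CopyRead {
    fn read(&mut self, out: &mut [u8]) -> Result<(), &'static str>;
}

/// Hypothetical stand-in for a reader that lends out parts of its input.
pub trait BorrowRead<'a> {
    fn take_bytes(&mut self, len: usize) -> Result<&'a [u8], &'static str>;
}

/// Reads from an in-memory slice (for example, a memory-mapped file).
pub struct SliceReader<'a> {
    rest: &'a [u8],
}

impl<'a> SliceReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { rest: data }
    }
}

impl<'a> BorrowRead<'a> for SliceReader<'a> {
    fn take_bytes(&mut self, len: usize) -> Result<&'a [u8], &'static str> {
        // Copy the shared reference out so both halves keep the full lifetime.
        let data: &'a [u8] = self.rest;
        if data.len() < len {
            return Err("unexpected end of input");
        }
        let (head, tail) = data.split_at(len);
        self.rest = tail;
        Ok(head) // points into the original input, no copy
    }
}

impl<'a> CopyRead for SliceReader<'a> {
    fn read(&mut self, out: &mut [u8]) -> Result<(), &'static str> {
        let src = self.take_bytes(out.len())?;
        out.copy_from_slice(src); // the copy that zero-copy decoding avoids
        Ok(())
    }
}
```

With `read`, the decoded bytes always live in a second buffer. With `take_bytes`, the result aliases the input, which is the property the Vec-based trick relies on, except that here the result is a borrowed slice rather than an owned Vec<u8>.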

My use case may be tricky. However, as the allocator API is getting stabilized, we will be able to do it in a safer way.

I am wondering whether it would be better to provide a similar API in v2.

VictorKoenders commented 10 months ago

We're thinking of switching to the read_buf API (tracking issue, RFC) for this, once that is stabilized. Does that cover your use-case?

ruihe774 commented 10 months ago

> We're thinking of switching to the read_buf API (tracking issue, RFC) for this, once that is stabilized. Does that cover your use-case?

Unfortunately, no. read_buf is about avoiding buffer initialization, not about avoiding the copy. Maybe I did not describe my case clearly, so here are some code samples.

For example, I have a struct:

#[derive(Serialize, Deserialize)]
struct MyData {
    // Serde will treat Vec<u8> as an array.
    // So, we need `serde_bytes` here.
    // Or, we can use String as an example.
    #[serde(with = "serde_bytes")]
    bytes: Vec<u8>,
    // ..other fields
}

I memory-map a file into memory:

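// `map_mut` is an unsafe fn in memmap2, and `as_mut_ptr` needs a mutable binding.
let mut mmap = unsafe { memmap2::MmapMut::map_mut(&file)? };
let ptr = mmap.as_mut_ptr();
let len = mmap.len();
let reader = MyReader { ptr, len };
// Leak the mapping so it stays valid after `mmap` goes out of scope.
std::mem::forget(mmap);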

Then I implement BincodeRead for a custom reader:

struct MyReader {
    ptr: *mut u8,
    len: usize,
}

impl<'a> BincodeRead<'a> for MyReader {
    fn get_byte_buffer(&mut self, length: usize) -> bincode::Result<Vec<u8>> {
        if self.len < length {
            // ...error handling stuff
        }
        // Construct the Vec directly on top of the mapped memory.
        let vec = unsafe { Vec::from_raw_parts(self.ptr, length, length) };
        // Advance past the bytes that were just handed out.
        self.ptr = unsafe { self.ptr.add(length) };
        self.len -= length;
        Ok(vec)
    }

    // ...other methods
}

Finally,

let data: MyData = bincode::deserialize_from_custom(reader)?;

This works only with a custom global allocator that ignores deallocations in the mapped heap region; with any other allocator, dropping the Vec would corrupt the heap. data.bytes then points directly into the mapped memory: deserialization is completely copy-free, and the copy from the kernel file buffer to userspace is eliminated as well.
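For readers unfamiliar with the allocator side of this, the following is a rough sketch of what "ignoring deallocations in a region" could look like using only `std::alloc`. The type `RegionAwareAlloc` and its methods are hypothetical and written for this illustration; a real setup would also need to handle multiple regions, batch freeing, and unmapping.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator: forwards to `System`, but treats `dealloc`
/// as a no-op for pointers inside one registered address range.
pub struct RegionAwareAlloc {
    start: AtomicUsize,
    len: AtomicUsize,
}

impl RegionAwareAlloc {
    pub const fn new() -> Self {
        Self {
            start: AtomicUsize::new(0),
            len: AtomicUsize::new(0),
        }
    }

    /// Registers the range (e.g. a mapped file) whose frees are ignored.
    pub fn set_region(&self, start: *const u8, len: usize) {
        self.start.store(start as usize, Ordering::SeqCst);
        self.len.store(len, Ordering::SeqCst);
    }

    /// Returns true if `ptr` lies inside the registered range.
    pub fn owns(&self, ptr: *const u8) -> bool {
        let start = self.start.load(Ordering::SeqCst);
        let len = self.len.load(Ordering::SeqCst);
        let addr = ptr as usize;
        addr >= start && addr - start < len
    }
}

unsafe impl GlobalAlloc for RegionAwareAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if self.owns(ptr) {
            // Memory in the mapped region is released later, in one batch.
            return;
        }
        unsafe { System.dealloc(ptr, layout) }
    }
}

// To install it process-wide, one would write something like:
// #[global_allocator]
// static ALLOC: RegionAwareAlloc = RegionAwareAlloc::new();
```

Because `realloc` is not overridden, the default `GlobalAlloc::realloc` applies: growing a Vec that sits in the mapped region allocates fresh memory from `System`, copies the contents, and then calls the no-op `dealloc` on the old pointer.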

VictorKoenders commented 10 months ago

I think I understand what you want for your use case.

Unfortunately the bincode trait can't return Vec directly any more because we want to make embedded systems a first-class citizen for bincode. I don't see an easy way to add a specialized function that returns a Vec only if alloc is enabled; that sounds like it would break horribly if bincode sits somewhere in a complex dependency tree.

It would be nice if we could do something like

trait CustomAllocator {
    fn allocate_vec_in_place(...);
}

impl<T, A> bincode::Decode for Vec<T, A>
    where T: bincode::Decode,
    A: CustomAllocator
{
     // ...
}

impl<T> bincode::Decode for Vec<T, Global>
    where T: bincode::Decode
{
     // ...
}

But that

  1. sounds like it needs implementation specialization
  2. sounds very specific to your use case and not something we can implement globally.

For now I think the best solution would be to have your own Vec wrapper:

struct CustomAllocVec<T>(Vec<T, YourAllocator>);

impl<T: Decode> Decode for CustomAllocVec<T> {
    // you can do your custom logic here
}

But that would require having control over all Vecs in your dependency tree.

ruihe774 commented 10 months ago

> But that
>
>   1. sounds like it needs implementation specialization
>   2. sounds very specific to your use case and not something we can implement globally.

Yes, you're right. Maybe we could do it with something that is similar to #[serde(with="...")]. It would be nice if we could achieve proxying using bincode's encode and decode, e.g.:

#[derive(Encode, Decode)]
struct MyData {
    #[bincode(decode_with="zero_copy_decode_vec")]
    bytes: Vec<u8>,
}

That is somewhat off-topic, though. We can still do it today by implementing a custom decode for the whole struct, or with bincode::serde.
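As an illustration of the "custom decode for the whole struct" route, here is a std-only sketch in which one field is decoded by a dedicated helper, much like a decode_with function would be. Everything here is an assumption made for the example: the wire layout (a little-endian u32 length prefix followed by the bytes, then a little-endian u32 id) is made up and is not bincode's encoding, and `take`, `zero_copy_decode_bytes`, `UnexpectedEnd`, and the borrowed `MyData<'a>` are hypothetical names. The field is a borrowed slice rather than a Vec<u8>, so no special allocator is needed.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct UnexpectedEnd;

/// Splits `n` bytes off the front of the cursor without copying them.
fn take<'a>(input: &mut &'a [u8], n: usize) -> Result<&'a [u8], UnexpectedEnd> {
    let data: &'a [u8] = *input;
    if data.len() < n {
        return Err(UnexpectedEnd);
    }
    let (head, tail) = data.split_at(n);
    *input = tail;
    Ok(head)
}

/// Per-field helper: reads a u32 length prefix, then borrows that many bytes.
pub fn zero_copy_decode_bytes<'a>(input: &mut &'a [u8]) -> Result<&'a [u8], UnexpectedEnd> {
    let mut len = [0u8; 4];
    len.copy_from_slice(take(input, 4)?);
    take(input, u32::from_le_bytes(len) as usize)
}

#[derive(Debug, PartialEq)]
pub struct MyData<'a> {
    pub bytes: &'a [u8],
    pub id: u32,
}

impl<'a> MyData<'a> {
    /// Hand-written decode for the whole struct, field by field.
    pub fn decode(input: &mut &'a [u8]) -> Result<Self, UnexpectedEnd> {
        let bytes = zero_copy_decode_bytes(input)?;
        let mut id = [0u8; 4];
        id.copy_from_slice(take(input, 4)?);
        Ok(MyData {
            bytes,
            id: u32::from_le_bytes(id),
        })
    }
}
```

The small fixed-size fields are still copied into integers, but the variable-length payload is only referenced, so its cost does not grow with its size.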