Implement io::Read + io::Seek for hdf5::Reader

Rikorose commented 2 years ago

I have some encoded data stored within a hdf5::Dataset. Currently I'm using:

let ds = self.group()?.dataset("some_key")?;
let encoded = ds.read_1d()?;
let mut reader = SomeDecoderThatNeedsIORead::new(Cursor::new(encoded.as_slice().unwrap()))?;

or if I only want to read a meta data header:

let encoded = ds.read_slice_1d(s![..1024]) // I need to know the header-size before hand. This may vary.
let mut reader = SomeDecoderThatNeedsIORead::new(Cursor::new(encoded.as_slice().unwrap()))?;

It would be nice to just do:

let mut reader = SomeDecoderThatNeedsIORead::new(ds);

I guess std::io::Read should be simpler to implement, but Seek would also be nice.

In the first place, an uncompressed dataset (i.e. no gzip etc) would be sufficient since compression does not bring any benefit for already encoded data. Not sure if this makes a difference.

mulimoen commented 2 years ago

That would be very nice for 1D arrays. How would multi-dimensional arrays work? Are they expected to be read in C-order? How should Seek work if a dimension is changed in the dataset?

Rikorose commented 2 years ago

I guess it only makes sense for 1d arrays since the decoder should take care of the rest, no?

mulimoen commented 2 years ago

If you could make a prototype showing how io::Read could be implemented that would be great. Here are some more difficulties: Which datatypes do you expect to be able to work on? Is f64 serialised or invalid? Should text be Read-able?

Rikorose commented 2 years ago

io::Read works on byte level. Thus only u8 will be supported. https://doc.rust-lang.org/std/io/trait.Read.html

aldanor commented 2 years ago

Don't think this is possible:

let mut reader = SomeDecoderThatNeedsIORead::new(ds);

– because the reader needs to store state; all that the dataset (and all other HDF5 objects) stores is the HDF5 integer id and nothing else. So, best case scenario it would be something like

let mut reader = SomeDecoderThatNeedsIORead::new(ds.byte_reader());

Another thing to be careful about is threading, most ffi operations are being locked by a shared mutex, so what happens if you open two readers in two parallel threads?

mulimoen commented 2 years ago

A wrapper is indeed needed to store a pointer to the position in the dataset. I would not expect threading to be a problem, apart from it being blocking if reading from two parallel threads at once.

Rikorose commented 2 years ago

Hi all,

I made a first prototype, but I am not quite sure about the threading/mutex stuff. Do we need a lock for ptr, so ByteReader can be Send/Sync? Or do we need to add a lock to reader? I guess one could create multiple ByteReader so we don't need a mutex for ptr. My assumption was that we can have multiple readers (or references to the same reader) for the same dataset as long as there is a maximum of one writer? I don't know the hdf5 sys library threading guarantees.

// container.rs
// ...

pub struct ByteReader<'a> {
    ptr: usize,
    reader: &'a Reader<'a>,
}

impl<'a> std::io::Read for ByteReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        // TODO: Better error handling
        // TODO: How to handle threadding? Do we need a lock like:
        // let mut lock = self.ptr.lock().unwrap();
        // let ptr = lock.deref_mut();
        // TODO: Add reader method that uses the allocation provided by buf
        let tmp = self.reader.read_slice_1d(self.ptr..self.ptr + buf.len()).unwrap();
        // TODO: New reader method should not return an error slice is larger than the dataset.
        // Rather we just return the number of read bytes which may be less than buf.len()
        let n_read = tmp.len();
        self.ptr += n_read;
        buf.clone_from_slice(tmp.as_slice().unwrap());
        Ok(n_read)
    }
}

aldanor commented 2 years ago

There's tons of stuff like building HDF5 dataspaces and selections, verifying dimensionality and preallocating buffers in read_slice_1d. This will be very inefficient for a direct allocation-less byte reader. Ideally, all (most) of that stuff should go into ByteReader constructor, so that in the read function you would only have this part, and maybe a few minor things around it: https://github.com/aldanor/hdf5-rust/blob/c3bd2960090a8f10098311158786309f511ccc73/src/hl/container.rs#L57-L59
All read_* methods as they stand right now are already behind locks (i.e., the FFI calls themselves, like H5Dread). If that's acceptable performance-wise, then it can stay like this.
Most error handling (except for the read call itself) can be done in the constructor (see 1 above).

Rikorose commented 2 years ago

There's tons of stuff like building HDF5 dataspaces and selections, verifying dimensionality and preallocating buffers in read_slice_1d. This will be very inefficient for a direct allocation-less byte reader. Ideally, all (most) of that stuff should go into ByteReader constructor, so that in the read function you would only have this part, and maybe a few minor things around it:

I know it is inefficient. This is what I meant with the todo Add reader method that uses the allocation provided by buf

Thanks for your clarification and guidance.

Rikorose commented 2 years ago

Next try:

pub struct ByteReader {
    // TODO: Do I need to hold a reference to &Container?
    ptr: usize,
    obj_space: Dataspace,
    obj_id: hid_t,
    tp_id: hid_t,
    xfer: PropertyList,
}

impl ByteReader {
    pub fn new(obj: &Container) -> Result<Self> {
        ensure!(!obj.is_attr(), "ByteReader cannot be used on attribute datasets");

        let file_dtype = obj.dtype()?;
        let mem_dtype = Datatype::from_type::<u8>()?;
        file_dtype.ensure_convertible(&mem_dtype, Conversion::NoOp)?;

        let obj_space = obj.space()?;
        let (obj_id, tp_id) = (obj.id(), mem_dtype.id());
        let xfer = PropertyList::from_id(h5call!(H5Pcreate(*crate::globals::H5P_DATASET_XFER))?)?;
        if !hdf5_types::USING_H5_ALLOCATOR {
            crate::hl::plist::set_vlen_manager_libc(xfer.id())?;
        }
        Ok(ByteReader { ptr: 0, obj_space, obj_id, tp_id, xfer })
    }
}

impl std::io::Read for ByteReader {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let selection = Selection::new(self.ptr..self.ptr + buf.len());
        let out_shape = selection.out_shape(&self.obj_space.shape()).unwrap();
        let fspace = self.obj_space.select(selection).unwrap();
        let mspace = Dataspace::try_new(&out_shape).unwrap();
        match h5call!(H5Dread(
            self.obj_id,
            self.tp_id,
            mspace.id(),
            fspace.id(),
            self.xfer.id(),
            buf.as_mut_ptr() as *mut libc::c_void
        )) {
            Ok(_) => (),
            Err(e) => return Err(std::io::Error::new(std::io::ErrorKind::Other, e.to_string())),
        }
        Ok(out_shape[0])
    }
}

aldanor commented 2 years ago

Do I need to hold a reference to &Container?

I think you have to. Otherwise whoever holds the container can just drop it and you're left with an invalid object.
tp_id: hid_t,

This has to be an object as well, otherwise you have a leak (your byte reader is dropped but the dtype it allocated will stay alive forever).
All those unwraps will obviously have to be cleaned up
Container may be an attribute, not just a dataset, whereas you're assuming it's a dataset only?

Rikorose commented 2 years ago

I think you have to. Otherwise whoever holds the container can just drop it and you're left with an invalid object. This has to be an object as well, otherwise you have a leak (your byte reader is dropped but the dtype it allocated will stay alive forever).

Sure, makes sense.

All those unwraps will obviously have to be cleaned up

Yes, I will think about proper std::io::error types

Container may be an attribute, not just a dataset, whereas you're assuming it's a dataset only?

I assumed, io::Read only makes sense for dataset (which I check in the constructor). This is essentially equivalent with read_slice. I don't think an attribute can hold a byte array?

mulimoen commented 2 years ago

Do I need to hold a reference to &Container?

You can clone the container to make the lifetime easier. This would handle cleaning up the resources after use.

assumed, io::Read only makes sense for dataset (which I check in the constructor). This is essentially equivalent with read_slice. I don't think an attribute can hold a byte array?

Attributes supports byte arrays, both fixed and variable length. Unfortunately you would have to read the entire attribute at once to get any data.

aldanor / hdf5-rust

Implement io::Read + io::Seek for hdf5::Reader #193