Closed Rikorose closed 2 years ago
That would be very nice for 1D arrays. How would multi-dimensional arrays work? Are they expected to be read in C-order? How should Seek
work if a dimension is changed in the dataset?
I guess it only makes sense for 1d arrays since the decoder should take care of the rest, no?
If you could make a prototype showing how io::Read
could be implemented that would be great. Here are some more difficulties: Which datatypes do you expect to be able to work on? Is f64
serialised or invalid? Should text be Read
-able?
io::Read
works on byte level. Thus only u8 will be supported. https://doc.rust-lang.org/std/io/trait.Read.html
Don't think this is possible:
let mut reader = SomeDecoderThatNeedsIORead::new(ds);
– because the reader needs to store state; all that the dataset (and all other HDF5 objects) stores is the HDF5 integer id and nothing else. So, best case scenario it would be something like
let mut reader = SomeDecoderThatNeedsIORead::new(ds.byte_reader());
Another thing to be careful about is threading, most ffi operations are being locked by a shared mutex, so what happens if you open two readers in two parallel threads?
A wrapper is indeed needed to store a pointer to the position in the dataset. I would not expect threading to be a problem, apart from it being blocking if reading from two parallel threads at once.
Hi all,
I made a first prototype, but I am not quite sure about the threading/mutex stuff. Do we need a lock for ptr, so ByteReader can be Send/Sync? Or do we need to add a lock to reader? I guess one could create multiple ByteReader so we don't need a mutex for ptr. My assumption was that we can have multiple readers (or references to the same reader) for the same dataset as long as there is a maximum of one writer? I don't know the hdf5 sys library threading guarantees.
// container.rs
// ...
pub struct ByteReader<'a> {
ptr: usize,
reader: &'a Reader<'a>,
}
impl<'a> std::io::Read for ByteReader<'a> {
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
// TODO: Better error handling
// TODO: How to handle threadding? Do we need a lock like:
// let mut lock = self.ptr.lock().unwrap();
// let ptr = lock.deref_mut();
// TODO: Add reader method that uses the allocation provided by buf
let tmp = self.reader.read_slice_1d(self.ptr..self.ptr + buf.len()).unwrap();
// TODO: New reader method should not return an error slice is larger than the dataset.
// Rather we just return the number of read bytes which may be less than buf.len()
let n_read = tmp.len();
self.ptr += n_read;
buf.clone_from_slice(tmp.as_slice().unwrap());
Ok(n_read)
}
}
read_slice_1d
. This will be very inefficient for a direct allocation-less byte reader. Ideally, all (most) of that stuff should go into ByteReader
constructor, so that in the read function you would only have this part, and maybe a few minor things around it:
https://github.com/aldanor/hdf5-rust/blob/c3bd2960090a8f10098311158786309f511ccc73/src/hl/container.rs#L57-L59read_*
methods as they stand right now are already behind locks (i.e., the FFI calls themselves, like H5Dread
). If that's acceptable performance-wise, then it can stay like this.
- There's tons of stuff like building HDF5 dataspaces and selections, verifying dimensionality and preallocating buffers in
read_slice_1d
. This will be very inefficient for a direct allocation-less byte reader. Ideally, all (most) of that stuff should go intoByteReader
constructor, so that in the read function you would only have this part, and maybe a few minor things around it:
I know it is inefficient. This is what I meant with the todo Add reader method that uses the allocation provided by buf
Thanks for your clarification and guidance.
Next try:
pub struct ByteReader {
// TODO: Do I need to hold a reference to &Container?
ptr: usize,
obj_space: Dataspace,
obj_id: hid_t,
tp_id: hid_t,
xfer: PropertyList,
}
impl ByteReader {
pub fn new(obj: &Container) -> Result<Self> {
ensure!(!obj.is_attr(), "ByteReader cannot be used on attribute datasets");
let file_dtype = obj.dtype()?;
let mem_dtype = Datatype::from_type::<u8>()?;
file_dtype.ensure_convertible(&mem_dtype, Conversion::NoOp)?;
let obj_space = obj.space()?;
let (obj_id, tp_id) = (obj.id(), mem_dtype.id());
let xfer = PropertyList::from_id(h5call!(H5Pcreate(*crate::globals::H5P_DATASET_XFER))?)?;
if !hdf5_types::USING_H5_ALLOCATOR {
crate::hl::plist::set_vlen_manager_libc(xfer.id())?;
}
Ok(ByteReader { ptr: 0, obj_space, obj_id, tp_id, xfer })
}
}
impl std::io::Read for ByteReader {
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
let selection = Selection::new(self.ptr..self.ptr + buf.len());
let out_shape = selection.out_shape(&self.obj_space.shape()).unwrap();
let fspace = self.obj_space.select(selection).unwrap();
let mspace = Dataspace::try_new(&out_shape).unwrap();
match h5call!(H5Dread(
self.obj_id,
self.tp_id,
mspace.id(),
fspace.id(),
self.xfer.id(),
buf.as_mut_ptr() as *mut libc::c_void
)) {
Ok(_) => (),
Err(e) => return Err(std::io::Error::new(std::io::ErrorKind::Other, e.to_string())),
}
Ok(out_shape[0])
}
}
Do I need to hold a reference to &Container?
I think you have to. Otherwise whoever holds the container can just drop it and you're left with an invalid object.
tp_id: hid_t,
This has to be an object as well, otherwise you have a leak (your byte reader is dropped but the dtype it allocated will stay alive forever).
I think you have to. Otherwise whoever holds the container can just drop it and you're left with an invalid object. This has to be an object as well, otherwise you have a leak (your byte reader is dropped but the dtype it allocated will stay alive forever).
Sure, makes sense.
All those unwraps will obviously have to be cleaned up
Yes, I will think about proper std::io::error types
Container may be an attribute, not just a dataset, whereas you're assuming it's a dataset only?
I assumed, io::Read only makes sense for dataset (which I check in the constructor). This is essentially equivalent with read_slice
. I don't think an attribute can hold a byte array?
Do I need to hold a reference to &Container?
You can clone the container to make the lifetime easier. This would handle cleaning up the resources after use.
assumed, io::Read only makes sense for dataset (which I check in the constructor). This is essentially equivalent with read_slice. I don't think an attribute can hold a byte array?
Attributes supports byte arrays, both fixed and variable length. Unfortunately you would have to read the entire attribute at once to get any data.
I have some encoded data stored within a hdf5::Dataset. Currently I'm using:
or if I only want to read a meta data header:
It would be nice to just do:
I guess
std::io::Read
should be simpler to implement, butSeek
would also be nice.In the first place, an uncompressed dataset (i.e. no gzip etc) would be sufficient since compression does not bring any benefit for already encoded data. Not sure if this makes a difference.