aldanor / hdf5-rust

HDF5 for Rust
https://docs.rs/hdf5
Apache License 2.0

Reading / writing #14

Closed ghost closed 5 years ago

ghost commented 7 years ago

This may be a stupid question, but I see the Dataset.read method being used in your example code in issues like #9 and I can't actually find documentation, or source code, for it. :)

I guess examples of use would be very helpful, as in #9, but I thought at least I could find the code to help me understand things! I have lots of questions about what kinds of structured types can be stored in datasets, whether datasets can be iterated to avoid loading them all into memory, etc.; happy to answer these questions myself if I know where to look.

Thanks for what appears to be a very complete and powerful library!

Enet4 commented 7 years ago

I was once interested in this library for exactly that purpose. It happens that the code for reading from an existing dataset is in a separate branch, and is currently not documented. It might also be incomplete for certain data types.

Maybe the owner could use some help. We might be able to join forces around this functionality and document the working parts for future users.

aldanor commented 7 years ago

@Enet4 @cathalgarvey Hi guys!

@Enet4 is correct, the type mapping is in a separate branch (types). I've rewritten it at least 3 times and it's pretty much complete -- it supports all built-in types, tuple types, fixed-size array types, and all 4 kinds of strings (fixed/varlen, ASCII/UTF-8 -- this already makes it support more atomic types than e.g. pytables / h5py), plus it uses procedural macros to bind structs and nested structs, like so: #[derive(H5Type)]. The last time I had free time was this spring -- which is when I managed to complete the type system. The next window will be this fall / end of summer -- I could try to implement the next batch of features to make this usable.

This way, most of the super-low-level stuff that deals with types is already done. The next piece is the Dataset API. I have a few implementations but I don't like any of them too much, for different reasons. One is Dataset<T: H5Type> where the element type is a type parameter. However, this way you lose the ability to inspect the dataset header without caring about its type (e.g. check size / shape / ndim / filters and other stuff not depending on the type). The other way is having two types for datasets and a common trait, but that way things get pretty complicated type-wise very fast. Another way is having a MaybeH5Type trait and using Dataset<T: MaybeH5Type>, which is also quite ugly, etc...
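To illustrate the first option's drawback with a self-contained stub (the names here are made up, not the crate's API): once the element type is a parameter of the handle, even type-independent queries like shape() can't be called without committing to some T first.

```rust
use std::marker::PhantomData;

trait H5Type {}
impl H5Type for i32 {}

struct Dataset<T: H5Type> {
    id: i64, // stand-in for the raw HDF5 handle
    _marker: PhantomData<T>,
}

impl<T: H5Type> Dataset<T> {
    // Type-independent metadata, yet unreachable without picking some T first.
    fn shape(&self) -> Vec<usize> {
        let _ = self.id; // a real impl would query the dataspace via the handle
        vec![]
    }
}

fn main() {
    // We must spell out Dataset<i32> even though shape() doesn't depend on the type.
    let ds: Dataset<i32> = Dataset { id: 1, _marker: PhantomData };
    println!("{:?}", ds.shape());
}
```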

The ideal thing would be to get rid of all traits altogether (e.g. Container, Location) and use specializations instead, so you have things like type Dataset = Object<DatasetID>, type UntypedDataset = Object<UntypedDatasetID>. I started doing work on that, but after it was done I stumbled upon this unfortunate rustdoc bug https://github.com/rust-lang/rust/issues/32077 which made all the docs disappear. Very sad, since that would definitely be the best solution. I don't have sufficient knowledge of Rust internals to fix it upstream; the best thing I can think of is opening a bounty on Bountysource :)

Yes, I could definitely use some help -- I have no problem implementing things, and I know the HDF5 core C API very well; most of the pain is in decision making.

Enet4 commented 7 years ago

Hmm, so let's think...

I have a few implementations but I don't like any of them too much, for different reasons. One is Dataset<T: H5Type> where the element type is a type parameter. However, this way you lose the ability to inspect the dataset header without caring about its type (e.g. check size / shape / ndim / filters and other stuff not depending on the type).

I would imagine having an implementation of H5Type on a new type with dynamic type fetching (DynValueType or something). Is there something ruling out this possibility right now?

Another way is having a MaybeH5Type trait and using Dataset<T: MaybeH5Type>, etc...

But yes, I suppose if the above can't be done, a more relaxed trait could be used.

The ideal thing would be to get rid of all traits altogether (e.g. Container, Location) and use specializations instead; so you have things like type Dataset = Object<DatasetID>, type UntypedDataset = Object<UntypedDatasetID>. I've started doing work on that but then stumbled upon this unfortunate rustdoc bug rust-lang/rust#32077 which made all docs disappear. Very sad since that would definitely be the best solution.

I don't know much about how the Object API currently works, so I'll just ask for some clarifications here. Are you suggesting some kind of interface inheritance based on the parameter type? Is this preferable over using supertraits (e.g. trait Dataset: Object { ... })?

aldanor commented 7 years ago

I would imagine having an implementation of H5Type on a new type with dynamic type fetching (DynValueType or something). Is there something ruling out this possibility right now?

I'm not sure I follow, would you care to expand?

I don't know much about how the Object API currently works, so I'll just ask for some clarifications here. Are you suggesting some kind of interface inheritance based on the parameter type? Is this preferable over using supertraits (e.g. trait Dataset: Object { ... })?

Basically yes, interface inheritance based on the parameter type. Currently, there are a bunch of traits like Object, Location, Container, etc. (loosely following the class hierarchy of the HDF5 C++ API), sort of like this:

// hid_t is the raw integer handle type from the -sys bindings
trait ID { fn id(&self) -> hid_t; }
trait FromID { /* fallible construction from an existing id */ }

// note that these traits are "dummy", they don't have any methods to implement
trait Object: ID {}
trait Location: Object {}
trait Container: Location {}

struct Group { id: hid_t }
impl ID for Group { fn id(&self) -> hid_t { self.id } }
impl Object for Group {}
impl Location for Group {}
impl Container for Group {}

struct PropertyList { id: hid_t }
impl ID for PropertyList { fn id(&self) -> hid_t { self.id } }
impl Object for PropertyList {}
There are a few problems with a trait-based inheritance here:

  1. All HDF5 objects have the same data in them -- it's just the integer ID of the object. But because there's no data inheritance in Rust, each and every concrete object type (e.g. Group) has to store the ID and implement some trait (ID) that extracts it and another trait (FromID) that tries to create the object from an existing ID. It's quite a lot of copy/paste and duplication, definitely more than I'd like -- for an example, see e.g. this.

  2. Downstream code has to bring all these traits into the user's namespace, which is far from ideal and makes the API hard to explore (unlike the C++ or Python APIs). For instance,

    let group = file.group("foo")?;
    println!("{}", group.name());  // ~ERROR: must bring hdf5_rs::Location into scope
  3. Requires creating "intermediate" traits -- for example, in the case of datasets, there must be concrete types for typed / untyped datasets, hence there must be an AnyDataset trait (for lack of a better name), because that's the only way we can share the functionality between the interfaces.

There are definitely more problems, but these are the ones that come to mind first. This just feels like misusing Rust's traits (given that most of them have empty interfaces anyway, they're just "dummy" markers).

What I wanted to do is inheritance based on a type parameter so that some of the problems listed above go away, as shown here: https://github.com/rust-lang/rust/issues/32077#issuecomment-301944600. And it kind of worked, if it weren't for the rustdoc woes.
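For illustration, here's a rough self-contained sketch of that idea (all names are made up, not the actual crate API): shared methods live on Object<M> once, marker-specific impls add the "derived" functionality, and downstream code needs no traits in scope.

```rust
use std::marker::PhantomData;

#[allow(non_camel_case_types)]
type hid_t = i64;

// Marker types play the role of the "ID" type parameter.
struct GroupID;
struct DatasetID;

struct Object<M> {
    id: hid_t,
    _marker: PhantomData<M>,
}

// Functionality shared by every kind of object lives here once,
// with no trait to import on the user side.
impl<M> Object<M> {
    fn id(&self) -> hid_t { self.id }
}

// Marker-specific impls add the "derived" functionality.
impl Object<GroupID> {
    fn dataset(&self, _name: &str) -> Object<DatasetID> {
        Object { id: -1, _marker: PhantomData } // placeholder, no real HDF5 call
    }
}

type Group = Object<GroupID>;
type Dataset = Object<DatasetID>;

fn main() {
    let group = Group { id: 1, _marker: PhantomData };
    let ds: Dataset = group.dataset("foo");
    println!("group={} dataset={}", group.id(), ds.id());
}
```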

aldanor commented 7 years ago

// I've opened a bounty on bountysource for the rustdoc issue, who knows, maybe someone will get to fixing it :)

thibaultbarbie commented 5 years ago

I am also interested in an example of how to read a dataset. My HDF5 dataset is really simple: just some groups with matrices inside. I am still a Rust beginner and could not find how to do it just by reading the documentation. Could you please provide a minimal example?

Enet4 commented 5 years ago

As described here and in a few other issues (#9, #17, #18), dataset reading/writing is yet to be exposed at the high-level abstraction. I noticed that @aldanor has been working on the crate recently, so I hope to see this become available soon-ish. :crossed_fingers:

aldanor commented 5 years ago

It's funny to find out someone's still tracking this :) Yep, after a long pause I've resumed work on the crate (I need it too and I need it working soon, myself!), you can see the current work in https://github.com/aldanor/hdf5-rs/tree/feature/2018. The crate now fully supports HDF5 1.10.0 and 1.10.1, too, on all platforms.

As a matter of fact, I already have a local branch working that can read N-dimensional datasets into both ndarray and Vec; I'm now working on finalizing the API, adding the writing part, etc. There's tons of work left (e.g. type conversion logic), but I do expect to publish something working in the next few months or so.

The type system is already done, including procedural macros with #[derive] (examples in tests).

The object system discussed earlier in this thread is done too, via Deref (so that documentation kind of works); all traits are gone now.

I understand everyone's eager to have something they can use, but so am I, and I'm trying to juggle development on this crate with personal life on a daily basis as you can observe from the commit history...

pmarks commented 5 years ago

Hi @aldanor. We (@10xGenomics) are quite interested in Rust HDF5 support. We'd be happy to test / contribute to high-level Read/Write implementations whenever you're ready to share your branch, even if it's in a raw form. The new branch looks great so far!

aldanor commented 5 years ago

Hi folks, here's a topic for some brainstorming and bikeshedding, apologies in advance for a huge wall of text :) If you are interested in speeding this up, any ideas are definitely welcome.

I'm currently trying to figure out the most ergonomic / least annoying, yet type-safe, way of representing reads and writes. I have code that successfully reads n-dimensional arrays, but it needs to be wrapped nicely. For simplicity, let's say we're just talking about full-dataset/attribute reads and writes here.

Here's a few facts:

  1. There's a trait (H5Type) for all Rust types that can be noop-converted both ways to HDF5 types. This includes all atomic types, all four types of HDF5 strings including variable-length, fixed array types, variable-length array types, plus user-defined compound types via #[derive(H5Type)].
  2. There's a notion of a stored HDF5 type and a memory HDF5 type. Stored is what's in the HDF5 file; memory is what ends up in your memory. If stored and memory types are the same, there's no conversion (noop conversion).
  3. Given two types (source/destination), there are 4 different options: there's no conversion path; no-op conversion is available (same types); hard conversion (compiler conversion); soft conversion (can be user-extended). For reads: source=stored, destination=memory; for writes: source=memory, destination=stored.
  4. Unlike Python, where we can just call dataset.read() and it returns an array of the proper type (so a noop conversion is used automatically because stored=memory), in Rust we have to specify a memory type manually.
  5. The HDF5 library will not throw any errors if a non-noop conversion path is used; it will just convert elements automatically. However, the only zero-cost way of reading/writing is via the noop conversion path (which requires memory=stored). It would be nice to keep it as a default and provide an easy way of raising an error that warns of a non-trivial conversion (unless the user explicitly requests it). An example of such a conversion: if you read an int64 dataset into int8 memory storage, by default the HDF5 library will just silently clip all values to between -128 and 127 -- I'd rather it complain loudly. Actually, I bet many (most?) HDF5 users won't be able to tell you what exactly happens under the hood if you read an int64 dataset into an int8, or a struct with an int64 field into a similar struct with an int8 field. (See the sketch after this list.)
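To make points 3 and 5 concrete, here's a tiny self-contained sketch of how the conversion classification could be modeled and checked (the names are assumptions, not necessarily what the crate will use):

```rust
// Ordered from cheapest to most permissive: derive(PartialOrd) uses declaration order.
#[derive(Debug, PartialEq, PartialOrd)]
enum Conversion {
    NoOp, // memory type == stored type, zero-cost
    Hard, // built-in ("compiler") conversion
    Soft, // user-extendable conversion
}

/// `allowed` is the most expensive conversion the caller opted into;
/// `path` is the conversion that would actually happen (None = no path at all).
fn check_conversion(allowed: Conversion, path: Option<Conversion>) -> Result<(), String> {
    match path {
        None => Err("no conversion path between stored and memory types".to_string()),
        Some(p) if p <= allowed => Ok(()),
        Some(p) => Err(format!("{:?} conversion needed, but only {:?} was allowed", p, allowed)),
    }
}

fn main() {
    // The default (no-op only) policy rejects any non-trivial conversion.
    assert!(check_conversion(Conversion::NoOp, Some(Conversion::Hard)).is_err());
    // Explicitly opting into soft conversions accepts hard ones as well.
    assert!(check_conversion(Conversion::Soft, Some(Conversion::Hard)).is_ok());
    assert!(check_conversion(Conversion::Hard, None).is_err());
}
```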

How do we turn this into an API?

In the current temporary prototype, I have something like this (all Array symbols are from the ndarray crate):

// both Attribute and Dataset dereference to Container
// Container dereferences into Location
impl Container {
    pub fn read_vec<T: H5Type>(&self) -> Result<Vec<T>> { ... }
    pub fn read_1d<T: H5Type>(&self) -> Result<Array1<T>> { ... }
    pub fn read_2d<T: H5Type>(&self) -> Result<Array2<T>> { ... }
    pub fn read_scalar<T: H5Type>(&self) -> Result<T> { ... }
    pub fn read_dyn<T: H5Type>(&self) -> Result<ArrayD<T>> { ... }
    pub fn read_arr<T: H5Type, D: Dimension>(&self) -> Result<Array<T, D>> { ... }
    // also all write_*() methods
}

When reading, it would check that the conversion path stored->memory is noop (that is, it would create an HDF5 datatype from T, then load the stored datatype, and figure out the conversion path between the two).

There are a few drawbacks here:

... which leads to a possibility of having a TypedContainer<T> alongside the untyped container:

impl Container {
    pub fn read_vec<T: H5Type>(&self) -> Result<Vec<T>> { ... }
    pub fn of_type<T: H5Type>(&self) -> Result<TypedContainer<T>> { ... } // noop
    pub fn cast_soft<T: H5Type>(&self) -> Result<TypedContainer<T>> { ... }
    // etc
}

impl<T: H5Type> TypedContainer<T> {
    pub fn read_vec(&self) -> Result<Vec<T>> { ... }
    // etc
}

... but here's one problem: if TypedContainer handles reads and writes, a single conversion path is not sufficient anymore. Basically, because reads are storage->memory and writes are memory->storage, there do exist types for which conversion paths are not symmetric (e.g. soft one way and non-existent the other way). So, you would need something like cast_read_soft() and cast_write_soft() which already starts getting somewhat unwieldy (and btw there does exist an enum for conversion paths covering all three cases).

Then I was thinking, maybe it could be done builder-style? E.g. hypothetically, something like...

let _ = dataset.read::<T>()?.vec()?; // read() returns ReadContainer
let _ = dataset.read_soft::<T>()?.arr_2d()?;

As for writes, until trait specialization lands, we can't provide blanket impls for all H5Type-covered types, so for now it would have to be something like

let _ = dataset.write::<T>()?.scalar(&x)?; // write() returns WriteContainer
let _ = dataset.write_soft::<T>()?.arr(&arr)?;
let _ = dataset.write_hard::<T>()?.slice(&vec)?;

(It might be somewhat confusing that .read() doesn't actually read anything though, and .write() doesn't actually write anything)

For writes, there's a ton of other options like compression etc.; need to think of an ergonomic way to support that (which is currently supported only in the DatasetBuilder). At the very least, it should be possible to reshape the dataset on the fly -- e.g. if you provide the data in a slice/Vec but want to store it as a 2-D array, maybe something like

let _ = dataset.write::<T>()?.shape((10, 20)).slice(&vec)?;

Last thing to note, the methods above could coexist with their alternative forms, e.g. you could have all of these

let _ = group.read_vec::<T>("foo")?;
let _ = group.dataset("foo")?.read_vec::<T>()?;
let _ = group.dataset("foo")?.read::<T>()?.vec()?;

Something like that...

Anyways, there's tons of ways to approach this, but we need to pick one :)

aldanor commented 5 years ago

Actually, pinging @bluss too -- hope he doesn't mind :) (since we're planning to use ndarray as the core backend here anyway)

pmarks commented 5 years ago

Cool, thanks for the background!

As for writes, until trait specialization lands, we can't provide blanket impls for all H5Type-covered types, so for now it would have to be something like

let _ = dataset.write::<T>()?.scalar(&x)?; // write() returns WriteContainer
let _ = dataset.write_soft::<T>()?.arr(&arr)?;
let _ = dataset.write_hard::<T>()?.slice(&vec)?;

So to clarify, the proposal is that dataset.write::<T>() would return Ok(WriteContainer<T>) if there's a noop path from storage->T, and Err() otherwise. Whereas dataset.write_hard::<T>() would return Ok(WriteContainer<T>) if there was a noop or hard conversion available, but would return Err(<invalid conversion>) if only a soft conversion was available? I really like the typed reader / writer idea, and it doesn't seem to hurt ergonomics much to split TypedContainer into ReadContainer and WriteContainer. I'd even be in favor of dropping the read/write methods on Container to keep things simple. In my world you're generally either reading or writing a given dataset. Interleaving reads and writes is rare.

(It might be somewhat confusing that .read() doesn't actually read anything though, and .write() doesn't actually write anything)

One simple change would be to call the method dataset.get_writer(),

For writes, there's a ton of other options like compression etc.; need to think of an ergonomic way to support that (which is currently supported only in the DatasetBuilder). At the very least, it should be possible to reshape the dataset on the fly -- e.g. if you provide the data in a slice/Vec but want to store it as a 2-D array, maybe something like

let _ = dataset.write::<T>()?.shape((10, 20)).slice(&vec)?;

I really like the current DatasetBuilder setup. Options like compression, chunking, resize-ability need to be specified at dataset creation time, right? I don't see how they can be added to the WriteContainer. (e.g. http://docs.h5py.org/en/stable/high/dataset.html#creating-datasets). I'm most interested in the simple matched-dimensionality read/write cases. Would establishing that API force decisions that would impact a future 'reshape-on-the-fly' API?

aldanor commented 5 years ago

@pmarks Thanks for the response. I'll try to address your points below.

So to clarify, the proposal is that dataset.write::<T>() would return Ok(WriteContainer<T>) if there's a noop path from storage->T, and Err() otherwise. Whereas dataset.write_hard::<T>() would return Ok(WriteContainer<T>) if there was a noop or hard conversion available, but would return Err(<invalid conversion>) if only a soft conversion was available?

Yes, that is correct (i.e., "noop"=noop, "hard"=noop|hard, "soft"=noop|hard|soft). It could also be possible to provide a way to pass conversion as an enum value (since the enum type exists anyway), like dataset.write_conv::<T>(Conversion::Soft), but I don't see much use for it anyway unless you're writing some generic crate around this crate -- since you pretty much always know what you need in advance, e.g. noop/hard/soft.

I'd even be in favor of dropping the read/write methods on Container to keep things simple. In my world you're generally either reading or writing a given dataset. Interleaving reads and writes is rare.

Come to think of it, that's my experience too. In my many years of dealing with HDF5 in all possible languages, contexts and platforms, I can't think of a case where I had both a read and a write on neighboring lines of code. It's typically either one or the other.

One simple change would be to call the method dataset.get_writer(),

This might be one way, yeah, hmm... Definitely heading into bikeshedding territory, but then I think you would need to add "read"/"write" prefixes back to its methods, like "dataset.get_reader::<T>()?.read_arr()?" (so it reads more or less like an English sentence; kind of like the currently proposed "dataset ... read ... array"). But then you have "read" twice and so it ends up being more verbose :) Need to ponder a bit on this.

Options like compression, chunking, resize-ability need to be specified at dataset creation time, right?

Correct. Post-creation the only two classes of options you can configure are dataset-access and data-transfer plists. Note also that it should be possible to augment the DatasetBuilder with write functionality (kind of like in h5py you can provide data=), so instead of just saying create(), you could end it with write() and it would instantly provide you with a typed write container (always assuming no-op conversion here, I guess, since you've just passed it a Rust type to instantiate it). Like: DatasetBuilder::new::<T>().chunk_auto().write().slice(&vec) or something of the sort.

I'm most interested in the simple matched-dimensionality read/write cases.

Me too. But folks have already been asking above re: "whether datasets can be iterated to avoid loading them all into memory" etc.; I'd expect people to have very different use cases, so it'd be nice for the implementation to be flexible enough to suit most of them.

Enet4 commented 5 years ago

The HDF5 library will not throw any errors if a non-noop conversion path is used; it will just convert elements automatically. However, the only zero-cost way of reading/writing is via the noop conversion path (which requires memory=stored). It would be nice to keep it as a default and provide an easy way of raising an error that warns of a non-trivial conversion (unless the user explicitly requests it). An example of such a conversion: if you read an int64 dataset into int8 memory storage, by default the HDF5 library will just silently clip all values to between -128 and 127 -- I'd rather it complain loudly. Actually, I bet many (most?) HDF5 users won't be able to tell you what exactly happens under the hood if you read an int64 dataset into an int8, or a struct with an int64 field into a similar struct with an int8 field.

As the owner of another multidimensional-array-oriented Rust library (nifti), I faced a similar situation when it comes to converting data elements from the original "stored" data type to a Rust "memory" type. Its current stance is that elements are automatically converted in a way that reduces precision errors as much as possible (there's a scalar affine transformation in-between), but ultimately, the user should know what to expect ahead of time: if they tried reading an i8 out of an i64, this is perceived as a design problem on their side, and right now it's neither prevented nor warned about (not implying that this is ideal either :disappointed: ). Unlike HDF5, however, NIfTI-1 does not feature as many data types, and most of them can be numerically cast from one to the other (i.e. no string or compound types other than complex numbers and RGB).

(It might be somewhat confusing that .read() doesn't actually read anything though, and .write() doesn't actually write anything)

We may interpret this as an adaptation of the receiver for reading or writing purposes. If we see this as a separate construct for reading and writing, then what's been suggested by @pmarks makes sense, but it is more conventional (C-GETTER) to name these methods .reader() and .writer() (no get_ prefix). On the other hand, we may also see this as a temporary, almost-cheap conversion of the dataset abstraction to a specific view that is focused on either reading or writing (C-CONV), in which case we'd name them as_reader() and as_writer(). I'm afraid that we cannot avoid the stuttering in data.as_reader().read() unless we provide other interesting methods in the reader/writer types (perhaps even slicing?).

The same naming conventions can be applied to the other ideas above (e.g. as_reader_soft() or as_soft_reader(), etc.).

pmarks commented 5 years ago

...Note also that it should be possible to augment the DatasetBuilder with write functionality (kind of like in h5py you can provide data=), so instead of just saying create(), you could end it with write() and it would instantly provide you with a typed write container (always assuming no-op conversion here, I guess, since you've just passed it a Rust type to instantiate it). Like: DatasetBuilder::new::<T>().chunk_auto().write().slice(&vec) or something of the sort.

So like this?

impl DatasetBuilder<T> {
     pub fn finish(self) -> Dataset {}
     pub fn write(self) -> WriteContainer<T> {} // Does WriteContainer just hold the Dataset hid directly, or does it need to hold a reference to a Container?
     pub fn read(self) -> ReadContainer<T> {} // Not sure if this makes sense.
}

One question is: what type owns the H5D dataset_id? Presumably we want to call H5Dclose from the Drop of that type. The easiest thing to reason about is that Dataset owns the dataset_id and sets the lifetime of the H5 object. If that's the case, the WriteContainer needs to either:

  1. have a reference to Dataset, and have a compatible lifetime. This feels the most 'Rust-y' to me.
  2. own the Dataset (i.e. you consume the dataset to create WriteContainer)
  3. hold an Arc<Dataset>

Option 1 is incompatible with the above signature of DatasetBuilder::write, which makes me a bit skeptical of that 'short circuit' method.
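Purely for illustration (stand-in types, not a proposed API), option 1 might look like the following; since the writer borrows the dataset, a consuming DatasetBuilder::write(self) couldn't hand one back by value without keeping the Dataset alive somewhere:

```rust
use std::marker::PhantomData;

struct Dataset {
    id: i64, // would own the H5D handle and call H5Dclose in Drop
}

struct WriteContainer<'a, T> {
    dataset: &'a Dataset,
    _marker: PhantomData<T>,
}

impl Dataset {
    fn as_writer<T>(&self) -> WriteContainer<'_, T> {
        WriteContainer { dataset: self, _marker: PhantomData }
    }
}

fn main() {
    let ds = Dataset { id: 1 };
    let writer = ds.as_writer::<i32>(); // the writer cannot outlive `ds`
    println!("writing through dataset {}", writer.dataset.id);
}
```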

I'm most interested in the simple matched-dimensionality read/write cases.

Me too. But folks have already been asking above re: "whether datasets can be iterated to avoid loading them all into memory" etc.; I'd expect people to have very different use cases, so it'd be nice for the implementation to be flexible enough to suit most of them.

Sorry, I wasn't clear: partial loading of data slices is important to us. Changing the number of dimensions on the fly (i.e. reshaping) feels like it could be deferred, as long as decisions about the simpler API don't close off options for a reshaping interface.

We'll be happy to test / iterate on designs when you're ready to share an initial cut!

aldanor commented 5 years ago

On the other hand, we may also see this as a temporary, almost-cheap conversion of the dataset abstraction to a specific view that is focused on either reading or writing (C-CONV), in which case we'd name them as_reader() and as_writer(). I'm afraid that we cannot avoid the stuttering in data.as_reader().read() unless we provide other interesting methods in the reader/writer types (perhaps even slicing?).

Agreed. Reader/Writer is quite common terminology in stdlib as well.

pub fn read(self) -> ReadContainer<T> {} // Not sure if this makes sense.

I don't think that read() makes any sense here since the dataset hasn't been created yet.

WriteContainer either needs to:

  • have a reference to Dataset, and have a compatible lifetime. This feels the most 'Rust-y' to me.
  • own the Dataset (i.e. you consume the dataset to create WriteContainer)
  • hold an Arc<Dataset>

Well, all the low-level ID stuff has already been implemented: each object holds an Arc<RwLock<hid_t>>, plus there's a mutex-protected registry of all open handles. As such, all high-level objects can be safely arc-cloned.
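A minimal illustration of that handle-sharing scheme (simplified -- the mutex-protected registry is omitted here):

```rust
use std::sync::{Arc, RwLock};

#[allow(non_camel_case_types)]
type hid_t = i64;

// Every high-level object shares the raw id behind an Arc<RwLock<...>>,
// so cloning a Dataset is cheap and refers to the same underlying HDF5 object.
#[derive(Clone)]
struct Handle(Arc<RwLock<hid_t>>);

#[derive(Clone)]
struct Dataset {
    handle: Handle,
}

fn main() {
    let ds = Dataset { handle: Handle(Arc::new(RwLock::new(42))) };
    let ds2 = ds.clone(); // arc-clone: both point at the same id
    println!("{} {}", *ds.handle.0.read().unwrap(), *ds2.handle.0.read().unwrap());
}
```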

Options:


While writing all this, I've just realized: do we really need the writer to be typed? Or do we need it at all? (If we do, we are basically not making use of Rust's type inference at all when we could be.) So, instead of writing

dataset.write_hard::<i32>()?.write_slice(&vec)?;

Why not just...

dataset.as_writer_hard().write_slice(&vec)?;

Or

dataset.write_slice_hard(&vec)?;

(etc)

With reads... I've just checked, and it looks like in the simplest cases type inference works, i.e. you can do stuff like this:

let data: Vec<i32> = dataset.as_reader()?.read_vec()?;

which is pretty neat.

Whether as_reader() should return the reader directly (so that errors, including conversion path errors, could only occur when actually reading), and whether the reader is needed at all (see above) is an open question ofc. E.g., given that the most common (and the safest) use case is to use no-op casting, it might be nice to just have a shortcut that also works with type inference, like

let data: Vec<i32> = dataset.read_vec()?;

(and for hard/soft conversions it would be much more verbose, but it will likely be used much more rarely)


Slightly off-topic here, but in regard to H5Dclose(), H5Gclose(), H5Fclose() and all that: I don't really see the point in using any of them if we track the IDs properly (correct me if I'm wrong). This requires reading some of the libhdf5 source:

As a minor side detail, I'd probably be inclined to add a Dataset::build() which would return the builder -- simply so that there are fewer symbols to import, and so that all dataset-related docs are in one place (the same applies to other builders).

Another minor note: if we even keep them at all, ContainerReader and ContainerWriter would probably be better names. Or just Reader and Writer in a proper module.


All in all, thanks for responses, this helps a lot :)

aldanor commented 5 years ago

Ok guys, reporting on some Christmas progress 😄

Enet4 commented 5 years ago

Since you could say that the library is in a somewhat usable state now (i.e. you can bind types, read and write stuff), it would be nice if someone were to early-test it (the feature/2018 branch) and report any inconveniences.

OK, I grabbed one of my own hdf5 files and managed to read it in Rust. :tada: A bit of feedback:

aldanor commented 5 years ago

@Enet4 Very nice, thanks! Will try to address your points below.


A traditional mode string to describe the mode of operation over a file (e.g. "r+") may seem intuitive, but it's not something I would usually find in Rust. Would it be too problematic to keep open as a single-argument method for read-only access, like in std::fs::File? After all, a file builder is already available when a more specific mode is needed.

IIRC this was pretty much copied from h5py directly without giving it too much thought. I guess it would be nice to have some of the API resemble h5py by not introducing anything too cryptic, so it would be an easier transition for HDF5 users. Here's the h5py implementation.

That being said, I already have a bullet point on my todo on reworking File constructor, e.g. enums and builders instead of strings (same goes for driver modes), to be more Rust-like; also maybe splitting open/create options, etc.

So, currently it's like this (not saying it's ideal):

let f = h5::File::open("foo.h5", "r")?;  // Read-only, file must exist
let f = h5::File::open("foo.h5", "r+")?; // Read/write, file must exist
let f = h5::File::open("foo.h5", "w")?;  // Create file, truncate if exists
let f = h5::File::open("foo.h5", "w-")?; // Create file, fail if exists (or "x")
let f = h5::File::open("foo.h5", "a")?;  // Read/write if exists, create otherwise

If we change it to be exactly like fs::File and OpenOptions as-is, though, it would be:

let f = h5::File::open("foo.h5")?;
let f = h5::File::with_options().write(true).open("foo.h5")?;
let f = h5::File::with_options().create(true).write(true).truncate(true).open("foo.h5")?;
let f = h5::File::with_options().create_new(true).write(true).open("foo.h5")?;
let f = h5::File::with_options().create(true).write(true).append(true).open("foo.h5")?;

which hurt my eyes a bit TBH :crying_cat_face: The first option becomes nice, but everything else is super ugly.

I guess you could try simplifying the latter a bit given that there's a limited set of options here, and there's no concept of write-only in HDF5 so read is always implied. Also all options except "r" are writable, out of which all except one create a new file.

Maybe just providing File::with_options().mode("w+").open("...").

Maybe self-explanatory aliases like this? (open/create ctors exist in fs::File as well)

Maybe both.


A crate without root documentation leaves a strong negative signal, worse in a library of this complexity. I already knew that I had to create a file descriptor, but I didn't have much of a clue on what to do next without looking deeper into the docs. Hopefully it will have at least one example before a release is made.

Not gonna argue here of course :) (If/when this whole thing gets stable, I was planning to start a little mdbook on how to use this, with more general concerns like layouts and thread-safety, things not to do etc; along with some examples -- it's pretty hard to fit stuff like that into docstrings).


The C-DEREF guideline advises against implementing Deref for anything other than smart pointers. I wonder if the thorough use of dereferencing in this crate will affect usability.

Yep, I know... any other solutions welcome, really, but I'd been banging my head against the wall for a long time before switching to a deref-based hierarchy. There were many different prototypes; the one before this (in master) is trait-based, which is also quite horrible, as the functionality is split between tons of different traits that have to be in scope, and it leads to many other ugly problems.

What's nice about the deref-based approach: (1) you can pass an &h5::Dataset anywhere an &h5::Container or &h5::Location or even &h5::Object is required and it just works, without any crazy trait bounds or generics on the user side, kind of like in C++; (2) all parent methods just work, without having to bring anything into scope.
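As a self-contained illustration of point (1) (the structs below are stand-ins, not the crate's actual definitions): deref coercion is what lets an &Dataset go wherever an &Location is expected, and parent methods resolve without any trait imports.

```rust
use std::ops::Deref;

struct Location { id: i64 }
impl Location {
    fn name(&self) -> String { format!("object-{}", self.id) }
}

struct Container { loc: Location }
impl Deref for Container {
    type Target = Location;
    fn deref(&self) -> &Location { &self.loc }
}

struct Dataset { cont: Container }
impl Deref for Dataset {
    type Target = Container;
    fn deref(&self) -> &Container { &self.cont }
}

fn print_name(loc: &Location) {
    println!("{}", loc.name());
}

fn main() {
    let ds = Dataset { cont: Container { loc: Location { id: 42 } } };
    print_name(&ds);           // &Dataset coerces through two Deref steps to &Location
    println!("{}", ds.name()); // parent method found via autoderef, no trait imports needed
}
```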

What's not nice, mostly cosmetics/docs: (1) rustdoc doesn't go more than one level deep into deref impls; this is a known problem but unfortunately no one wants to spend time fixing it; (2) some IDEs like IntelliJ Rust may sometimes have problems with deep derefs.

Other than that, although we're breaking some conventions here, I think it leads to a better end-user experience (and the low-level stuff is pretty hardcore anyway, so a deref sure isn't the worst of it).


Reader and Writer are not being documented because they lie behind a private module. I don't think there's a problem with making container public.

Hmm... but container module is public?


Edit: I forgot to ask, how is the roadmap for chunked/streamed reading? Right now it seems that I have to fetch the whole data set to memory.

TBH I never really used that part myself, maybe because I typically have boxes with tons of memory :) But I've seen some h5py docs about it (the "slicing" / indexing bit, not the streaming).

So yeah, right now you have to fetch the whole thing. What would the API for chunked/streamed reading look like, hypothetically, and which HDF5 API calls would we be wrapping? (And does streamed reading make sense when there's by-chunk compression?)

aldanor commented 5 years ago

Re: file constructors -- I was thinking, in all my experience with HDF5, I don't think I have ever passed anything but a hard-coded string literal as the mode in h5py's File constructor. You typically know whether you are reading, writing or creating well in advance, and it's not going to change.

With this in mind, maybe five pre-defined constructors for File indeed make sense? (With mode strings thrown out, and without any support in the builder, so as not to clutter it.) This means the builder would also need five finalisers, I guess. Some of the builder's options only make sense when creating a file, so it's not entirely clean, but that's already a problem, and std::fs::OpenOptions suffers from it as well, so it's kinda normal.
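Just to visualize it (the constructor names below are guesses, not a settled API), the five modes discussed earlier could map one-to-one onto constructors like:

```rust
// Sketch only: a File stub with one constructor per mode string from earlier
// in the thread (names are illustrative, error type simplified to String).
pub struct File { /* file handle would go here */ }

type Result<T> = std::result::Result<T, String>;

impl File {
    pub fn open(_path: &str) -> Result<File> { todo!() }        // "r":      read-only, must exist
    pub fn open_rw(_path: &str) -> Result<File> { todo!() }     // "r+":     read/write, must exist
    pub fn create(_path: &str) -> Result<File> { todo!() }      // "w":      create, truncate if exists
    pub fn create_excl(_path: &str) -> Result<File> { todo!() } // "w-"/"x": create, fail if exists
    pub fn append(_path: &str) -> Result<File> { todo!() }      // "a":      open r/w, create if missing
}
```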

// Also, eventually, I'd like to split with_options() into open_options() (file access options only, ~ h5f access plist) and create_options() (file creation options, ~ h5f create plist), such that the latter mut-derefs into the former as well, so you can still configure access stuff when creating files. The five finalisers can then be split 2+3 accordingly, so it may end up being a bit cleaner.

Enet4 commented 5 years ago

which hurt my eyes a bit TBH :crying_cat_face: The first option becomes nice, but everything else is super ugly.

I would format them like this:

let f = h5::File::with_options()
    .create(true)
    .write(true)
    .append(true)
    .open("foo.h5")?;

Can't say it's very ugly in this form, but we're certainly on subjective ground.

Hmm... but container module is public?

Right... But hl is not, which in turn makes everything else hidden from outside this module. You can either re-export Reader and Writer into a public part of the tree or make hl public.

TBH I never really used that part myself, maybe because I typically have boxes with tons of memory :) But I've seen some h5py docs about it (the "slicing" / indexing bit, not the streaming).

So yeah, right now you have to fetch the whole thing. What would the API for chunked/streamed reading look like, hypothetically, and which HDF5 API calls would we be wrapping?

Well, this is the fun part! To the best of my knowledge, there is no public attempt at something of this sort in Rust. I have a similar feature in mind for nifti, but I don't quite know how this would translate in terms of both API and implementation details. I also don't know the native HDF5 API well enough for an educated guess.

However, we do know that h5py dataset objects are proxies to a disk-backed array. When a slice is requested, only the respective chunks of data need to be fetched from the file, only then resulting in an in-memory object of that slice. Caching is likely to be involved as well.
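Not an API proposal, just a toy illustration of that access pattern: iterate over row ranges of a conceptual 2-D dataset and fetch one block at a time (in HDF5 terms, each block would correspond to a hyperslab selection).

```rust
use std::ops::Range;

// Yield row ranges of a conceptual (total_rows x cols) dataset, one chunk at a time.
fn chunk_ranges(total_rows: usize, chunk_rows: usize) -> impl Iterator<Item = Range<usize>> {
    (0..total_rows)
        .step_by(chunk_rows)
        .map(move |start| start..(start + chunk_rows).min(total_rows))
}

fn main() {
    for rows in chunk_ranges(10, 4) {
        // A real implementation would turn each range into a hyperslab selection
        // and read only that block from the file, instead of the whole dataset.
        println!("would read rows {:?}", rows);
    }
}
```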

(and does streamed reading make sense when there's by-chunk compression?).

I feel that this form of reading makes even more sense when by-chunk compression is involved: it often means that we're dealing with very large data, so we don't want to keep it all in memory. The fact that h5py supports this might even be one of the reasons why some people keep using HDF5.

With this in mind, maybe five pre-defined constructors for File indeed make sense? (with mode strings thrown out, and without any support in the builder so as not to clutter it). This means the builder would also need five finalisers I guess. Some of the builder’s options only make sense when creating a file so it’s not entirely clean, but that’s already a problem, and std::fs::OpenOptions suffers from it as well so it’s kinda normal.

We can give that a try and see if it works out. My only concern is that a string-based mode is not very idiomatic.

aldanor commented 5 years ago

Right... But hl is not, which in turn makes everything else hidden from outside this module. You can either re-export Reader and Writer into a public part of the tree or make hl public.

Doh, yeah, apologies -- I've re-exported those at the root now (along with the Conversion enum), pushed already. There are probably a few more 'hidden' types here and there. The reason being: I was planning to add pure re-export modules later at the crate root, aggregating types from different modules into categories -- e.g. a file module would aggregate h5::hl::file, plist::file_create, plist::file_access, etc.

Well, this is the fun part!

Yeah, I guess :) We'll surely get to it, but only after there's a stable release with all the basics working. Syntax-wise, we may borrow some of it from ndarray -- maybe even their indexing machinery, traits and all.

We can give that a try and see if it works out. My only concern is that a string-based mode is not very idiomatic.

Deal.


Here's another possibility for brainstorming / bikeshedding :)

Another thing about the API for creating datasets -- a very common pattern in h5py is to create a dataset with data, instead of creating an empty dataset with a given shape and then writing to it later. It would definitely be nice to be able to do that; I will need to give it some thought and rework DatasetBuilder a bit.

The way h5py does it is pretty smart -- to avoid cluttering your file in case there's a write failure, it creates an anonymous dataset and writes to it, and only links it when the write succeeds (in create_dataset()).

Currently it's like this:

// named
ds = group.new_dataset::<T>().some().options().create("foo", arr.shape())?;
ds.write(&arr)?;

// anonymous
ds = group.new_dataset::<T>().some().options().create("foo", arr.shape())?;
ds.write(&arr)?;

Note how we have to specify the shape explicitly which is kind of redundant... Maybe:

// named
group.create_dataset::<T>("foo").some().options().empty()?;
group.create_dataset::<T>("foo").some().options().write(&arr)?;

// anonymous
group.create_dataset::<T>(None).some().options().empty()?;
group.create_dataset::<T>(None).some().options().write(&arr)?;

Here, .empty() and .write() are just guesses, e.g. the former could be new_empty(), the latter could be .with_data(), etc, idk. So, we have gotten rid of providing shape explicitly if we're writing to the dataset right after creating it anyway.

Allowing both &str and Option<&str> should be pretty straightforward, so as not to introduce extra constructors (I think).

Important: note that the element type of arr doesn't have to be T; because conversions are allowed, you could create an array of u32s and write a bunch of u16s there (why would you ever do that? but you technically could).

One thing we do have here, though, is a complete lack of type inference -- most (almost all) of the time, when creating a new dataset and immediately writing data to it, we'll want the datatypes to match. So to cover this most common case, we could force the finalizer (.write()) to only accept array views of type T (and if users desire otherwise, they can do it in two steps, first creating an empty dataset and then writing to it). Then type inference should work in a Rust-like fashion, I believe:

// named
group.create_dataset("foo").some().options().write(&arr)?;

// anonymous
group.create_dataset(None).some().options().write(&arr)?;

So now we don't have to provide the type and we don't have to provide the shape, both are inferred from the view.
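Here's a toy model of that inference argument (names made up, shapes ignored): because the finalizer's argument has element type T, writing builder.write(&vec) pins T without any turbofish.

```rust
use std::marker::PhantomData;

struct DatasetBuilder<T> {
    _marker: PhantomData<T>,
}

impl<T> DatasetBuilder<T> {
    fn new() -> Self {
        DatasetBuilder { _marker: PhantomData }
    }
    // The finalizer takes data of element type T, so T is inferred from the argument.
    fn write(self, data: &[T]) -> Result<(), String> {
        println!("would create the dataset with {} elements and write them", data.len());
        Ok(())
    }
}

// Hypothetical free function standing in for group.create_dataset("foo").
fn create_dataset<T>(_name: &str) -> DatasetBuilder<T> {
    DatasetBuilder::new()
}

fn main() -> Result<(), String> {
    let vec = vec![1.0_f64, 2.0, 3.0];
    // T = f64 and the length are both inferred from &vec; no turbofish, no shape.
    create_dataset("foo").write(&vec)?;
    Ok(())
}
```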

In the simplest "hello world" kind of case, when you're just creating a dataset and writing a vector of stuff to it, without configuring much, it would look like this, which is quite friendly, I think:

group.create_dataset("foo").write(&vec)?;
aldanor commented 5 years ago

There's obviously further work required (and more bikeshedding), but basic multi-dimensional dataset reading/writing has been implemented and tested, so I think this can be closed for now.