aldanor / hdf5-rust

HDF5 for Rust
https://docs.rs/hdf5
Apache License 2.0

Runtime creation of composite structures #128

Closed Chayatan closed 2 years ago

Chayatan commented 3 years ago

I am just starting to use the crate, and the sample example doesn't seem to cover all of its capabilities. Could anyone please guide me in creating an HDF5 file with this crate that holds the contents of a composite structure like the following?

u64    i64    f64        bool    string
0      1      2.5        T       one
1      -5     6.0        F       two
2      9      10.2432    F       three

Thanks in advance!

mulimoen commented 3 years ago

By adapting the simple example, you get something like this for your use case:

// Brings `from_str` into scope for `VarLenUnicode`.
use std::str::FromStr;

#[derive(Clone, Debug, hdf5::H5Type)]
#[repr(C)]
struct Composite {
    u64: u64,
    i64: i64,
    f64: f64,
    bool: bool,
    string: hdf5::types::VarLenUnicode,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = hdf5::File::create("output.hdf")?;
    // Create a one-dimensional dataset holding three `Composite` records.
    let composite = file.new_dataset::<Composite>().create("composite", (3,))?;
    composite.write(&ndarray::arr1(&[
        Composite {
            u64: 0,
            i64: 1,
            f64: 2.5,
            bool: true,
            string: hdf5::types::VarLenUnicode::from_str("one")?,
        },
        Composite {
            u64: 1,
            i64: -5,
            f64: 6.0,
            bool: false,
            string: hdf5::types::VarLenUnicode::from_str("two")?,
        },
        Composite {
            u64: 2,
            i64: 9,
            f64: 10.2432,
            bool: false,
            string: hdf5::types::VarLenUnicode::from_str("three")?,
        },
    ]))?;

    Ok(())
}

Chayatan commented 3 years ago

Hey @mulimoen, thanks for sharing this code. It is a good start for me; however, in this solution the struct is hard-coded, and I want the struct to be decided at runtime.

Let's say the user decides the column names and types at the beginning and then keeps appending row data accordingly (the number of rows is also unknown at compile time).

Any ideas or comments on achieving this kind of scenario would be appreciated.

mulimoen commented 3 years ago

You will need to implement H5Type for your compound type. Have a look at hdf5-types/src/h5type.rs to see how to create a CompoundType.

aldanor commented 3 years ago

@mulimoen Technically, I guess we could add unsafe methods for reading and writing that let the user provide a TypeDescriptor and a slice of data of type T: Copy, where mem::size_of::<T>() == type_descriptor.size(), or something like that (given that Reader/Writer really only use H5Type to extract the type descriptor). Or maybe not even require a type descriptor at all, so that only copyability and size are checked.

Then the only remaining question is how to create datasets with dynamic type descriptors, but IIRC I've already added that in the feature/dcpl branch (and if not, we can add it).
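A minimal sketch of the size check described above, using only the standard library. The names `Kind`, `Descriptor`, `fits`, and `Row` are hypothetical and chosen for illustration; they are not the crate's `TypeDescriptor` or any existing hdf5 API. The sketch assumes the same layout rule as `#[repr(C)]`.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical scalar kinds; not the hdf5 crate's `TypeDescriptor`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    U64,
    I64,
    F64,
    Bool,
}

impl Kind {
    /// Size and alignment of the Rust type this kind stands for.
    fn size_align(self) -> (usize, usize) {
        match self {
            Kind::U64 => (size_of::<u64>(), align_of::<u64>()),
            Kind::I64 => (size_of::<i64>(), align_of::<i64>()),
            Kind::F64 => (size_of::<f64>(), align_of::<f64>()),
            Kind::Bool => (size_of::<bool>(), align_of::<bool>()),
        }
    }
}

/// A compound layout built at runtime: field names, offsets, total size.
#[derive(Debug)]
struct Descriptor {
    fields: Vec<(String, Kind, usize)>, // (name, kind, byte offset)
    size: usize,
}

impl Descriptor {
    /// Lays fields out the way `#[repr(C)]` would: each field is aligned,
    /// and the total size is rounded up to the largest alignment.
    fn new(columns: &[(&str, Kind)]) -> Self {
        let mut fields = Vec::new();
        let (mut offset, mut max_align) = (0usize, 1usize);
        for &(name, kind) in columns {
            let (size, align) = kind.size_align();
            offset = (offset + align - 1) / align * align;
            fields.push((name.to_string(), kind, offset));
            offset += size;
            max_align = max_align.max(align);
        }
        let size = (offset + max_align - 1) / max_align * max_align;
        Descriptor { fields, size }
    }

    /// The proposed check: `T` must be exactly as large as the descriptor.
    fn fits<T: Copy>(&self) -> bool {
        size_of::<T>() == self.size
    }
}

/// A compile-time struct with the same layout, used only to test the check.
#[derive(Clone, Copy)]
#[repr(C)]
struct Row {
    id: u64,
    delta: i64,
    value: f64,
    flag: bool,
}

fn main() {
    let desc = Descriptor::new(&[
        ("id", Kind::U64),
        ("delta", Kind::I64),
        ("value", Kind::F64),
        ("flag", Kind::Bool),
    ]);
    println!("{:?}", desc);
    println!("Row fits: {}", desc.fits::<Row>());
    println!("u8 fits: {}", desc.fits::<u8>());
}
```

A size match alone does not prove that the field types agree, which is presumably why such methods would have to be unsafe.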

mulimoen commented 3 years ago

@aldanor I guess this would be problematic with arrays/strings that can't be Copy? The example does have a string which would make this problematic. I "solved" this problem in netcdf by requiring composite types to be read as a binary blob which the user must decode themselves, including freeing memory where applicable. This is not ideal however.

aldanor commented 3 years ago

@mulimoen I think this might be a good start even if it's only supported for copyable types; that would already cover many use cases for dynamic compound types. Writing non-copyable types is not a problem, obviously. For reading, I think it's possible to make it nice: there would have to be some sort of wrapper around ndarray::Array that knows the in-memory type descriptor. It would implement Drop manually, running over all entries and destroying them, and it would provide a strided view of each field (by name), kind of like a very simplified pandas dataframe.
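To illustrate only the "strided view of each field (by name)" part of this idea, here is a rough standard-library sketch. `RawTable` and `f64_column` are hypothetical names, not part of the hdf5 crate or ndarray. The sketch assumes fixed-size rows of `f64` fields and leaves out the manual Drop needed for non-copyable entries such as variable-length strings.

```rust
use std::collections::HashMap;

/// Hypothetical row-major table: `rows * row_size` raw bytes plus the byte
/// offset of each `f64` field. Not part of the hdf5 crate.
struct RawTable {
    bytes: Vec<u8>,
    row_size: usize, // must be non-zero
    f64_offsets: HashMap<String, usize>,
}

impl RawTable {
    /// Iterates over one `f64` field by name, stepping `row_size` bytes per row.
    /// Returns `None` if no field with that name exists.
    fn f64_column<'a>(&'a self, name: &str) -> Option<impl Iterator<Item = f64> + 'a> {
        let offset = *self.f64_offsets.get(name)?;
        Some(self.bytes.chunks_exact(self.row_size).map(move |row| {
            let mut raw = [0u8; 8];
            raw.copy_from_slice(&row[offset..offset + 8]);
            f64::from_ne_bytes(raw)
        }))
    }
}

fn main() {
    // Two fields per row: "x" at offset 0 and "y" at offset 8 (16 bytes per row).
    let rows = [(1.0f64, 10.0f64), (2.0, 20.0), (3.0, 30.0)];
    let mut bytes = Vec::new();
    for &(x, y) in rows.iter() {
        bytes.extend_from_slice(&x.to_ne_bytes());
        bytes.extend_from_slice(&y.to_ne_bytes());
    }
    let mut f64_offsets = HashMap::new();
    f64_offsets.insert("x".to_string(), 0);
    f64_offsets.insert("y".to_string(), 8);
    let table = RawTable { bytes, row_size: 16, f64_offsets };

    // Read the "y" field of every row without copying whole rows.
    let ys: Vec<f64> = table.f64_column("y").unwrap().collect();
    println!("{:?}", ys);
}
```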

aldanor commented 3 years ago

@Chayatan

> this is a good start for me, however in this solution, the struct is hard coded and I want this struct to be decided at runtime.

Maybe if you could explain to us what exactly you mean by "struct to be decided at runtime" we can be of more help. Rust is not Python, so you can't "decide structs at runtime", your data has to be in some format already.

Chayatan commented 3 years ago

@aldanor @mulimoen For example, suppose I were writing an interactive application in which the user:

  1. defines the column names with datatype info (col1: int, col2: float, col3: int, col4: string)
  2. defines the number of rows
  3. starts entering/appending row data, once for each of the num_of_rows rows

The problem with the given solution is that I don't have the flexibility to change the number of columns or their types after the code is compiled.

struct Composite {
    u64: u64,
    i64: i64,
    f64: f64,
    bool: bool,
    string: hdf5::types::VarLenUnicode,
}

What if the user wanted to store a different set of data in a second run, such as a table of just integers and strings?

struct Composite {
    u64: u64,
    string: hdf5::types::VarLenUnicode,
}

aldanor commented 3 years ago

@Chayatan In this case, why would you use a struct? You would probably store your "struct" as an HDF5 group, with each column being a separate dataset.

(or use another storage solution like Arrow/Parquet for pure columnar access in a unified dataset if that's the goal)
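The column-per-dataset idea can be sketched on the Rust side without any struct. This is a standard-library-only illustration under assumptions: `Column`, `Table`, `from_spec`, and `push_row` are hypothetical names, and the `name:type` spec format simply mirrors the column list given earlier in the thread. Each `Column` would then correspond to one dataset inside an HDF5 group.

```rust
/// One column of a runtime-defined table; each variant owns its own values.
#[derive(Debug, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Float(Vec<f64>),
    Text(Vec<String>),
}

/// Hypothetical table whose column names and types are chosen at runtime.
#[derive(Debug)]
struct Table {
    columns: Vec<(String, Column)>,
}

impl Table {
    /// Builds empty columns from a spec such as "col1:int, col2:float".
    fn from_spec(spec: &str) -> Result<Self, String> {
        let mut columns = Vec::new();
        for part in spec.split(',') {
            let (name, ty) = part
                .split_once(':')
                .ok_or_else(|| format!("missing ':' in '{}'", part.trim()))?;
            let column = match ty.trim() {
                "int" => Column::Int(Vec::new()),
                "float" => Column::Float(Vec::new()),
                "string" => Column::Text(Vec::new()),
                other => return Err(format!("unknown type '{}'", other)),
            };
            columns.push((name.trim().to_string(), column));
        }
        Ok(Table { columns })
    }

    /// Parses one row of text cells and appends each cell to its column.
    fn push_row(&mut self, cells: &[&str]) -> Result<(), String> {
        if cells.len() != self.columns.len() {
            return Err(format!(
                "expected {} cells, got {}",
                self.columns.len(),
                cells.len()
            ));
        }
        // Validate every cell first, so a bad row leaves the table unchanged.
        for ((name, column), cell) in self.columns.iter().zip(cells) {
            let ok = match column {
                Column::Int(_) => cell.trim().parse::<i64>().is_ok(),
                Column::Float(_) => cell.trim().parse::<f64>().is_ok(),
                Column::Text(_) => true,
            };
            if !ok {
                return Err(format!("bad value '{}' for column '{}'", cell, name));
            }
        }
        // All cells are valid, so the `unwrap` calls below cannot fail.
        for ((_, column), cell) in self.columns.iter_mut().zip(cells) {
            match column {
                Column::Int(v) => v.push(cell.trim().parse().unwrap()),
                Column::Float(v) => v.push(cell.trim().parse().unwrap()),
                Column::Text(v) => v.push(cell.trim().to_string()),
            }
        }
        Ok(())
    }
}

fn main() -> Result<(), String> {
    let mut table = Table::from_spec("col1:int, col2:float, col3:int, col4:string")?;
    table.push_row(&["1", "2.5", "3", "one"])?;
    table.push_row(&["-5", "6.0", "9", "two"])?;
    for (name, column) in &table.columns {
        println!("{}: {:?}", name, column);
    }
    Ok(())
}
```

Because every column is a plain `Vec` of one scalar type, the number of columns and rows can vary between runs without recompiling.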

Chayatan commented 3 years ago

Storing each column as a separate dataset in an HDF5 group sounds great. Could you please share an example of how to create a group using this crate?

aldanor commented 3 years ago

@Chayatan Use file.create_group() and group.new_dataset(). Basically, a File is a Group (it's more or less the same thing), and a group is like a subfolder.

aldanor commented 2 years ago

Resolved (the OP was happy to store the variables/columns in separate datasets).