aldanor / hdf5-rust

HDF5 for Rust
https://docs.rs/hdf5
Apache License 2.0

How to use compression properly #122

Status: Closed (ogiermaitre closed this 3 years ago)

ogiermaitre commented 3 years ago

This is most certainly a silly question, but I'm not able to use compression properly. Here is the code I used:


let mut tmp = Vec::new();

// [...] here I put stuff into the vector

let mut ds_builder = group.new_dataset::<YearlyEntry>();
let ds_builder = ds_builder.gzip(4);

let ds = ds_builder.create(project, (tmp.len(), 1)).unwrap();
ds.write_raw(&tmp).unwrap();

In the end, the file is quite large (~2.1 GB), yet I can zip it down to ~90 MB. Did I miss something?

magnusuMET commented 3 years ago

@ogiermaitre You can check whether the dataset has the gzip filter using h5stat. If it is set, you might have to adjust the chunk size, as the automatically chosen chunk size can be a poor choice.

ogiermaitre commented 3 years ago

Thanks for your fast answer. According to h5stat, compression does seem to be enabled for my dataset:

Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 0
    Dataset layout counts[CHUNKED]: 1
    Dataset layout counts[VIRTUAL]: 0
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 0
        GZIP filter: 1
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0

I tried to set the chunk size, but I didn't find the right function for it, so I used the DatasetBuilder::chunk method, without any real improvement (it's not really clear to me how to compute a chunk shape that yields a given chunk size).

- No compression: 2.2 GB
- GZIP, default chunk shape: 2.1 GB (num_chunks: 1024)
- GZIP, shape (1_000, 1): 2.1 GB (num_chunks: 1505)
- GZIP, shape (1_000_000, 1): 2.1 GB (num_chunks: 2)

Is there a way to set the chunk size directly? I can provide the complete h5stat output if it helps.
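A common way to pick a chunk shape is to work backwards from a target chunk size in bytes. This is a minimal sketch in plain Rust (the helper name `rows_per_chunk` is hypothetical, not part of the hdf5 crate), assuming a known fixed on-disk record size:

```rust
// Hypothetical helper: given the on-disk size of one record (in bytes) and a
// target chunk size in bytes, compute how many rows one chunk should hold.
fn rows_per_chunk(record_size: usize, target_chunk_bytes: usize) -> usize {
    (target_chunk_bytes / record_size).max(1)
}

fn main() {
    // Suppose each record occupies 64 bytes and we aim for ~1 MiB chunks,
    // which is in the typical range for HDF5 chunked storage.
    let rows = rows_per_chunk(64, 1 << 20);
    println!("rows per chunk: {}", rows); // 16384

    // For a dataset of shape (n, 1), the chunk shape would then be (rows, 1),
    // which you would pass to DatasetBuilder::chunk.
}
```

The aim is simply that each chunk holds a meaningful amount of data so the gzip filter has something to compress, without making chunks so large that partial reads become expensive.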

magnusuMET commented 3 years ago

The chunk method on the DatasetBuilder should set the chunk shape (e.g. (1024*1024, 1) gives a chunk of 1024^2 entries). Is YearlyEntry a numeric type or some custom type?

ogiermaitre commented 3 years ago

Is something like ds_builder.chunk((10_000, 1)); a valid configuration?

The type is indeed a custom type, defined like this:

#[derive(hdf5::H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
struct YearlyEntry {
    id: i32,
    sum: usize,
    mean: f64,
    std: f64,
    journal: VarLenArray<i32>,
}

The usual length of the journal is 365 (or 366 depending on the year).

magnusuMET commented 3 years ago

Yeah, that would be a valid configuration. I think the problem is with the VarLenArray, which only stores a pointer to the heap, not the actual data: https://forum.hdfgroup.org/t/compression-in-variable-length-datasets-not-working/1276. It is unfortunate that compression is still not supported for variable-length arrays even now.
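The pointer-indirection point can be illustrated in plain Rust. The types below are illustrations only (the actual hdf5-types VarLenArray layout may differ): the filter pipeline compresses what is stored in the chunk, and for a variable-length field that is just a small descriptor, not the journal values themselves.

```rust
use std::mem::size_of;

// Illustration: a variable-length field is written to the dataset as a small
// descriptor (length + pointer to heap data). The gzip filter therefore only
// ever sees this descriptor, never the underlying i32 values.
#[repr(C)]
struct VarLenDescriptor {
    len: usize,
    ptr: *const i32,
}

// With a fixed-length array, the values live inline in the record, so they
// pass through the filter pipeline and actually get compressed.
#[repr(C)]
struct InlineJournal {
    data: [i32; 366],
}

fn main() {
    // On a 64-bit target: 16 bytes of descriptor vs. 366 * 4 = 1464 bytes
    // of compressible data per record.
    println!("{} {}", size_of::<VarLenDescriptor>(), size_of::<InlineJournal>());
}
```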

ogiermaitre commented 3 years ago

I have stored similar data in Python in the past, and the compression was fine; I guess it used some fixed-length array. I could easily use such an array here, but I didn't find any fixed-size array type in this crate.

Did I miss something, or is it impossible to store a fixed array of i32 (or u32) using the Rust hdf5 crate?

magnusuMET commented 3 years ago

I am not certain whether the fault lies with HDF5 itself or with this crate. A fixed array takes the form [i32; 366]; however, H5Type is not implemented for arrays of that length.

@aldanor Could you add 365 and 366 to the impl_array in hdf5-types?

aldanor commented 3 years ago

I wonder if we can just wait until 1.51, when min_const_generics lands (if that allows us to fix it)?

mulimoen commented 3 years ago

That is a hefty version bump, but it does indeed solve the problem. Maybe we could gate this behind a 'const_generics' feature, which we deprecate after some time, with the current impl as a fallback?

aldanor commented 3 years ago

Yeah, I didn't mean bumping the MSRV to 1.51, of course; more like a feature gate until 1.51 appears in all distros (probably by the end of 2021).

> we don't really specify a MSRV

^ which is pretty bad :)

aldanor commented 3 years ago

Fixed in #131 (we now support const-generics for arrays)
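The const-generics approach can be sketched in plain Rust. The `Descriptor` trait below is a hypothetical stand-in for the crate's H5Type machinery, just to show why Rust 1.51's min_const_generics removes the need to enumerate specific array lengths (such as 365 and 366) in a macro:

```rust
// Hypothetical stand-in for the crate's type-description trait.
trait Descriptor {
    fn type_len() -> usize;
}

impl Descriptor for i32 {
    fn type_len() -> usize {
        4
    }
}

// With const generics (Rust 1.51+), a single blanket impl covers every array
// length N, so [i32; 365] and [i32; 366] work without macro-enumerated sizes.
impl<T: Descriptor, const N: usize> Descriptor for [T; N] {
    fn type_len() -> usize {
        N * T::type_len()
    }
}

fn main() {
    println!("{}", <[i32; 366]>::type_len()); // 1464
}
```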