@ogiermaitre You can check if the variable has the gzip filter using h5stat. If it is set, you might have to adjust the chunk size, as the automatic chunk size might be a poor choice.
Thanks for your fast answer. The compression seems to be enabled for my dataset, according to h5stat:
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 0
Dataset layout counts[CHUNKED]: 1
Dataset layout counts[VIRTUAL]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 0
GZIP filter: 1
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
I tried to set the chunk size, but I didn't find the right function to do it, so I tried the DatasetBuilder::chunk method, without any real improvement (it's not really clear to me how to compute the chunk shape in order to get a given chunk size).
- No compression: 2.2 GB
- GZIP with the default chunk shape: 2.1 GB (num_chunks: 1024)
- GZIP with shape (1_000, 1): 2.1 GB (num_chunks: 1505)
- GZIP with shape (1_000_000, 1): 2.1 GB (num_chunks: 2)
Is there a way to set the chunk size directly? I can provide the complete h5stat output if it helps.
The chunk method on the DatasetBuilder should set the chunk size (e.g. (1024*1024, 1) gives a chunk of 1024^2 entries). Is YearlyEntry a numeric type or some custom type?
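For illustration, a rough sketch of what this could look like when creating the dataset, assuming the builder methods discussed here (the filter call may be spelled gzip or deflate and the create signature may differ between crate versions, so treat the exact names as assumptions):

// Sketch only: a hypothetical helper that creates a chunked, gzip-compressed
// dataset of n records with shape (n, 1). Builder method names are assumed
// from the 0.7-style API and may differ in other crate versions.
fn create_compressed<T: hdf5::H5Type>(
    file: &hdf5::File,
    name: &str,
    n: usize,
) -> hdf5::Result<hdf5::Dataset> {
    file.new_dataset::<T>()
        .gzip(6)                // enable the gzip (deflate) filter at level 6
        .chunk((64 * 1024, 1))  // each chunk holds 64 * 1024 entries
        .create(name, (n, 1))   // dataset shape (n, 1), same rank as the chunk
}

The chunk shape here is only an example; the point is that one chunk should hold enough entries (roughly hundreds of KB to a few MB of raw data) for the filter to be effective.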
Is something like ds_builder.chunk((10_000, 1)) a valid configuration?
The type is indeed a custom type, defined like this:
#[derive(hdf5::H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
struct YearlyEntry {
id: i32,
sum: usize,
mean: f64,
std: f64,
journal: VarLenArray<i32>,
}
The usual length of the journal is 365 (or 366 depending on the year).
Yeah, that would be a valid configuration. I think the problem is with the VarLenArray, which only contains a pointer to the heap and not the actual data: https://forum.hdfgroup.org/t/compression-in-variable-length-datasets-not-working/1276. It is unfortunate that compression is still not supported for variable-length arrays even now.
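To make that concrete with the struct above: only the fixed-size part of each record lives in the chunked (and therefore filtered) dataset buffer, while the journal values go to the file's global heap, which the gzip filter never touches. A small sketch of the arithmetic (sizes are for a typical 64-bit target):

use std::mem::size_of;
use hdf5::types::VarLenArray;

#[derive(hdf5::H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
struct YearlyEntry {
    id: i32,
    sum: usize,
    mean: f64,
    std: f64,
    journal: VarLenArray<i32>,
}

fn main() {
    // The record only stores a (len, ptr) descriptor for the journal, so this
    // is all that ends up inside the compressible chunks:
    println!("fixed part per record: {} bytes", size_of::<YearlyEntry>());
    // The journal payload itself (~365 i32 values per record) is written to
    // the global heap, uncompressed, which is why the file stays near 2.1 GB:
    println!("journal payload per record: ~{} bytes", 365 * size_of::<i32>());
}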
I had stored similar data in Python in the past, and the compression was fine. I guess it used some fixed-length array. I could easily use such an array here, but I didn't find any FixedArray type in this crate.
Did I miss something, or is it impossible to store a fixed array of i32 (or u32) using the Rust hdf5 crate?
I am not certain whether the fault is with hdf5 or this crate. Such a fixed array would take the form [i32; 366]; H5Type is, however, not implemented for arrays of that length.
@aldanor Could you add 365 and 366 to the impl_array in hdf5-types?
I wonder if we can just wait until 1.51, when min_const_generics lands (if that allows us to fix it)?
That is a hefty version bump, but it does indeed solve the problem. Maybe we could gate this behind a 'const_generics' feature, which we deprecate after some time, with the current impl as a fallback?
Yeah, I didn't mean bumping the MSRV to 1.51 of course, more like a feature gate, until 1.51 appears in all distros (probably by the end of 2021).
We don't really specify an MSRV
^ which is pretty bad :)
Fixed in #131 (we now support const-generics for arrays)
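With const-generic array support in place, the fixed-length variant of the struct should now derive H5Type directly, so the journal data sits inline in each record, inside the chunks, and gets compressed by the gzip filter. A sketch (a 365-day year would simply leave the last slot unused or carry a sentinel value):

#[derive(hdf5::H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
struct YearlyEntry {
    id: i32,
    sum: usize,
    mean: f64,
    std: f64,
    // Fixed-size array: the 366 values are stored inline in the record,
    // so they are part of the chunked data and the gzip filter can compress them.
    journal: [i32; 366],
}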
This is most certainly a silly question, but I'm not able to use compression properly. This is the code I used:
In the end, the file is pretty large (~2.1 GB), and I'm able to zip it down to a ~90 MB file. Did I miss something?