Closed — @yarikoptic closed this issue 3 years ago
@yarikoptic HDF5 compression is done by dataset. You could query each dataset to see if it is compressed. Is that what you want to do? pynwb does not have a flag that compresses all datasets. Alternatively, we could just automatically compress the entire HDF5, but that would not require any HDF5 programming.
Yep, it is per dataset, but it's unlikely any user would want to choose it for each dataset individually. I guess for detection it would be enough to check a few (e.g. 10) datasets within a file and treat compression on any of them as an indicator that compression was likely used.
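A sketch of what such a per-dataset check could look like with h5py (assuming h5py is available; the function name and the `limit` sampling parameter are hypothetical, not an existing ls feature):

```python
import h5py

def compression_stats(path, limit=None):
    """Count compressed vs. total datasets in an HDF5 file,
    optionally stopping after `limit` datasets have been seen."""
    stats = {"total": 0, "compressed": 0}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            stats["total"] += 1
            if obj.compression is not None:  # e.g. 'gzip' or 'lzf'
                stats["compressed"] += 1
            if limit is not None and stats["total"] >= limit:
                return True  # any non-None return stops visititems early

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return stats
```

`Dataset.compression` reports the filter name (`'gzip'`, `'lzf'`, ...) or `None`, so the ratio `compressed / total` gives the percentage discussed below.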
As for pynwb, I would suggest enabling compression by default (thus for all datasets), unless a targeted investigation of typical cases has shown a significant performance hit on typical operations.
It has been a design decision of pynwb to leave datasets plain by default. That means no compression and no chunking. If a user wants compression or chunking they must specify that for each dataset. What is the motivation behind checking datasets to see if some of them have been compressed?
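For reference, in pynwb the per-dataset opt-in is typically done by wrapping the data in `H5DataIO(data, compression='gzip')`; at the underlying h5py level the same request looks roughly like this sketch (file name and data are made up for illustration):

```python
import h5py
import numpy as np

# Compression and chunking are requested per dataset, at creation time.
with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "acquisition/data",
        data=np.random.rand(1000, 32),
        compression="gzip",      # per-dataset filter
        compression_opts=4,      # gzip level 0-9
        chunks=True,             # let h5py pick a chunk shape
    )
```

Note that compression implies chunking: HDF5 applies filters chunk by chunk, which is also what makes append mode possible.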
Motivation is the observed storage (and possibly traffic) waste of up to 90%.
Related observation - in neuroimaging the majority of data is compressed (.nii.gz), although uncompressed is an option and is used (rarely) for memory-mapped access.
Re the design decision - was there some open discussion or a document describing the reasoning? I might indeed be fighting windmills if compression would complicate some use cases or cause significant performance degradation, but it would be great to see the reasoning.
Ok, well, checking a few isn't going to tell you whether the biggest ones are compressed, since the query must be made for each dataset separately. I can't think of any strong reasons why datasets shouldn't be compressed by default. I like the idea of chunking by default because it would allow us to grow datasets in append mode. Good luck.
ok then, we will add a mode to ls to report the % of compressed datasets. My wild bet is that it is either 0% or (rarely) close to 100%, and nothing in the middle ;-) Since you are the one producing many of these files, you can beat me to it and prove me wrong! ;-)
Since it's optional, you would probably only expect to see it on the large datasets. You would have to do things in an awkward way to get the datasets in DynamicTables, for instance, to be compressed. Is there a reason you can't just compress the whole HDF5 file when transferring?
For transfer - that was the original question to @mgrauer. But built-in compression would help for any storage, and I would not be surprised if it actually sped up some operations (e.g. ls). Yet to investigate in practice.
@yarikoptic
I'm not sure what you specifically mean by "does girder support receiving compressed payload".
Girder considers files to be opaque blobs, so if you want to upload a compressed file or an uncompressed file, Girder doesn't care, nor will it know that the file is compressed or not.
This relates a bit to the discussion on ingest.
I meant something like https://en.m.wikipedia.org/wiki/HTTP_compression, where the original file/blob is not compressed; the client compresses it for the transfer and lets the server (girder) know that the payload needs to be uncompressed upon receipt.
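A minimal sketch of what such a client would do, using only the stdlib (the URL is a placeholder, and, as noted below, Girder does not honor this header out of the box — the server would have to decompress the body itself):

```python
import gzip
import urllib.request

def compressed_upload_request(url, payload: bytes):
    """Build an upload request whose body is gzip-compressed.
    The Content-Encoding header tells a cooperating server to
    decompress on receipt, so the stored object stays uncompressed."""
    req = urllib.request.Request(url, data=gzip.compress(payload), method="PUT")
    req.add_header("Content-Encoding", "gzip")
    return req
```

The same idea is what browsers and servers negotiate automatically for responses; for request bodies the server side must opt in explicitly.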
Girder does not support this behavior out of the box.
Why not just have the client compress the file, upload it, and store the compressed file? What is the need to store the uncompressed file on the server?
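That approach is just the .nii.gz pattern mentioned earlier, and needs nothing beyond the stdlib; a sketch (the helper name is made up):

```python
import gzip
import shutil

def gzip_file(src, dst=None):
    """Compress a file to .gz (analogous to .nii.gz in neuroimaging),
    so the compressed copy can be uploaded and stored as-is."""
    dst = dst or src + ".gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The trade-off is that the stored object is then opaque to range reads and to tools that expect a plain HDF5 file, which is exactly why per-dataset compression inside the HDF5 container is attractive.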
This discussion has been good for generating requirements for describing an ingest pipeline! We can discuss more when we meet up in person at SfN.
I don't think we would pursue any extra compression ATM
In light of https://github.com/dandi/dandi-cli/issues/21 it might be highly beneficial to compress files during upload.
@mgrauer - does girder support receiving compressed payload?
@bendichter - do you have a quick way/code to assess whether an HDF5 file used compression, so we could include that in ls output and dynamically decide whether to compress the payload to girder?