dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

upload: option to compress traffic #23

Closed yarikoptic closed 3 years ago

yarikoptic commented 5 years ago

In light of https://github.com/dandi/dandi-cli/issues/21 it might be highly beneficial to compress files during upload.

@mgrauer - does girder support receiving compressed payload?

@bendichter - do you have a quick way/code to assess whether an hdf5 file used compression, so we could include that in the ls output and dynamically decide whether to compress the payload to girder?

bendichter commented 5 years ago

@yarikoptic HDF5 compression is done by dataset. You could query each dataset to see if it is compressed. Is that what you want to do? pynwb does not have a flag that compresses all datasets. Alternatively, we could just automatically compress the entire HDF5 file, but that would not require any HDF5 programming.

yarikoptic commented 5 years ago

Yep, it is per dataset, but it is unlikely any user would need/want to choose it per dataset. I guess for sensing it would be enough to check some (e.g. 10) datasets within a file and treat compression of any of them as an indicator that compression was likely used.

yarikoptic commented 5 years ago

As for pynwb, I would suggest enabling compression by default (thus for all datasets), unless a targeted investigation of typical cases showed a significant performance hit on typical operations.

bendichter commented 5 years ago

It has been a design decision of pynwb to leave datasets plain by default. That means no compression and no chunking. If a user wants compression or chunking they must specify that for each dataset. What is the motivation behind checking datasets to see if some of them have been compressed?

yarikoptic commented 5 years ago

The motivation is the observed waste of up to 90% in storage, and possibly in traffic.

yarikoptic commented 5 years ago

Related observation - in neuroimaging the majority of data is compressed (.nii.gz), although uncompressed is an option and is used (rarely) for memory-mapped access.

yarikoptic commented 5 years ago

Re the design decision - was there some open discussion or document describing the reasoning? I might indeed be tilting at windmills if compression would complicate some use cases or cause significant performance degradation, but it would be great to see the reasoning.

bendichter commented 5 years ago

Ok, well, checking a few isn't going to tell you whether the biggest ones are compressed, since the setting is made per dataset. I can't think of any strong reasons why datasets shouldn't be compressed by default. I like the idea of chunking by default because it would allow us to grow datasets in append mode. Good luck.

yarikoptic commented 5 years ago

ok then, we will add a mode to ls to get the % of compressed datasets. My wild bet is that it is either 0 or (very rarely) close to 100%, and nothing in the middle ;-) Since you are the one producing many of them, you can beat me to it and prove me wrong! ;-)

bendichter commented 5 years ago

Since it's optional you would probably only expect to see it on the large datasets. You would have to do things in an awkward way to get the datasets in DynamicTables for instance to be compressed. Is there a reason you can't just compress the whole HDF5 file when transferring?

yarikoptic commented 5 years ago

For transfer - that was the original question to @mgrauer. But built-in compression would help for any storage, and I would not be surprised if it actually sped up some operations (e.g. ls). Yet to investigate in practice.

mgrauer commented 5 years ago

@yarikoptic

I'm not sure what you specifically mean by

does girder support receiving compressed payload

Girder considers files to be opaque blobs, so if you want to upload a compressed file or an uncompressed file, Girder doesn't care, nor will it know that the file is compressed or not.

This relates a bit to the discussion on ingest.

yarikoptic commented 5 years ago

I meant something like https://en.m.wikipedia.org/wiki/HTTP_compression, where the original file/blob is not compressed; the client compresses it for the transfer and lets the server (girder) know that the file/blob needs to be uncompressed upon receipt.
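Transfer-level compression as described here would look roughly like the following sketch: the client gzips the payload and labels it with a Content-Encoding header so the server can transparently decompress it. The URL is a placeholder, and (as the next comment confirms) girder does not actually honor this out of the box.

```python
import gzip
import urllib.request


def compressed_upload_request(url, payload):
    """Build a POST request whose body is gzipped for transfer only;
    a cooperating server would decompress it on receipt."""
    body = gzip.compress(payload)
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Encoding": "gzip",  # tells the server to inflate the body
            "Content-Type": "application/octet-stream",
        },
    )
```

The key point is that only the bytes on the wire are compressed; the stored blob on both ends stays uncompressed, unlike the "store the .gz" alternative raised below.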

mgrauer commented 5 years ago

Girder does not support this behavior out of the box.

Why not just have the client compress the file and upload and store the compressed file? What is the need to store the uncompressed file on the server?

This discussion has been good for generating requirements for describing an ingest pipeline! We can discuss more when we meet up in person at SfN.

yarikoptic commented 3 years ago

I don't think we will pursue any extra compression ATM.