Open GenevieveBuckley opened 3 years ago
There are a bunch of other issues discussing ideas for specific example data; I'm linking to them here:
Get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?
Pooch is only for downloading & extracting data. You give it a filename/url, and pooch fetches it for you.
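To make that concrete, here is a minimal sketch of the download-and-cache pattern with `pooch.retrieve`. The function name `fetch_example` and the cache folder name `"example-data"` are made up for illustration, and the URL and hash are whatever you pass in, so treat this as a rough outline rather than how any particular project wires it up.

```python
def fetch_example(url, known_hash=None):
    """Download `url` once, cache it locally, and return the local file path."""
    # Imported lazily so pooch is only required when data is actually requested.
    import pooch

    return pooch.retrieve(
        url=url,
        # A string such as "sha256:<hex digest>"; None skips the integrity check.
        known_hash=known_hash,
        # Hypothetical cache folder name; os_cache picks a per-user cache location.
        path=pooch.os_cache("example-data"),
    )
```

For zipped or tarred files, passing `processor=pooch.Unzip()` or `processor=pooch.Untar()` to `pooch.retrieve` handles the extraction step after the download.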
If you want to query portions of a dataset, you'd need that dataset to be stored in some kind of chunked format to begin with, and some idea about how you want to do that querying. So it could be possible with a remote HDF5 (or zarr?) array.
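As a toy illustration of why the chunked layout matters (this is not how HDF5 or Zarr are implemented, just the general idea), the sketch below stores a 1D array as independently compressed chunks and reads a slice by decoding only the chunks that overlap it. The chunk size and the helper names `write_chunks` and `read_slice` are made up for the example.

```python
import struct
import zlib

CHUNK = 1000  # values per chunk (arbitrary choice for this sketch)


def write_chunks(values, chunk=CHUNK):
    """Split `values` into fixed-size chunks and compress each one separately."""
    store = {}
    for i in range(0, len(values), chunk):
        block = values[i:i + chunk]
        store[i // chunk] = zlib.compress(struct.pack(f"<{len(block)}i", *block))
    return store


def read_slice(store, start, stop, chunk=CHUNK):
    """Return values[start:stop], decoding only the overlapping chunks.

    Assumes 0 <= start < stop <= total length.
    """
    out, touched = [], []
    for idx in range(start // chunk, (stop - 1) // chunk + 1):
        raw = zlib.decompress(store[idx])
        block = struct.unpack(f"<{len(raw) // 4}i", raw)
        lo = max(start - idx * chunk, 0)
        hi = min(stop - idx * chunk, len(block))
        out.extend(block[lo:hi])
        touched.append(idx)
    return out, touched


values = list(range(10_000))
store = write_chunks(values)
part, touched = read_slice(store, 2500, 3200)
print(f"read {len(part)} values from chunks {touched} out of {len(store)}")
```

With a remote store, each entry in `store` would be a separate object or byte range, so only the touched chunks would need to cross the network.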
One thing to consider would be download speed. I haven't done a bunch of testing, but it seems pretty common sense that zipped/tarred datasets will probably be transferred over the network quicker. So even with the extra time it takes to extract the data once it arrives, it might be quicker overall. That doesn't mean you have to do it that way, just one more thing to consider.
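A quick way to get a feel for the size side of that trade-off is to compress a payload and compare byte counts. The sketch below uses synthetic, very regular data and `zlib` from the standard library, so the ratio it prints says nothing about real image data; it only shows how one might measure it.

```python
import struct
import time
import zlib


def make_payload(n=200_000):
    """Build a synthetic, slowly varying uint16 payload (stand-in for image data)."""
    return struct.pack(f"<{n}H", *((i // 64) % 65536 for i in range(n)))


raw = make_payload()
compressed = zlib.compress(raw, level=6)

# Time the extraction step that happens after the download finishes.
t0 = time.perf_counter()
restored = zlib.decompress(compressed)
extract_seconds = time.perf_counter() - t0

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x, extract: {extract_seconds:.4f}s")
```

Whether the smaller transfer wins overall depends on the network speed and on how compressible the actual dataset is, so it is worth measuring on the real files.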
Yeah, Zarr supports Zstandard, which compresses pretty efficiently. There are some filesystems that use Zstandard. It's also being explored for Conda packages for the same reason (faster downloads, smaller packages, etc.).
We can also query datasets directly from the cloud with Zarr. Here's an example dataset on S3 ( https://github.com/zarr-developers/zarr-python/issues/385#issuecomment-452447219 ).
We can also cache downloaded chunks locally to ensure we only pull from a cloud store once.
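The caching idea can be sketched roughly as follows. `CountingRemote` is a hypothetical stand-in for a cloud store and `CachedStore` is a made-up wrapper, not Zarr's or fsspec's actual API; the point is only that a chunk is requested from the remote once and served from local disk afterwards.

```python
import tempfile
from pathlib import Path


class CountingRemote:
    """Hypothetical stand-in for a cloud store: maps chunk keys to bytes."""

    def __init__(self, chunks):
        self._chunks = dict(chunks)
        self.requests = 0  # how many times the "network" was hit

    def get(self, key):
        self.requests += 1
        return self._chunks[key]


class CachedStore:
    """Serve chunks from a local directory, falling back to the remote once."""

    def __init__(self, remote, cache_dir):
        self.remote = remote
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get(self, key):
        local = self.cache_dir / key.replace("/", "_")
        if local.exists():
            return local.read_bytes()  # cache hit: no remote request
        data = self.remote.get(key)    # cache miss: fetch and keep a copy
        local.write_bytes(data)
        return data


remote = CountingRemote({"0.0": b"chunk-a", "0.1": b"chunk-b"})
with tempfile.TemporaryDirectory() as tmp:
    store = CachedStore(remote, tmp)
    first = store.get("0.0")
    second = store.get("0.0")  # served from the local cache

print(f"remote requests: {remote.requests}")
```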
I think this really comes down to what size datasets would be used here. If they are small, maybe pooch is fine. If they are large, maybe Zarr would be better.
+1 for zarr wherever applicable
A discussion about synthetic data generation is here: https://github.com/napari/napari/issues/3608
I think demo dataset infrastructure would be useful.
I made a PR proposal for napari here: https://github.com/napari/napari/pull/3580 (it's based on scikit-image: they use pooch and like it)
We could have a combination of: