dask / dask-image

Distributed image processing
http://image.dask.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Demo dataset infrastructure #246

Open GenevieveBuckley opened 3 years ago

GenevieveBuckley commented 3 years ago

I think demo dataset infrastructure would be useful.

I made a PR proposal for napari here: https://github.com/napari/napari/pull/3580 (it's based on scikit-image's approach: they use pooch and like it)

We could have a combination of:

  1. Experimental datasets, and
  2. Synthetic datasets (it might be quicker to generate very large images than to download them; they just need to have interesting structures - see the sketch below)
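
A minimal sketch of what a lazily generated synthetic dataset could look like. The shape, chunks, and structure here are illustrative only, not a proposed API:

```python
import dask.array as da

# Lazily build a large synthetic "image": random noise over a smooth
# gradient. Nothing of this size is ever downloaded or held fully in
# memory; chunks are only computed on demand.
shape = (10_000, 10_000)
chunks = (1_000, 1_000)

noise = da.random.random(shape, chunks=chunks)
yy = da.arange(shape[0], chunks=chunks[0])[:, None]  # lazy row coordinates
xx = da.arange(shape[1], chunks=chunks[1])[None, :]  # lazy column coordinates
gradient = (yy + xx) / (shape[0] + shape[1])         # values in [0, 1]
image = 0.1 * noise + gradient                       # ~800 MB of float64, never materialized

print(image)  # dask.array<..., shape=(10000, 10000), chunksize=(1000, 1000)>
```
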
GenevieveBuckley commented 3 years ago

There are a bunch of other issues discussing ideas for specific example data; I'm linking to them here:

jakirkham commented 3 years ago

Get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

GenevieveBuckley commented 3 years ago

> Get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

Pooch is only for downloading & extracting data. You give it a filename/url, and pooch fetches it for you.
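
For illustration, the basic pooch workflow looks roughly like this. The URL and cache name are placeholders, not a real dask-image dataset:

```python
import pooch

# Sketch of the pooch workflow: given a URL (and ideally a known hash),
# pooch downloads the file once and caches it locally.
fname = pooch.retrieve(
    url="https://example.com/demo-image.tif",  # hypothetical URL
    known_hash=None,                           # real datasets should pin a sha256 hash here
    path=pooch.os_cache("dask-image-demo"),    # local cache directory (hypothetical name)
)
print(fname)  # path to the cached local copy
```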

If you want to query portions of a dataset, you'd need that dataset to be stored in some kind of chunked format to begin with, plus some idea of how you want to do that querying. So it could be possible with a remote HDF5 (or zarr?) array.

One thing to consider is download speed. I haven't done much testing, but it seems likely that zipped/tarred datasets will transfer over the network faster. So even with the extra time it takes to extract the data once it arrives, the compressed route might be quicker overall. That doesn't mean you have to do it that way; it's just one more thing to consider.

jakirkham commented 3 years ago

Yeah, Zarr supports Zstandard, which compresses quite efficiently. There are some filesystems that use Zstandard, and it's also being explored for Conda packages for the same reasons (faster downloads, smaller packages, etc.).
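
For example, writing a Zstandard-compressed Zarr array is just a matter of picking the compressor. Shape, chunks, and level here are illustrative:

```python
import zarr
from numcodecs import Zstd

# Sketch: create a chunked, Zstandard-compressed Zarr array on disk.
z = zarr.open(
    "demo.zarr",
    mode="w",
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),
    dtype="uint16",
    compressor=Zstd(level=3),  # Zstandard via numcodecs
)
```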

We can also query datasets directly from the cloud with Zarr. Here's an example dataset on S3 ( https://github.com/zarr-developers/zarr-python/issues/385#issuecomment-452447219 ).
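A sketch of what that direct cloud query could look like with dask; the bucket here is hypothetical, and storage options are passed through to fsspec/s3fs:

```python
import dask.array as da

# Open a remote Zarr array lazily; only the chunks covering the
# requested slice are actually fetched from S3.
arr = da.from_zarr(
    "s3://example-bucket/demo-image.zarr",  # hypothetical location
    storage_options={"anon": True},         # public, unauthenticated read
)
subset = arr[:256, :256].compute()          # fetches just a few chunks
```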

We can also cache downloaded chunks locally to ensure we only pull from a cloud store once.
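
One way to get that chunk-level caching is fsspec's `simplecache::` protocol; the bucket and cache directory here are hypothetical:

```python
import fsspec
import zarr
import dask.array as da

# Chain a local cache in front of the remote store: each chunk is
# downloaded from S3 at most once, then served from local disk.
mapper = fsspec.get_mapper(
    "simplecache::s3://example-bucket/demo-image.zarr",
    s3={"anon": True},                                       # options for the s3 layer
    simplecache={"cache_storage": "/tmp/dask-image-cache"},  # options for the cache layer
)
z = zarr.open(mapper, mode="r")
arr = da.from_array(z, chunks=z.chunks)  # dask chunks aligned to zarr chunks
```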

I think this really comes down to what size datasets would be used here. If they are small, maybe pooch is fine. If they are large, maybe Zarr would be better.

GenevieveBuckley commented 3 years ago

+1 for zarr wherever applicable

GenevieveBuckley commented 3 years ago

A discussion about synthetic data generation is here: https://github.com/napari/napari/issues/3608