Closed weiji14 closed 1 year ago
What I am changing
How I did it
torchdata IterDataPipes from https://pytorch.org/data/0.6/torchdata.datapipes.iter.html:
IterableWrapper -> OnDiskCacheHolder -> HttpReader -> FlatMapper -> Demultiplexer -> Mapper -> Batcher -> InBatchShuffler -> Collator
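To make the stage order above easier to follow, here is a rough pure-Python analogue of what each step does to the data. It deliberately does not use torchdata: the helper functions, file names, split rule, and `read_groups` reader are all hypothetical stand-ins, and the download/cache steps (OnDiskCacheHolder, HttpReader) are skipped by starting from already-local file names.

```python
import random
from itertools import islice


def flat_map(items, fn):
    """FlatMapper analogue: fn returns several outputs per input, all flattened."""
    for item in items:
        yield from fn(item)


def demux(items, num_instances, classifier_fn):
    """Demultiplexer analogue (eager): route each item to one of N branches."""
    branches = [[] for _ in range(num_instances)]
    for item in items:
        branches[classifier_fn(item)].append(item)
    return branches


def batch(items, batch_size):
    """Batcher analogue: group consecutive items into lists of batch_size."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def in_batch_shuffle(batches, rng):
    """InBatchShuffler analogue: shuffle items inside each batch only."""
    for b in batches:
        shuffled = list(b)
        rng.shuffle(shuffled)
        yield shuffled


def collate(batches, collate_fn):
    """Collator analogue: turn each list of samples into one batch object."""
    for b in batches:
        yield collate_fn(b)


def read_groups(path):
    """Hypothetical reader: pretend each file holds four (name, value) samples."""
    stem = path.split(".")[0]
    return [(f"{stem}/sample{i}", i) for i in range(4)]


# IterableWrapper analogue: a plain list of (assumed already cached) files.
files = ["a.h5", "b.h5"]

samples = flat_map(files, read_groups)
# Hypothetical split rule: values below 3 go to train, the rest to validation.
train, val = demux(samples, 2, classifier_fn=lambda s: 0 if s[1] < 3 else 1)
# Mapper analogue: convert each value to float.
train = ((name, float(value)) for name, value in train)

dp_train = list(
    collate(
        in_batch_shuffle(batch(train, 4), random.Random(0)),
        collate_fn=lambda b: {
            "names": [n for n, _ in b],
            "values": [v for _, v in b],
        },
    )
)
```

With these toy inputs, six samples land in the train branch, so `dp_train` holds one full batch of four and a trailing batch of two. The real datapipes are lazy and the Demultiplexer buffers items between branches, which this eager sketch does not model.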
Current datapipeline visualized using torchdata.datapipes.utils.to_graph(dp=dp_train):

Ideally, the HDF5 files could be streamed directly from HuggingFace into a DataTree object (right now there is a download+cache step). There might be a way to do so using kerchunk.hdf.SingleHdf5ToZarr (which I've tried), but there are some weird errors that come down to not knowing how the HDF5 files are stored on the HuggingFace Spaces Git LFS storage provider. Some discussion over at https://discourse.pangeo.io/t/accessing-nested-hdf5-file-from-http-via-kerchunk/3432.
How you can test it
Related Issues
Adapted from some of my previous LightningDataModule code at:
See also torchgeo implementation at https://github.com/microsoft/torchgeo/pull/1259/files