developmentseed / chabud2023

Change detection for Burned area Delineation (ChaBuD) ECML/PKDD 2023 challenge

:monocle_face: DataPipe for loading ChaBuD 2023 HDF5 files #4

Closed: weiji14 closed this 1 year ago

weiji14 commented 1 year ago

What I am changing

A LightningDataModule for loading HDF5 files from the ChaBuD 2023 challenge website at:
- https://huggingface.co/datasets/chabud-team/chabud-ecml-pkdd2023/tree/main
- https://huggingface.co/datasets/chabud-team/chabud-extra/tree/main

How I did it

By chaining torchdata IterDataPipes from https://pytorch.org/data/0.6/torchdata.datapipes.iter.html in this order:
IterableWrapper -> OnDiskCacheHolder -> HttpReader -> FlatMapper -> Demultiplexer -> Mapper -> Batcher -> InBatchShuffler -> Collator
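
To make the data flow through that chain easier to follow, here is a rough plain-Python sketch of what each stage does to the stream. It is not the torchdata code from this PR: it uses only the standard library, skips the download and cache stages (OnDiskCacheHolder, HttpReader), and the `FAKE_FILES` layout, the field names (`fold`, `pre`, `post`, `mask`), the `val_fold` split rule and every function name are assumptions made up for illustration.

```python
import random
from itertools import islice

# Hypothetical stand-in for already-downloaded HDF5 files: each "file" maps a
# sample id to a record. Field names and values are illustrative only.
FAKE_FILES = [
    {"a": {"fold": 0, "pre": 1, "post": 2, "mask": 0},
     "b": {"fold": 1, "pre": 3, "post": 4, "mask": 1}},
    {"c": {"fold": 1, "pre": 5, "post": 6, "mask": 0},
     "d": {"fold": 2, "pre": 7, "post": 8, "mask": 1},
     "e": {"fold": 2, "pre": 9, "post": 10, "mask": 1}},
]


def flat_map(files):
    """FlatMapper role: one file in, many samples out."""
    for contents in files:
        for sample_id, record in contents.items():
            yield {"id": sample_id, **record}


def demux(samples, val_fold):
    """Demultiplexer role: route each sample to the train or validation split."""
    train, val = [], []
    for sample in samples:
        (val if sample["fold"] == val_fold else train).append(sample)
    return train, val


def to_pair(sample):
    """Mapper role: reshape one sample into an image/mask record."""
    return {
        "id": sample["id"],
        "image": (sample["pre"], sample["post"]),
        "mask": sample["mask"],
    }


def batch(samples, batch_size):
    """Batcher role: group consecutive samples into lists of batch_size."""
    iterator = iter(samples)
    while chunk := list(islice(iterator, batch_size)):
        yield chunk


def in_batch_shuffle(batches, rng):
    """InBatchShuffler role: reorder samples inside each batch only."""
    for chunk in batches:
        yield rng.sample(chunk, k=len(chunk))


def collate(batches):
    """Collator role: turn a list of records into a record of lists."""
    for chunk in batches:
        yield {key: [s[key] for s in chunk] for key in ("id", "image", "mask")}


def build(files, batch_size=2, val_fold=0, seed=0):
    """Wire the stages together in the same order as the chain above."""
    train, val = demux(flat_map(files), val_fold)
    rng = random.Random(seed)
    train_batches = collate(
        in_batch_shuffle(batch(map(to_pair, train), batch_size), rng)
    )
    val_batches = collate(batch(map(to_pair, val), batch_size))
    return list(train_batches), list(val_batches)


train_batches, val_batches = build(FAKE_FILES)
```

In this toy version the shuffle only changes the order within a batch, so which samples land in which batch stays fixed. Whether the validation branch in the real pipeline also skips shuffling is a guess on my part.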

Current datapipeline visualized using torchdata.datapipes.utils.to_graph(dp=dp_train):

[Image: hdf5datapipeline]

Ideally, the HDF5 files could be streamed directly from HuggingFace into a DataTree object (right now there is a download+cache step). There might be a way to do so using kerchunk.hdf.SingleHdf5ToZarr (which I've tried), but there are some weird errors that come down to not knowing how the HDF5 files are stored on the HuggingFace Spaces Git LFS storage provider. Some discussion over at https://discourse.pangeo.io/t/accessing-nested-hdf5-file-from-http-via-kerchunk/3432.

How you can test it

Related Issues

Adapted from some of my previous LightningDataModule code at:

See also torchgeo implementation at https://github.com/microsoft/torchgeo/pull/1259/files