ESA-PhiLab / Major-TOM

Expandable Datasets for Earth Observation
https://huggingface.co/Major-TOM

Add support for torch DataLoader and custom torchvision image transforms #4

Closed: miquel-espinosa closed this issue 3 months ago

miquel-espinosa commented 3 months ago

Hi. Thanks for the great work. I think it would be very handy for the MajorTOM class to support torch.utils.data.DataLoader, so that it can be loaded as shown below.

from torch.utils.data import DataLoader

# MajorTOM and filtered_df are assumed to be defined already
train_dataset = MajorTOM(
    df=filtered_df,
    local_dir='path/datasets/majorTOM/euro/L2A',
    tif_bands=[],
    png_bands=['thumbnail'],
)

train_dataset_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=False,
    drop_last=False,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,  # requires num_workers > 0
)

In the current version, passing the dataset directly to torch.utils.data.DataLoader raises errors, because neither the meta variable returned by __getitem__ nor the band images are tensor-friendly.

Metadata

I added a small helper function to convert attributes to tensors. Note, however, that there are some design decisions to consider (e.g. removing point geometry, datetime to float timestamp).

This is a naive implementation, so feel free to edit it as much as you want.
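For illustration only, here is a minimal sketch of what such a metadata helper could look like. It is not the code from the proposed change: the function name `meta_to_numeric` and the keys `geometry`, `timestamp`, `grid_cell` and `cloud_cover` are hypothetical, and treating naive datetimes as UTC is an assumption. It drops the geometry and turns datetimes into float timestamps, leaving values that the default DataLoader collation can batch (Python numbers become tensors, strings stay in lists).

```python
from datetime import datetime, timezone


def meta_to_numeric(meta, drop_keys=("geometry",)):
    """Copy `meta`, dropping `drop_keys` and turning datetimes into float timestamps.

    Hypothetical helper; key names are assumptions, not the MajorTOM schema.
    """
    out = {}
    for key, value in meta.items():
        if key in drop_keys:
            # Geometry objects cannot be collated into tensors, so skip them
            continue
        if isinstance(value, datetime):
            # Assumption: naive datetimes are interpreted as UTC
            if value.tzinfo is None:
                value = value.replace(tzinfo=timezone.utc)
            out[key] = value.timestamp()
        else:
            # Everything else is passed through unchanged
            out[key] = value
    return out


example_meta = {
    "geometry": object(),  # placeholder for a point geometry
    "timestamp": datetime(1970, 1, 2),
    "grid_cell": "example_cell",
    "cloud_cover": 12.5,
}
clean_meta = meta_to_numeric(example_meta)
```

The trade-off is the one mentioned above: the geometry is lost and the timestamp is no longer human-readable, so a caller that needs either has to keep the original metadata around.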

Images

It would be nice to allow torchvision.transforms. I have added very simple code that lets you pass custom transforms to MajorTOM, separately for TIFs and PNGs. The default is ToTensor().

Again, feel free to make any edits or comments on this.
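As a rough sketch of the idea (not the code from the proposed change), a dataset can accept one callable per image format and fall back to a default conversion. Everything here is hypothetical: `ToyPatchDataset`, the `tif_transform` and `png_transform` parameter names, and `to_float_chw`, which only imitates what ToTensor() does for uint8 height-width-channel input. Any callable can be passed in, which is how a torchvision.transforms.Compose pipeline would plug in as well.

```python
import torch
from torch.utils.data import Dataset


def to_float_chw(img):
    """Imitate ToTensor() for a uint8 (H, W, C) tensor: float (C, H, W) in [0, 1]."""
    return img.permute(2, 0, 1).float() / 255.0


class ToyPatchDataset(Dataset):
    """Hypothetical stand-in for a dataset with per-format transform hooks."""

    def __init__(self, patches, tif_transform=None, png_transform=None):
        # Each patch is a dict with raw uint8 (H, W, C) tensors under "tif" and "png"
        self.patches = patches
        # Fall back to the default conversion when no transform is given
        self.tif_transform = tif_transform or to_float_chw
        self.png_transform = png_transform or to_float_chw

    def __len__(self):
        return len(self.patches)

    def __getitem__(self, idx):
        patch = self.patches[idx]
        return {
            "tif": self.tif_transform(patch["tif"]),
            "png": self.png_transform(patch["png"]),
        }


raw_patches = [
    {
        "tif": torch.zeros(4, 4, 3, dtype=torch.uint8),
        "png": torch.full((4, 4, 3), 255, dtype=torch.uint8),
    }
]

# Default transforms for both formats
default_sample = ToyPatchDataset(raw_patches)[0]

# Custom transform for PNGs only: keep the (H, W, C) layout, just cast to float
custom_sample = ToyPatchDataset(raw_patches, png_transform=lambda x: x.float())[0]
```

Keeping the two hooks separate lets multispectral TIF bands and RGB thumbnails be normalised differently.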

mikonvergence commented 3 months ago

Hi @miquel-espinosa - this is great work, thank you! Indeed, we have not tested it with the data loader yet, so as you said, some of the data is not immediately ready to be put into batches and some changes are required.

We will review these changes in detail soon. After a quick look, though, I think it would be better to leave __getitem__ as is, so that the metadata keeps its original format (we could also add a parameter that controls whether metadata is returned at all). We could then design a collate_fn for use with a DataLoader that takes care of assembling the samples into batches. That way, we can define several different ways of batching the samples without making the main dataset class too convoluted.
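To make the collate_fn idea concrete, here is a minimal sketch under stated assumptions. `ToySamples` is a hypothetical stand-in that returns a `(bands, meta)` pair from __getitem__, which may not match the actual return type of MajorTOM, and `collate_keep_meta` is an illustrative name. The collate function stacks the image tensors per band and keeps the metadata as a plain list, so the dataset itself does not need to change.

```python
from datetime import datetime

import torch
from torch.utils.data import DataLoader, Dataset


class ToySamples(Dataset):
    """Hypothetical stand-in: returns (bands, meta) with non-tensor metadata."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        bands = {"thumbnail": torch.full((3, 8, 8), float(idx))}
        meta = {"grid_cell": f"cell_{idx}", "timestamp": datetime(2020, 1, 1 + idx)}
        return bands, meta


def collate_keep_meta(batch):
    """Stack image tensors per band; leave metadata untouched as a list of dicts."""
    bands, metas = zip(*batch)
    stacked = {
        name: torch.stack([sample[name] for sample in bands])
        for name in bands[0]
    }
    return stacked, list(metas)


loader = DataLoader(ToySamples(), batch_size=2, collate_fn=collate_keep_meta)
images, metas = next(iter(loader))
```

Other batching strategies (for example, dropping metadata entirely or converting it to tensors) could live in separate collate functions passed to the same DataLoader argument.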

I'll get back to you soon once I investigate in more detail!