Check handling of `Dataloader` in eval task

kvantricht commented 1 month ago

We have to be really careful when running the eval task like this:

https://github.com/WorldCereal/presto-worldcereal/blob/ce3fae1bb1054ba0f8c60edb7b0a0edc76dbf3b2/presto/eval.py#L252:L277

We normally iterate manually through the dataset, like this:

for i in range(len(ds)):
    sample = ds[i]

This also works when the dataset contains duplicated indices, for example for balancing. However, when simply iterating through a `DataLoader` object, it does not seem that we iterate through all indices.
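A minimal sketch of the manual loop, assuming `__len__` returns the length of the (possibly duplicated) index list and `__getitem__` maps a position in that list to a row. `ToyBalancedDataset`, `rows`, and the sample values are hypothetical and not taken from the repository:

```python
from collections import Counter


class ToyBalancedDataset:
    """Hypothetical stand-in for a dataset that oversamples via `indices`."""

    def __init__(self, rows, indices):
        self.rows = rows        # underlying samples
        self.indices = indices  # row ids, possibly repeated for balancing

    def __len__(self):
        # Length counts duplicates, so range(len(ds)) covers every entry.
        return len(self.indices)

    def __getitem__(self, i):
        # `i` is a position in `indices`, not a row id.
        return self.rows[self.indices[i]]


# Row 2 is repeated three times to balance the classes.
ds = ToyBalancedDataset(rows=["a", "b", "c"], indices=[0, 1, 2, 2, 2])

# Manual iteration visits every entry of `indices`, duplicates included.
visited = Counter(ds[i] for i in range(len(ds)))
```

Under these assumptions `visited` counts the oversampled row three times, which is the behaviour the manual loop relies on.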

kvantricht commented 1 month ago

In fact, it seems this case could easily be handled by adding our own `__iter__` method to the `WorldCerealBase` dataset, so that initializing a normal `DataLoader` just works as expected. What do you think, @gabrieltseng @cbutsko?

def __iter__(self):
    # Yield one sample per entry in self.indices, duplicates included.
    for idx in self.indices:
        yield self.__getitem__(idx)

Unless the idea is to avoid duplicated indices altogether when using a plain `DataLoader`? I do want to be able to duplicate indices to allow augmentation, while still benefiting from the large batch sizes and multiprocessing that `DataLoader` provides.
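One point worth checking alongside the `__iter__` proposal: for a map-style dataset, PyTorch's `DataLoader` draws positions from a sampler (by default over `range(len(dataset))`) and calls `dataset[pos]`; it does not go through `__iter__`. The sketch below emulates that in plain Python so the two paths can be compared. `ToyDataset` and `sequential_batches` are hypothetical names, and the sketch assumes `__getitem__` takes a position in `indices`; if it instead expects raw row ids, the loop over `self.indices` from the comment above applies as written.

```python
class ToyDataset:
    """Hypothetical dataset; not the actual WorldCerealBase implementation."""

    def __init__(self, rows, indices):
        self.rows = rows
        self.indices = indices  # may contain repeated row ids

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, pos):
        # Assumption: `pos` is a position in `indices`.
        return self.rows[self.indices[pos]]

    def __iter__(self):
        # Iterate over positions so duplicated entries are yielded too.
        for pos in range(len(self)):
            yield self[pos]


def sequential_batches(dataset, batch_size):
    """Emulate a sequential map-style loader: positions 0..len-1, dataset[pos]."""
    batch = []
    for pos in range(len(dataset)):
        batch.append(dataset[pos])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # keep the final, smaller batch
        yield batch


ds = ToyDataset(rows=[10, 20, 30], indices=[0, 0, 1, 2, 2])
from_iter = list(ds)
from_batches = [x for batch in sequential_batches(ds, batch_size=2) for x in batch]
```

If both paths return the same samples, as they do here, the duplicates are reaching the loader; if they differ in the real dataset, `__len__` or `__getitem__` is the more likely place to look than `__iter__`.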

cbutsko commented 3 days ago

@kvantricht to check this again. Should we add it to tests as well?

WorldCereal / presto-worldcereal

Check handling of `Dataloader` in eval task #113