kvantricht opened this issue 1 month ago
In fact, it seems such a case could be easily tackled by adding our own `__iter__` method to the `WorldCerealBase` dataset, so that initializing a normal DataLoader will just work as expected. What do you think @gabrieltseng @cbutsko ?
```python
def __iter__(self):
    # Yield samples in the order defined by self.indices,
    # including any duplicated entries.
    for idx in self.indices:
        yield self[idx]
```
Unless the whole idea is to avoid duplicated indices when using a plain DataLoader? However, I want to be able to duplicate indices to allow augmentation, while still making use of large batch sizes and multiprocessing through the DataLoader.
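To illustrate the idea, here is a minimal, torch-free sketch. `ToyDataset` is a hypothetical stand-in for `WorldCerealBase`; only the `indices` attribute and the proposed `__iter__` follow the suggestion above, everything else is assumed for the example:

```python
class ToyDataset:
    """Minimal stand-in for a map-style dataset such as WorldCerealBase.

    `indices` may contain duplicates, e.g. to oversample rare classes
    for balancing or augmentation.
    """

    def __init__(self, samples, indices):
        self.samples = samples
        self.indices = indices  # may contain duplicated entries

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

    def __iter__(self):
        # Proposed addition: honour self.indices (duplicates included)
        # instead of the implicit range(len(self)) pass that a
        # map-style DataLoader would otherwise make.
        for idx in self.indices:
            yield self[idx]


ds = ToyDataset(samples=["a", "b", "c"], indices=[0, 1, 1, 2, 2, 2])
print(list(ds))                              # ['a', 'b', 'b', 'c', 'c', 'c']
print([ds[i] for i in range(len(ds))])       # ['a', 'b', 'c']
```

The first print shows what iteration via `__iter__` yields (duplicates preserved); the second shows what a plain `range(len(dataset))` pass would see.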
@kvantricht to check this again. Should we add this to the tests as well?
We have to be really careful when running the eval task like this:
https://github.com/WorldCereal/presto-worldcereal/blob/ce3fae1bb1054ba0f8c60edb7b0a0edc76dbf3b2/presto/eval.py#L252:L277
We normally iterate manually through the dataset, which also works when we have duplicated indices (for example, for balancing). However, when simply iterating through a DataLoader object, it does not seem like we are iterating through all indices.
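To make the mismatch concrete, here is a torch-free sketch. The index list is hypothetical; the assumption is that a map-style DataLoader's default sampler (torch's `SequentialSampler`) draws indices from `range(len(dataset))`, so duplicates encoded in our own `indices` attribute are never seen:

```python
# Hypothetical balanced index list for a 3-sample dataset:
# samples 1 and 2 are oversampled for class balancing.
indices = [0, 1, 1, 2, 2, 2]
n_samples = 3

# Manual iteration, as in the eval loop: every entry in `indices`
# is visited, duplicates included.
manual_visits = [i for i in indices]

# A default map-style sampler instead draws from range(len(dataset)),
# so each sample is visited exactly once.
sampler_visits = list(range(n_samples))

print(manual_visits)   # [0, 1, 1, 2, 2, 2]
print(sampler_visits)  # [0, 1, 2]
```

This is why the two iteration styles disagree whenever `indices` contains duplicates.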