Closed BlueCrescent closed 6 months ago
Thanks for the proposal.
Since we always use memory-mapped (Mmap) datasets as the underlying backbone for all of our datasets, I don't think we want to support iterable-only datasets at all. They would not allow for random sampling approaches, and for distributing the data across ranks we rely on a distributed sampler, which, as far as I can tell, would break with the FSDP implementation.
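To illustrate the rank-distribution point: an index-based distributed sampler shards data by assigning each rank a disjoint slice of a (shuffled) index list. This is a hedged pure-Python sketch of that scheme (loosely mirroring what torch.utils.data.DistributedSampler does; `rank_indices` is a hypothetical helper, not Modalities code), and it only works when samples are addressable by index, i.e. with a map-style dataset:

```python
import random

def rank_indices(num_samples, rank, world_size, seed=0):
    """Hypothetical helper: the strided index shard for one rank."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # identical shuffle on every rank
    return indices[rank::world_size]      # disjoint strided shard per rank

shards = [rank_indices(8, rank, world_size=2) for rank in range(2)]
print(shards)

# The shards are disjoint and together cover all samples exactly once.
assert set(shards[0]).isdisjoint(shards[1])
assert sorted(shards[0] + shards[1]) == list(range(8))
```

An iterable-only dataset exposes no indices to shard over, so this whole approach (and with it our sampler-based rank distribution) has nothing to hook into.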
I will close this, but feel free to reopen if we should discuss this further. 🙂 @BlueCrescent @mali-git
We are currently forcing the use of a torch.utils.data.sampler.BatchSampler in our DataLoader:
https://github.com/Modalities/modalities/blob/080755503c12ba250b83ba2864d993d4a73dd934/src/modalities/dataloader/dataloader.py#L8-L30
Since a sampler produces indices, this is incompatible with iterable-style datasets (i.e. datasets implementing torch.utils.data.IterableDataset), which only define __iter__ and cannot be indexed.
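The incompatibility can be shown with a small pure-Python sketch (no torch required; the class and function names here are illustrative stand-ins, not Modalities or torch code): a batch sampler yields lists of indices, which a map-style dataset can resolve via __getitem__, while an iterable-style dataset has no notion of an index at all.

```python
class MapStyleDataset:
    """Map-style: exposes __getitem__/__len__, so a sampler can index it."""
    def __init__(self, items):
        self._items = items
    def __len__(self):
        return len(self._items)
    def __getitem__(self, idx):
        return self._items[idx]

class IterableStyleDataset:
    """Iterable-style: only __iter__; there is no notion of an index."""
    def __init__(self, items):
        self._items = items
    def __iter__(self):
        return iter(self._items)

def batch_sampler(num_samples, batch_size):
    """Yields lists of indices, like torch.utils.data.BatchSampler."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))

map_ds = MapStyleDataset(list("abcdefgh"))
batches = [[map_ds[i] for i in idxs] for idxs in batch_sampler(len(map_ds), 4)]
print(batches)  # [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]

iter_ds = IterableStyleDataset(list("abcdefgh"))
try:
    iter_ds[0]  # no __getitem__, so sampler-produced indices cannot be resolved
except TypeError as err:
    print("indexing failed:", err)
```

PyTorch enforces the same constraint at construction time: passing a sampler or batch_sampler to a DataLoader wrapping an IterableDataset raises a ValueError.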