Modalities / modalities

Modalities is a PyTorch-native framework for distributed and reproducible foundation model training.

Forcing use of BatchSampler in DataLoader makes iterable-style datasets unusable #73

Closed: BlueCrescent closed this issue 6 months ago

BlueCrescent commented 7 months ago

We are currently forcing the use of a torch.utils.data.sampler.BatchSampler in our DataLoader:

https://github.com/Modalities/modalities/blob/080755503c12ba250b83ba2864d993d4a73dd934/src/modalities/dataloader/dataloader.py#L8-L30
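
For readers without the link handy, a schematic of the pattern in question (illustrative only, not the actual Modalities source; the class name and signature here are assumptions):

```python
from torch.utils.data import BatchSampler, DataLoader, Dataset

class LLMDataLoader(DataLoader):
    """Illustrative wrapper, not the real Modalities code: the constructor
    requires a BatchSampler and always forwards it to torch's DataLoader."""

    def __init__(self, dataset: Dataset, batch_sampler: BatchSampler, **kwargs):
        super().__init__(dataset=dataset, batch_sampler=batch_sampler, **kwargs)
```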

Since a sampler produces indices, this is incompatible with iterable-style datasets (i.e., datasets implementing torch.utils.data.IterableDataset).
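
To make the incompatibility concrete, here is a minimal reproduction (StreamDataset is a hypothetical stand-in): PyTorch rejects any batch_sampler when the dataset is iterable-style, because batch samplers yield indices that only map-style datasets can consume.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, IterableDataset, SequentialSampler

class StreamDataset(IterableDataset):
    """Hypothetical iterable-style dataset: yields samples, offers no index-based access."""
    def __iter__(self):
        for i in range(8):
            yield torch.tensor([i])

# BatchSampler yields lists of indices, which only map-style datasets can consume.
batch_sampler = BatchSampler(SequentialSampler(range(8)), batch_size=4, drop_last=False)

try:
    DataLoader(StreamDataset(), batch_sampler=batch_sampler)
except ValueError as err:
    # PyTorch rejects the combination already at construction time:
    # "DataLoader with IterableDataset: expected unspecified batch_sampler option ..."
    print(err)
```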

le1nux commented 6 months ago

Thanks for the proposal.

Since we always use memory-mapped (Mmap) datasets as the backbone for all of our datasets, I don't think we even want to support iterable-only datasets. They would not allow for random sampling approaches, and for distributing data across ranks we rely on a distributed sampler, which, as far as I can tell, would break with the FSDP implementation.
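
For context, a minimal sketch of the map-style setup described above (MmapBackedDataset is a hypothetical placeholder; num_replicas and rank are hard-coded so the snippet runs without an initialized process group):

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class MmapBackedDataset(Dataset):
    """Hypothetical stand-in for a memory-mapped, map-style dataset:
    random access by index is what makes distributed sampling possible."""
    def __len__(self) -> int:
        return 1024
    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.tensor([idx])

dataset = MmapBackedDataset()

# Each rank draws a disjoint, reproducibly shuffled shard of indices.
# In real training, num_replicas and rank come from torch.distributed.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=42)
loader = DataLoader(dataset, batch_sampler=BatchSampler(sampler, batch_size=8, drop_last=True))

for batch in loader:
    pass  # each rank iterates only over its own shard of the data
```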

I will close this, but feel free to reopen if we should discuss this further. 🙂 @BlueCrescent @mali-git