Modalities / modalities

Modalities is a PyTorch-native framework for distributed and reproducible foundation model training.

Forcing use of BatchSampler in DataLoader makes iterable-style datasets unusable #73

Closed: BlueCrescent closed this issue 6 months ago

BlueCrescent commented 7 months ago

We are currently forcing the use of a torch.utils.data.sampler.BatchSampler in our DataLoader:

https://github.com/Modalities/modalities/blob/080755503c12ba250b83ba2864d993d4a73dd934/src/modalities/dataloader/dataloader.py#L8-L30
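
For readers without the link handy, a schematic of the pattern in question (illustrative only, not the actual Modalities source; the class name and signature here are assumptions):

```python
from torch.utils.data import BatchSampler, DataLoader, Dataset

class LLMDataLoader(DataLoader):
    """Illustrative wrapper, not the real Modalities code: the constructor
    requires a BatchSampler and always forwards it to torch's DataLoader."""

    def __init__(self, dataset: Dataset, batch_sampler: BatchSampler, **kwargs):
        super().__init__(dataset=dataset, batch_sampler=batch_sampler, **kwargs)
```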

Since a sampler produces indices, this is incompatible with iterable-style datasets (i.e., datasets implementing torch.utils.data.IterableDataset).
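
To make the incompatibility concrete, here is a minimal reproduction (StreamDataset is a hypothetical stand-in): PyTorch rejects any batch_sampler when the dataset is iterable-style, because batch samplers yield indices that only map-style datasets can consume.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, IterableDataset, SequentialSampler

class StreamDataset(IterableDataset):
    """Hypothetical iterable-style dataset: yields samples, offers no index-based access."""
    def __iter__(self):
        for i in range(8):
            yield torch.tensor([i])

# BatchSampler yields lists of indices, which only map-style datasets can consume.
batch_sampler = BatchSampler(SequentialSampler(range(8)), batch_size=4, drop_last=False)

try:
    DataLoader(StreamDataset(), batch_sampler=batch_sampler)
except ValueError as err:
    # PyTorch rejects the combination already at construction time:
    # "DataLoader with IterableDataset: expected unspecified batch_sampler option ..."
    print(err)
```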

le1nux commented 6 months ago

Thanks for the proposal.

Since we always use memory-mapped (Mmap) datasets as the backbone for all of our datasets, I don't think we even want to support iterable-only datasets. They would not allow for random sampling approaches, and for distributing data across ranks we rely on a distributed sampler, which, as far as I can tell, would break with the FSDP implementation.
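
For context, a minimal sketch of the map-style setup described above (MmapBackedDataset is a hypothetical placeholder; num_replicas and rank are hard-coded so the snippet runs without an initialized process group):

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class MmapBackedDataset(Dataset):
    """Hypothetical stand-in for a memory-mapped, map-style dataset:
    random access by index is what makes distributed sampling possible."""
    def __len__(self) -> int:
        return 1024
    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.tensor([idx])

dataset = MmapBackedDataset()

# Each rank draws a disjoint, reproducibly shuffled shard of indices.
# In real training, num_replicas and rank come from torch.distributed.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=42)
loader = DataLoader(dataset, batch_sampler=BatchSampler(sampler, batch_size=8, drop_last=True))

for batch in loader:
    pass  # each rank iterates only over its own shard of the data
```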

I will close this, but feel free to reopen if we should discuss this further. 🙂 @BlueCrescent @mali-git