jamesbornholt opened 10 months ago
Related pull request for Megatron: https://github.com/NVIDIA/Megatron-LM/pull/729
The torchdata `IterableWrapper` is being deprecated in a future release, but it will still be present in PyTorch core. I've updated the code example above to point to that instead.
We currently don't have a built-in way to do sharding for `S3IterableDataset`, so every worker process in a `DataLoader` will see the same stream of objects. We should add a way to do this. In the meantime, something like this from torchdata will work as a workaround:
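A minimal sketch of the idea behind that workaround, assuming round-robin sharding like torchdata's `sharding_filter()` applies per `DataLoader` worker; the helper name `shard_for_worker` and the sample keys are hypothetical, not part of the library:

```python
import itertools

def shard_for_worker(source, worker_id, num_workers):
    """Round-robin shard: worker i keeps items i, i + N, i + 2N, ...

    This mirrors what torchdata's sharding_filter() does inside each
    DataLoader worker, so no two workers yield the same object.
    """
    return itertools.islice(source, worker_id, None, num_workers)

# In a real DataLoader worker you would read the rank at runtime, e.g.:
#   info = torch.utils.data.get_worker_info()
#   shard = shard_for_worker(s3_dataset, info.id, info.num_workers)
# Here we just demonstrate the splitting logic on a plain iterable:
keys = [f"obj-{i}" for i in range(8)]
shards = [list(shard_for_worker(keys, w, 4)) for w in range(4)]
print(shards)
# → [['obj-0', 'obj-4'], ['obj-1', 'obj-5'], ['obj-2', 'obj-6'], ['obj-3', 'obj-7']]
```

Together the shards cover every key exactly once, which is the property the built-in sharding support would need to guarantee.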