[ENHANCEMENT] S3 data loading for MMapIndexedDataset

jrocmar commented 7 months ago

Is your feature request related to a problem? Please describe.

Megatron-LM does not support loading a dataset from S3.

Describe the solution you'd like

I would like to extend MMapIndexedDataset to support S3 data loading.

Describe alternatives you've considered

A user can download the dataset from S3 to the local file system at the start of training, but that blocks training until the download is complete. Alternatively, a user can store their dataset in a cloud file system. That can work well, but requires managing the cloud file system in addition to S3.

Proposed implementation

I initially implemented S3 data loading in a private fork of NeMo, and I have a public draft PR with that implementation here. However, NeMo now uses MMapIndexedDataset in Megatron-LM directly, so ~~I would like to port a similar implementation to Megatron-LM~~ I also ported a similar implementation to Megatron-LM here. In particular, the index file is downloaded to the local file system so that it can be memory mapped, and the bin file is streamed in chunks from S3. Note that the block shuffling functionality described in the NeMo PR is optional (we can just assume that the user has preshuffled the dataset).
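For readers who want a concrete picture of the chunked streaming described above, here is a minimal sketch. It is not the code from either PR. The names `ChunkedBinReader`, `make_s3_fetcher`, and `read_sequence` are hypothetical, and the sketch assumes the byte offset and token count of each sequence have already been read from the locally downloaded index file. Only `boto3.client("s3").get_object(..., Range=...)` is a real library call; everything else is illustrative.

```python
from typing import Callable

import numpy as np

# Hypothetical type: fetch_range(offset, length) returns `length` bytes
# of the remote .bin file starting at byte `offset`.
FetchRange = Callable[[int, int], bytes]


def make_s3_fetcher(bucket: str, key: str) -> FetchRange:
    """Build a range fetcher backed by S3 (needs boto3 and credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("s3")

    def fetch_range(offset: int, length: int) -> bytes:
        # HTTP byte ranges are inclusive on both ends.
        byte_range = f"bytes={offset}-{offset + length - 1}"
        response = client.get_object(Bucket=bucket, Key=key, Range=byte_range)
        return response["Body"].read()

    return fetch_range


class ChunkedBinReader:
    """Serve byte ranges of a remote .bin file from one cached chunk."""

    def __init__(self, fetch_range: FetchRange, file_size: int,
                 chunk_size: int = 256 * 1024 * 1024) -> None:
        self._fetch_range = fetch_range
        self._file_size = file_size
        self._chunk_size = chunk_size
        self._chunk_start = 0
        self._chunk = b""

    def read(self, offset: int, length: int) -> bytes:
        if offset < 0 or length < 0 or offset + length > self._file_size:
            raise ValueError("requested range is outside the file")
        out = bytearray()
        while length > 0:
            chunk_end = self._chunk_start + len(self._chunk)
            if not (self._chunk_start <= offset < chunk_end):
                # Cache miss: fetch the chunk-aligned block containing `offset`.
                self._chunk_start = (offset // self._chunk_size) * self._chunk_size
                size = min(self._chunk_size, self._file_size - self._chunk_start)
                self._chunk = self._fetch_range(self._chunk_start, size)
            begin = offset - self._chunk_start
            piece = self._chunk[begin:begin + length]
            if not piece:
                raise EOFError("fetcher returned fewer bytes than requested")
            out += piece
            offset += len(piece)
            length -= len(piece)
        return bytes(out)


def read_sequence(reader: ChunkedBinReader, pointer: int, num_tokens: int,
                  dtype=np.int32) -> np.ndarray:
    """Read one sequence given its byte offset and token count from the index."""
    num_bytes = num_tokens * np.dtype(dtype).itemsize
    return np.frombuffer(reader.read(pointer, num_bytes), dtype=dtype)
```

With real data, the reader would be built as `ChunkedBinReader(make_s3_fetcher(bucket, key), file_size)`, where `bucket`, `key`, and `file_size` are supplied by the caller. Because only one chunk is cached, this works best when sequences are read in roughly sequential order, which is why a preshuffled dataset or block-level shuffling fits this design.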

Additional context

ertkonuk commented 7 months ago

@jkamalu

jrocmar commented 6 months ago

@jkamalu: I'd love to get your feedback on this approach. Thanks in advance!

ahn1340 commented 6 months ago

I have also been training models with S3 data streaming by customizing the code as @jrocmar did, and it is interesting to see that other people have been doing the same. I think this would indeed be a nice feature to support officially.

github-actions[bot] commented 4 months ago

Marking as stale. No activity in 60 days.

NVIDIA / Megatron-LM

[ENHANCEMENT] S3 data loading for MMapIndexedDataset #698