argonne-lcf / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Distributed data loading #16

Closed zhenghh04 closed 4 months ago

zhenghh04 commented 5 months ago

Performance optimization on Megatron-DeepSpeed data loader

This PR implements distributed data loading for Megatron-DeepSpeed. The original data loading does not scale well and has caused us a lot of pain in the past. I suggest waiting until I do some more testing before merging it, in case it impacts our production runs. But folks are very welcome to take a look and provide feedback.

Performance issue with the original code:

DeepSpeed tries to build the blendable dataset at the very beginning (https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/13171c23c00937d30430c422a3f33ba573c670fb/megatron/data/gpt_dataset.py#L72) on all ranks concurrently, which does not scale well.

Solution
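For illustration only, here is a minimal sketch of one way to avoid every rank rebuilding the dataset index: build it on a single rank and broadcast it with torch.distributed. The helper name below is hypothetical and this is not necessarily the exact approach taken in this PR.

```python
# Hypothetical sketch (not this PR's code): build the expensive dataset index
# on rank 0 only and broadcast it, instead of every rank building it concurrently.
import torch.distributed as dist

def build_index_on_one_rank(build_fn):
    """Run `build_fn()` on rank 0 and broadcast the result to all ranks."""
    obj = [build_fn() if dist.get_rank() == 0 else None]
    # broadcast_object_list pickles the object on rank 0 and unpickles it elsewhere
    dist.broadcast_object_list(obj, src=0)
    return obj[0]
```

On a large job, a pattern like this replaces N concurrent index builds (and N sets of filesystem reads) with a single build plus one collective.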

Performance Improvement

Overall, the new version of the data loader achieves a 20x speedup when loading 2 trillion tokens (Dolma v1.7).

The experiments were done on Sunspot on 1 to 16 nodes. We see consistent performance improvements of up to 20x.

My version 1 brings the green bars down but not the yellow bars: https://github.com/argonne-lcf/Megatron-DeepSpeed/tree/distributed_loading

My version 2 brings the yellow bars down by 100x by grouping all the datasets belonging to the same corpus together: https://github.com/argonne-lcf/Megatron-DeepSpeed/tree/distributed_loading_v2
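A hypothetical sketch of the corpus-grouping idea, assuming the corpus can be derived from each dataset's path prefix; the function name and path convention below are illustrative, not the PR's actual code.

```python
# Hypothetical sketch: group dataset path prefixes by corpus so per-corpus work
# is done once per corpus rather than once per shard.
from collections import defaultdict

def group_prefixes_by_corpus(data_prefixes):
    groups = defaultdict(list)
    for prefix in data_prefixes:
        corpus = prefix.split("/")[0]  # assumed '<corpus>/<shard>' layout
        groups[corpus].append(prefix)
    return groups
```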

The performance evaluation is shown here: md_distributed_dataloader.pdf

Changes needed

saforem2 commented 4 months ago

LGTM 👍 🚀