Performance optimization on Megatron-DeepSpeed data loader
This PR implements distributed data loading for Megatron-DeepSpeed. The original data loading does not scale well and has caused us a lot of pain in the past. I suggest waiting until I do some more testing before merging, in case it impacts our production runs, but folks are very welcome to take a look and provide feedback.
Performance issue with the original code:

DeepSpeed builds the BlendableDataset at the very beginning (https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/13171c23c00937d30430c422a3f33ba573c670fb/megatron/data/gpt_dataset.py#L72) on all the ranks concurrently. This does not scale well: after all the datasets are built on all the ranks, the BlendableDataset is built on top of them, and this step is very expensive when the number of dataset files is large (say 1000 files), taking on the order of hours to generate the BlendableDataset indices. The key function, build_indices, is expensive for a large number of dataset files, with a cost roughly proportional to the file count: for loading 2T tokens at 4k sequence length, it takes about 1 s for 1 file versus about 2000 s for 2419 files (Dolma v1.7).
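To see why the cost grows with the file count, here is a minimal Python sketch of how weighted-blending indices are typically constructed. This is only an approximation of the index building behind BlendableDataset (not the exact helper in the repo); the point is that every sample scans all datasets, so the total work is roughly O(num_samples × num_datasets).

```python
import numpy as np

def build_blending_indices(weights, size):
    """For each of `size` samples, pick the dataset whose realized share lags
    its target weight the most. The argmax scans every dataset, so the total
    cost is O(size * num_datasets)."""
    weights = np.asarray(weights, dtype=np.float64)
    num_datasets = len(weights)
    dataset_index = np.zeros(size, dtype=np.int64)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    samples_per_dataset = np.zeros(num_datasets, dtype=np.int64)

    for i in range(size):
        # How far behind its target share each dataset currently is.
        lag = weights * (i + 1) - samples_per_dataset
        chosen = int(np.argmax(lag))              # O(num_datasets) per sample
        dataset_index[i] = chosen
        dataset_sample_index[i] = samples_per_dataset[chosen]
        samples_per_dataset[chosen] += 1

    return dataset_index, dataset_sample_index
```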
Solution

The dataset objects are now built on the fly, and this build is done only once.
The on-the-fly dataset build runs in a background thread, so it is hidden behind the training.
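As an illustration, here is a minimal sketch of deferring an expensive dataset build to a background thread; `build_fn` is a hypothetical stand-in for the actual build routine, not the code in this PR.

```python
import threading

class LazyBackgroundDataset:
    """Starts an expensive dataset build in a background thread and only
    blocks on it the first time the dataset is actually needed."""

    def __init__(self, build_fn, *build_args):
        self._dataset = None
        self._build_fn = build_fn
        self._build_args = build_args
        self._thread = threading.Thread(target=self._build, daemon=True)
        self._thread.start()

    def _build(self):
        # Runs exactly once, overlapped with training start-up.
        self._dataset = self._build_fn(*self._build_args)

    def _wait(self):
        if self._dataset is None:
            self._thread.join()
        return self._dataset

    def __len__(self):
        return len(self._wait())

    def __getitem__(self, idx):
        return self._wait()[idx]
```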
For the build_indices issue, the trick is to reduce the number of datasets in the BlendableDataset construction. What I did was concatenate all the datasets that belong to the same corpus into a single dataset, and then build the BlendableDataset on top of that. I had to solve some subtle issues along the way to achieve this. This is addressed by #18.
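Below is a minimal sketch of the grouping idea, with hypothetical names (`file_datasets`, `file_weights`, `corpus_of`); the real change also has to preserve per-corpus weighting and sample ordering, which is where the subtle issues mentioned above come in.

```python
from collections import defaultdict
from torch.utils.data import ConcatDataset

def group_datasets_by_corpus(file_datasets, file_weights, corpus_of):
    """Collapse per-file datasets into one concatenated dataset per corpus,
    so the blendable dataset is built over tens of corpora instead of
    thousands of files."""
    grouped = defaultdict(list)
    grouped_weights = defaultdict(float)
    for ds, weight in zip(file_datasets, file_weights):
        corpus = corpus_of(ds)             # e.g. derived from the file prefix
        grouped[corpus].append(ds)
        grouped_weights[corpus] += weight  # corpus weight = sum of file weights

    corpora = sorted(grouped)
    datasets = [ConcatDataset(grouped[c]) for c in corpora]
    weights = [grouped_weights[c] for c in corpora]
    return datasets, weights
```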
Performance Improvement
Overall, the new version of the data loader achieves a 20x speedup for loading 2,000,000,000,000 (2T) tokens (Dolma v1.7).
The experiments were done on Sunspot, from 1 node to 16 nodes. We see consistent performance improvements of up to 20x.
My version 1 (https://github.com/argonne-lcf/Megatron-DeepSpeed/tree/distributed_loading) gets the green bars down but not the yellow bars. My version 2 (https://github.com/argonne-lcf/Megatron-DeepSpeed/tree/distributed_loading_v2) gets the yellow bars down by 100x by grouping all the datasets that belong to the same corpus together.
The performance evaluation is shown here: md_distributed_dataloader.pdf
Changes needed