Performance optimization on Megatron-DeepSpeed data loader
This PR implements distributed data loading for Megatron-DeepSpeed. The original data loading does not scale well and has caused us a lot of pain in the past. I suggest waiting until I do some more testing before merging, in case it impacts our production runs, but folks are very welcome to take a look and provide feedback.
Performance issue with the original code:

DeepSpeed builds the BlendableDataset at the very beginning (https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/13171c23c00937d30430c422a3f33ba573c670fb/megatron/data/gpt_dataset.py#L72) on all the ranks concurrently. This does not scale well: after all the datasets are built on all the ranks, the BlendableDataset is built on top of them, and this step is very expensive when the number of dataset files is large (say 1000 files), taking on the order of hours to generate the BlendableDataset indices. The key function, build_indices, is expensive for a large number of dataset files, with a cost roughly proportional to the file count: for loading 2T tokens at 4k sequence length, it takes about 1 s for 1 file versus about 2000 s for 2419 files (Dolma v1.7).
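To see why the cost grows with the file count, here is a minimal Python sketch of how weighted-blending indices are typically constructed. This is only an approximation of the index building behind BlendableDataset (not the exact helper in the repo); the point is that every sample scans all datasets, so the total work is roughly O(num_samples × num_datasets).

```python
import numpy as np

def build_blending_indices(weights, size):
    """For each of `size` samples, pick the dataset whose realized share lags
    its target weight the most. The argmax scans every dataset, so the total
    cost is O(size * num_datasets)."""
    weights = np.asarray(weights, dtype=np.float64)
    num_datasets = len(weights)
    dataset_index = np.zeros(size, dtype=np.int64)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    samples_per_dataset = np.zeros(num_datasets, dtype=np.int64)

    for i in range(size):
        # How far behind its target share each dataset currently is.
        lag = weights * (i + 1) - samples_per_dataset
        chosen = int(np.argmax(lag))              # O(num_datasets) per sample
        dataset_index[i] = chosen
        dataset_sample_index[i] = samples_per_dataset[chosen]
        samples_per_dataset[chosen] += 1

    return dataset_index, dataset_sample_index
```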
Solution

The dataset objects are now built on the fly, and this build is done only once.
The on-the-fly dataset build runs in a background thread, so it is hidden behind the training.
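As an illustration, here is a minimal sketch of deferring an expensive dataset build to a background thread; `build_fn` is a hypothetical stand-in for the actual build routine, not the code in this PR.

```python
import threading

class LazyBackgroundDataset:
    """Starts an expensive dataset build in a background thread and only
    blocks on it the first time the dataset is actually needed."""

    def __init__(self, build_fn, *build_args):
        self._dataset = None
        self._build_fn = build_fn
        self._build_args = build_args
        self._thread = threading.Thread(target=self._build, daemon=True)
        self._thread.start()

    def _build(self):
        # Runs exactly once, overlapped with training start-up.
        self._dataset = self._build_fn(*self._build_args)

    def _wait(self):
        if self._dataset is None:
            self._thread.join()
        return self._dataset

    def __len__(self):
        return len(self._wait())

    def __getitem__(self, idx):
        return self._wait()[idx]
```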
For the build_indices issue, the trick is to reduce the number of datasets in the BlendableDataset construction. What I did was concatenate all the datasets that belong to the same corpus into a single dataset, and then build the BlendableDataset on top of that. I had to solve some subtle issues along the way to achieve this. This is addressed by #18.
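Below is a minimal sketch of the grouping idea, with hypothetical names (`file_datasets`, `file_weights`, `corpus_of`); the real change also has to preserve per-corpus weighting and sample ordering, which is where the subtle issues mentioned above come in.

```python
from collections import defaultdict
from torch.utils.data import ConcatDataset

def group_datasets_by_corpus(file_datasets, file_weights, corpus_of):
    """Collapse per-file datasets into one concatenated dataset per corpus,
    so the blendable dataset is built over tens of corpora instead of
    thousands of files."""
    grouped = defaultdict(list)
    grouped_weights = defaultdict(float)
    for ds, weight in zip(file_datasets, file_weights):
        corpus = corpus_of(ds)             # e.g. derived from the file prefix
        grouped[corpus].append(ds)
        grouped_weights[corpus] += weight  # corpus weight = sum of file weights

    corpora = sorted(grouped)
    datasets = [ConcatDataset(grouped[c]) for c in corpora]
    weights = [grouped_weights[c] for c in corpora]
    return datasets, weights
```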
Performance Improvement
Overall, the new version of the data loader achieves a 20x speedup for loading 2,000,000,000,000 (2T) tokens (Dolma v1.7).
The experiments were done on Sunspot, from 1 node to 16 nodes. We see consistent performance improvements of up to 20x.
My version 1 (https://github.com/argonne-lcf/Megatron-DeepSpeed/tree/distributed_loading) gets the green bars down but not the yellow bars. My version 2 (https://github.com/argonne-lcf/Megatron-DeepSpeed/tree/distributed_loading_v2) gets the yellow bars down by 100x by grouping all the datasets that belong to the same corpus together.
The performance evaluation is shown here: md_distributed_dataloader.pdf
Changes needed