bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

OOM on preprocessing dataset with large number of documents #34

Open RaymondLi0 opened 1 year ago

RaymondLi0 commented 1 year ago

When processing a dataset of 55 GB with 31M samples, preprocessing runs out of memory on a machine with 1.5 TB of memory.

The error happens when saving the index. Other, larger datasets caused no issue, but this dataset is the one with the most documents.

Traceback (most recent call last):
  File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
    main()
  File "Megatron-LM/tools/preprocess_data.py", line 224, in main
    builders[key].finalize(output_idx_files[key])
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
    index.write(self._sizes, self._doc_idx)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
    pointers = self._get_pointers(sizes)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
    pointers.append(address)
MemoryError
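The traceback shows `_get_pointers` growing a Python list with one entry per document (`pointers.append(address)`), so with 31M documents the per-object overhead of Python ints dominates. A minimal sketch of a vectorized alternative, assuming the pointers are simply the cumulative byte offsets of each document (derived from the traceback, not from the actual indexed_dataset.py source):

```python
import numpy as np

def pointers_vectorized(sizes, itemsize):
    """Compute per-document byte offsets without a Python list.

    sizes: number of tokens in each document.
    itemsize: bytes per token (e.g. 2 for int16 token ids).
    Each pointer is the cumulative size of all preceding documents,
    so pointer[0] == 0 and pointer[i] == pointer[i-1] + sizes[i-1] * itemsize.
    """
    sizes_np = np.asarray(sizes, dtype=np.int64)
    # Exclusive prefix sum: shift the cumulative sum right by one slot.
    offsets = np.concatenate(([np.int64(0)], np.cumsum(sizes_np[:-1])))
    return offsets * itemsize
```

A NumPy int64 array uses 8 bytes per entry versus roughly 28+ bytes per Python int plus 8 bytes of list slot, so for 31M documents this cuts the working set by an order of magnitude; whether it fits the actual `write` path depends on the real index format.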

The workaround for now is to first shard the dataset and tokenize each shard independently. At training time, the shards can be blended together.
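The sharding step can be sketched as follows. This is a hypothetical helper, not part of Megatron-LM: it splits a JSON-lines input round-robin into `num_shards` files, each of which can then be fed to `tools/preprocess_data.py` separately (the file naming is an assumption):

```python
def shard_jsonl(input_path, num_shards, output_prefix):
    """Split a JSON-lines dataset into num_shards smaller files.

    Lines are distributed round-robin, so shard sizes differ by at
    most one document and each shard stays well under the index-size
    limit that triggers the MemoryError during preprocessing.
    """
    outs = [open(f"{output_prefix}_{i:02d}.jsonl", "w") for i in range(num_shards)]
    try:
        with open(input_path) as f:
            for i, line in enumerate(f):
                outs[i % num_shards].write(line)
    finally:
        for o in outs:
            o.close()
```

Each shard is then preprocessed independently; at training time the resulting indexed datasets can be listed together in the data-path argument so they are sampled as one blended corpus.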