When preprocessing a 55 GB dataset of 31M samples, the job runs out of memory on a machine with 1.5 TB of RAM.
The error happens when saving the index. Other, larger datasets preprocessed without issue, but this dataset is the one with the most documents.
Traceback (most recent call last):
  File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
    main()
  File "Megatron-LM/tools/preprocess_data.py", line 224, in main
    builders[key].finalize(output_idx_files[key])
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
    index.write(self._sizes, self._doc_idx)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
    pointers = self._get_pointers(sizes)
  File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
    pointers.append(address)
MemoryError
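The traceback points at the index writer rather than at tokenization itself: `_get_pointers` appends one byte offset per stored document, so its memory use in this step grows with the document count, not the raw dataset size, which is consistent with the dataset that has the most documents being the one that fails. The snippet below is a rough paraphrase of that loop reconstructed from the traceback for illustration, not the exact library code:

```python
def _get_pointers(sizes, itemsize):
    # Rough paraphrase (reconstructed from the traceback) of the helper in
    # megatron/data/indexed_dataset.py where the MemoryError is raised.
    # One Python int is appended per document, so the size of `pointers`
    # scales with the number of documents in the dataset.
    pointers = []
    address = 0
    for size in sizes:
        pointers.append(address)    # line 363 in the traceback
        address += size * itemsize
    return pointers
```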
The workaround for now is to first shard the dataset and tokenize each shard independently; at training time, the shards can be blended back together (see the sketch below).
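A minimal sketch of that workaround, assuming the raw corpus is a single JSONL file. All file names, the shard count, and the tokenizer settings are placeholders, and the exact preprocess_data.py flags (e.g. --vocab-file vs. --vocab) vary between Megatron-LM versions, so check them against `python tools/preprocess_data.py --help` before running:

```python
import subprocess

# Placeholder paths and settings -- adjust to the actual corpus and tokenizer.
INPUT_JSONL = "corpus.jsonl"
NUM_SHARDS = 8
VOCAB_FILE = "gpt2-vocab.json"
MERGE_FILE = "gpt2-merges.txt"

# 1) Split the raw JSONL into NUM_SHARDS smaller files, round-robin by line.
shards = [open(f"corpus_shard_{i}.jsonl", "w") for i in range(NUM_SHARDS)]
with open(INPUT_JSONL) as src:
    for lineno, line in enumerate(src):
        shards[lineno % NUM_SHARDS].write(line)
for shard in shards:
    shard.close()

# 2) Tokenize each shard independently; each run writes its own, much smaller
#    .bin/.idx pair, so the index writer never has to hold offsets for all
#    31M documents at once. Flag names may differ by Megatron-LM version.
for i in range(NUM_SHARDS):
    subprocess.run(
        [
            "python", "Megatron-LM/tools/preprocess_data.py",
            "--input", f"corpus_shard_{i}.jsonl",
            "--output-prefix", f"corpus_shard_{i}",
            "--tokenizer-type", "GPT2BPETokenizer",
            "--vocab-file", VOCAB_FILE,
            "--merge-file", MERGE_FILE,
            "--dataset-impl", "mmap",
            "--workers", "16",
            "--append-eod",
        ],
        check=True,
    )
```

At training time, the resulting shard prefixes can then be listed together (optionally with weights) under --data-path, which Megatron blends into a single dataset.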