NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start
Other

[BUG] Preprocess_data.py does not finalize all keys #852

Closed: zainsarwar865 closed this issue 3 weeks ago

zainsarwar865 commented 3 weeks ago

In tools/preprocess_data.py, the function process_json_files reads JSON files, encodes them, and saves them in .bin/.idx format. Although it reads every JSON key and creates an MMapIndexedDatasetBuilder for each key, it does not call the finalize function for every key, because the for loop over the keys is missing. This prevents the index from being created.
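
To make the missing step concrete, here is a minimal, self-contained sketch of the pattern. It does not use Megatron-LM code: ToyBuilder, write_datasets, and demo are hypothetical stand-ins, and the file layout is only an assumption for illustration. The point is that one builder exists per key, so finalize has to be called once per key as well.

```python
import os
import tempfile


class ToyBuilder:
    """Hypothetical stand-in for an indexed dataset builder (not Megatron's class)."""

    def __init__(self, bin_path):
        self.sizes = []
        self._bin = open(bin_path, "wb")

    def add_item(self, token_ids):
        # Append the raw item to the .bin file and remember its length.
        self._bin.write(bytes(token_ids))
        self.sizes.append(len(token_ids))

    def finalize(self, idx_path):
        # Close the .bin file and write the index; without this call no .idx exists.
        self._bin.close()
        with open(idx_path, "w") as idx:
            idx.write("\n".join(str(s) for s in self.sizes))


def write_datasets(docs, json_keys, out_dir):
    # One builder per JSON key, as the issue describes.
    builders = {
        key: ToyBuilder(os.path.join(out_dir, f"{key}.bin")) for key in json_keys
    }
    for doc in docs:
        for key in json_keys:
            # Toy "encoding": UTF-8 bytes instead of real token ids.
            builders[key].add_item(list(doc[key].encode("utf-8")))
    # The loop the issue reports as missing: finalize every key, not just one.
    for key in json_keys:
        builders[key].finalize(os.path.join(out_dir, f"{key}.idx"))
    return sorted(os.listdir(out_dir))


def demo():
    docs = [{"text": "hello", "title": "a"}, {"text": "world", "title": "b"}]
    with tempfile.TemporaryDirectory() as out_dir:
        return write_datasets(docs, ["text", "title"], out_dir)
```

In this sketch, calling demo() lists one .bin and one .idx file for each key. If the final loop is dropped, or finalize is called for a single key only, the other keys end up with a .bin file but no index.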