NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Precision issue while having more than `2.6T` documents in a bin and idx file due to `int32_t` usage and assertion in `gpt_dataset` line 427. #1056

Status: Open · harishankar-gopalan opened this issue 2 weeks ago

harishankar-gopalan commented 2 weeks ago

Describe the bug: The number of documents and samples within one indexed dataset .bin/.idx pair is limited by the int32 range.

We created one large .bin and .idx file and used it for a training run. The dataset was loaded through blended_megatron_dataset_builder, backed by the mid-level gpt_dataset, which in turn is backed by the low-level indexed_dataset.

The issue originates in the following lines and propagates through the code paths listed below (a minimal sketch of the overflow follows the list):

  1. blended_megatron_dataset_builder._build_megatron_dataset_splits
  2. gpt_dataset._build_document_index
  3. gpt_dataset._build_document_sample_shuffle_indices - line:427
  4. gpt_dataset._build_document_sample_shuffle_indices - line:440
  5. helpers.cpp#build_sample_idx
  6. Another observation about helpers.cpp: in core_v0.7.0, a few other parts of the method mentioned in the previous point also use int32_t.
  7. Although those occurrences of int32_t in helpers.cpp appear to have been changed to int64_t in core_r0.8.0, that alone is not sufficient to fix the precision problem when using one large .bin and .idx file.
  8. I can contribute a fix for the blended_dataset, gpt_dataset, and indexed_dataset parts, and I have already started working on it. However, blended_dataset and indexed_dataset are also used by other mid-level datasets, such as the BERT-based ones. I am not sure how well I can cover those, since I have not used them within NeMo and do not know how they work in full, so I might need some insider help there.
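As a rough illustration (plain NumPy, not Megatron's actual code, with made-up numbers), the sketch below shows the regime the report describes: once the document count in a single .bin/.idx pair exceeds what int32 can address, every index array built with a 32-bit dtype is no longer able to represent all positions.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2_147_483_647

# Hypothetical count for one very large .bin/.idx pair (illustrative only).
num_documents = 2_600_000_000  # more documents than int32 can address

if num_documents > INT32_MAX:
    # In this regime, every index array built with dtype=np.int32
    # (document index, sample index, shuffle index) can no longer address
    # all documents, so values wrap around to negatives downstream.
    print(f"{num_documents:,} documents exceed the int32 limit of {INT32_MAX:,}")
```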

To Reproduce: Create one large .bin/.idx file whose document count exceeds the int32 range.

Expected behavior: Data loading should work even if one .bin/.idx file contains more documents than the int32 range can represent.

Stack trace/logs: N/A. The overflow shows up as negative indices, which in turn cause attempts to load absurdly large documents; it is hard to even recognize that the root cause is an int32 range issue.
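For concreteness, here is a small sketch (plain NumPy, not the Megatron code path) of how the wrap-around turns a valid 64-bit position into a negative int32 index, which is exactly the "negative indices / huge documents" symptom described above:

```python
import numpy as np

# A document/sample position that only fits in 64 bits.
true_position = np.int64(2_200_000_123)

# Storing it in an int32 index array silently wraps it (C-style cast).
wrapped = true_position.astype(np.int32)
print(true_position, "->", wrapped)  # 2200000123 -> -2094967173

# Because NumPy and Python treat negative values as valid (reverse) indices,
# the loader does not fail immediately; it just reads the wrong offsets,
# which is why the root cause is hard to diagnose from the logs alone.
```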

Environment (please complete the following information):

Proposed fix: Based on the number of samples in the .bin/.idx, either int32_t or int64_t should be chosen automatically. I have created a proposed patch here.
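The idea can be sketched as follows; this is only an outline of the dtype-selection logic with a hypothetical helper name, not the actual patch (which also has to touch indexed_dataset, gpt_dataset, and helpers.cpp):

```python
import numpy as np

def index_dtype_for(count: int) -> np.dtype:
    """Pick the narrowest integer dtype that can address `count` entries.

    Hypothetical helper illustrating the proposed behaviour: stay on int32
    for small datasets, switch to int64 once the count exceeds the range.
    """
    return np.dtype(np.int64) if count > np.iinfo(np.int32).max else np.dtype(np.int32)

# Example: a .bin/.idx pair with more documents than int32 can represent.
num_documents = 2_600_000_000
dtype = index_dtype_for(num_documents)
print(dtype)  # int64

# The (illustrative) document index would then be allocated with that dtype.
document_index = np.zeros(8, dtype=dtype)  # tiny placeholder, not the real index
```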

jkamalu commented 1 week ago

Thanks for opening this bug. Have you been able to test your proposed fix, @harishankar-gopalan?

harishankar-gopalan commented 1 week ago

Hi John,

Yes, we have run a training with the proposed fix. I can re-test if required and keep you posted here. My larger point, though, is that the fix does not address a possibly similar issue in BertDataset and the other datasets that are used as mid-level datasets inside BlendedDataset.