NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Precision issue while having more than `2.6T` documents in a bin and idx file due to `int32_t` usage and assertion in `gpt_dataset` line 427. #1056

Status: Open · harishankar-gopalan opened this issue 2 weeks ago

harishankar-gopalan commented 2 weeks ago

Describe the bug: The number of documents and samples within one indexed dataset .bin/.idx pair is limited by the int32 range.

We created one large .bin and .idx file and used it for a training run. The dataset was loaded through blended_megatron_dataset_builder, backed by the mid-level gpt_dataset, which in turn is backed by the low-level indexed_dataset.

The issue originates in the following lines and propagates through the code paths listed below (a minimal sketch of the overflow follows the list):

  1. blended_megatron_dataset_builder._build_megatron_dataset_splits
  2. gpt_dataset._build_document_index
  3. gpt_dataset._build_document_sample_shuffle_indices - line:427
  4. gpt_dataset._build_document_sample_shuffle_indices - line:440
  5. helpers.cpp#build_sample_idx
  6. Another observation about helpers.cpp: in core_v0.7.0, a few other parts of the method mentioned in the previous point also use int32_t.
  7. Although those occurrences of int32_t in helpers.cpp appear to have been changed to int64_t in core_r0.8.0, that alone is not sufficient to fix the precision problem when using one large .bin and .idx file.
  8. I can contribute a fix for the blended_dataset, gpt_dataset, and indexed_dataset parts, and I have already started working on it. However, blended_dataset and indexed_dataset are also used by other mid-level datasets, such as the BERT-based ones. I am not sure how well I can cover those, since I have not used them within NeMo and do not know how they work in full, so I might need some insider help there.
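As a rough illustration (plain NumPy, not Megatron's actual code, with made-up numbers), the sketch below shows the regime the report describes: once the document count in a single .bin/.idx pair exceeds what int32 can address, every index array built with a 32-bit dtype is no longer able to represent all positions.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2_147_483_647

# Hypothetical count for one very large .bin/.idx pair (illustrative only).
num_documents = 2_600_000_000  # more documents than int32 can address

if num_documents > INT32_MAX:
    # In this regime, every index array built with dtype=np.int32
    # (document index, sample index, shuffle index) can no longer address
    # all documents, so values wrap around to negatives downstream.
    print(f"{num_documents:,} documents exceed the int32 limit of {INT32_MAX:,}")
```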

To Reproduce: Create one large .bin/.idx file whose document count exceeds the int32 range.

Expected behavior: Data loading should work even if one .bin/.idx file contains more documents than the int32 range can represent.

Stack trace/logs: N/A. The overflow shows up as negative indices, which in turn cause attempts to load absurdly large documents; it is hard to even recognize that the root cause is an int32 range issue.
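For concreteness, here is a small sketch (plain NumPy, not the Megatron code path) of how the wrap-around turns a valid 64-bit position into a negative int32 index, which is exactly the "negative indices / huge documents" symptom described above:

```python
import numpy as np

# A document/sample position that only fits in 64 bits.
true_position = np.int64(2_200_000_123)

# Storing it in an int32 index array silently wraps it (C-style cast).
wrapped = true_position.astype(np.int32)
print(true_position, "->", wrapped)  # 2200000123 -> -2094967173

# Because NumPy and Python treat negative values as valid (reverse) indices,
# the loader does not fail immediately; it just reads the wrong offsets,
# which is why the root cause is hard to diagnose from the logs alone.
```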

Environment (please complete the following information):

Proposed fix: Based on the number of samples in the .bin/.idx, either int32_t or int64_t should be chosen automatically. I have created a proposed patch here.
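The idea can be sketched as follows; this is only an outline of the dtype-selection logic with a hypothetical helper name, not the actual patch (which also has to touch indexed_dataset, gpt_dataset, and helpers.cpp):

```python
import numpy as np

def index_dtype_for(count: int) -> np.dtype:
    """Pick the narrowest integer dtype that can address `count` entries.

    Hypothetical helper illustrating the proposed behaviour: stay on int32
    for small datasets, switch to int64 once the count exceeds the range.
    """
    return np.dtype(np.int64) if count > np.iinfo(np.int32).max else np.dtype(np.int32)

# Example: a .bin/.idx pair with more documents than int32 can represent.
num_documents = 2_600_000_000
dtype = index_dtype_for(num_documents)
print(dtype)  # int64

# The (illustrative) document index would then be allocated with that dtype.
document_index = np.zeros(8, dtype=dtype)  # tiny placeholder, not the real index
```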

jkamalu commented 1 week ago

Thanks for opening this bug. Have you been able to test your proposed fix, @harishankar-gopalan?

harishankar-gopalan commented 1 week ago

Hi John,

Yes, we have run a training with the proposed fix. I can re-test if required and keep you posted here. My larger point, though, is that the fix does not address a possibly similar issue in BertDataset and the other datasets that are used as mid-level datasets inside BlendedDataset.