Open harishankar-gopalan opened 2 weeks ago
Thanks for opening this bug. Have you been able to test your proposed fix @harishankar-gopalan ?
Hi John,
Yes we have run a training using the proposed fix. I can run a re test if required and keep you posted here. But my larger point is the fix does not address a possible similar issue in BertDataset and other datasets that are used as a mid level dataset inside BlendedDataset.
Describe the bug Int32 limit for number of documents and samples within one indexed dataset bin and idx.
We created one large bin and idx file and used the same for a training run. The dataset, was loaded through blended_megatron_dataset_builder backed by the mid level gpt_dataset which in turn is backed by the low level indexed_dataset.
The exact issue starts from these lines and propagates through all the below mentioned code path:
core_v0.7.0
there are few other parts of the method mentioned in the previous point that use int32_t, for eg:int32_t
inhelpers.cpp
seem to be changed toint64_t
incore_r0.8.0
, is not sufficient to fix the whole precision problem while using one large.bin
and.idx
files.blended_dataset
,gpt_dataset
andindexed_dataset
part of it and even started working on it, but I can see that there are quite few more usages ofblended_dataset
andindexed_dataset
in other mid-level datasets like Bert based datasets. I am not sure how well I would be able to take those into consideration as I have not used those datasets within NeMo and do not know the full working of those datasets and might need some insider help on those.To Reproduce Create one large bin/idx file which can create number of documents larger than int32 range.
Expected behavior Data loading should work fine even if one bin/idx file contains documents count more than int32 range.
Stack trace/logs N/A, we will see weird behavior where we get negative indices and that leads to very large documents being tried to be loaded. It becomes very complicated to even understand that the issue is int32 range issue.
Environment (please complete the following information):
core_v0.7.0
andcore_v0.8.0
Proposed fix Based on the number of samples in the bin/idx either in32_t or in64_t should be automatically chosen. I have created a proposed patch here.