⚠️ This discussion is based on the old version of Megatron; the problem still exists in the new version.
1. Background
Megatron's data blending algorithm is based on a function in helpers.cpp: build_blending_indices. This function IS NOT ROBUST, and users can construct extreme test cases that cause bugs in Megatron's training.
Specifically, in the __getitem__ method of the BlendableDataset class, sample_idx >= len(self.datasets[dataset_idx]) can occur (because there is no limit on the growth of current_samples), causing the index to go out of bounds.
Consider two identical datasets being blended with extreme weights, one approaching 0 and the other approaching 1, such as 0.001 and 0.999. In this case, according to the algorithm in build_blending_indices, the dataset with the smaller weight will be sampled for more than one epoch, which causes an IndexError.
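For reference, the selection logic in build_blending_indices works roughly as follows (a simplified Python rendering of the C++ loop for illustration, not the exact code):

import numpy as np

def build_blending_indices_sketch(weights, size):
    # For every blended sample, pick the dataset whose sampled fraction
    # lags furthest behind its target weight.
    num_datasets = len(weights)
    dataset_index = np.zeros(size, dtype=np.uint8)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    current_samples = [0] * num_datasets

    for idx in range(size):
        errors = [weights[i] * (idx + 1) - current_samples[i]
                  for i in range(num_datasets)]
        chosen = int(np.argmax(errors))
        dataset_index[idx] = chosen
        # current_samples is never wrapped by the length of the chosen dataset,
        # so dataset_sample_index can grow past len(datasets[chosen]).
        dataset_sample_index[idx] = current_samples[chosen]
        current_samples[chosen] += 1

    return dataset_index, dataset_sample_index

The important point is that current_samples only ever increases, so the sample_idx written into dataset_sample_index is not bounded by the size of the dataset it refers to.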
2. How to trigger this bug?
Assume that we have generated the corresponding binary files: data_text_document.bin and data_text_document.idx. The files do not need to be too large; a small dataset is enough. Use the following code to examine this dataset:
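A minimal sketch of such an inspection script, assuming the MMapIndexedDataset loader from megatron/data/indexed_dataset.py (the import path and class name may differ between Megatron versions):

from megatron.data.indexed_dataset import MMapIndexedDataset

# Load the .bin/.idx pair by its common prefix and report the document count.
dataset = MMapIndexedDataset("data_text_document")
print("number of documents:", len(dataset))
print("tokens in the first document:", len(dataset[0]))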
The output shows that this dataset contains a total of 100 documents.
Now blend two identical copies of this dataset: set the --train-data-path parameter in the training script with the extreme weights from above, add a few print statements to BlendableDataset so that the dataset lengths and the generated sample_idx values can be observed, and run the training script (sketches of the flag setting and the print statements are given below). In the printed output it is obvious that the size of the first dataset is only 1060, but sample_idx has exceeded this value in several places (1161, 1162, 1163), which will inevitably cause the index to go out of bounds.
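A possible --train-data-path setting for this experiment, assuming the usual weight/prefix pairing of Megatron's data arguments (the prefix is the common name of the .bin/.idx pair):

--train-data-path 0.001 data_text_document 0.999 data_text_document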
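And a sketch of the debug prints, placed in BlendableDataset.__init__ right after the index arrays have been built (the attribute names follow the __getitem__ snippet below; the exact placement is an assumption):

# Length of each underlying dataset.
for i, d in enumerate(self.datasets):
    print(f"dataset {i}: len = {len(d)}")

# Largest sample_idx generated for each dataset; any value >= len(dataset)
# means a later __getitem__ call will index out of bounds.
for i in range(len(self.datasets)):
    mask = self.dataset_index == i
    print(f"dataset {i}: max sample_idx = {self.dataset_sample_index[mask].max()}")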
If we modify __getitem__ so that it always takes the last sample of the BlendableDataset, the bug can be triggered stably:
def __getitem__(self, idx):
    idx = self.size - 1  # Always take the last sample of BlendableDataset
    dataset_idx = self.dataset_index[idx]
    sample_idx = self.dataset_sample_index[idx]
    return {
        "dataset_idx": dataset_idx,
        **self.datasets[dataset_idx][sample_idx],  # "text": ndarray
    }
3. How to fix?
Fixing this bug is not difficult: simply take sample_idx modulo the length of the corresponding dataset.
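A minimal sketch of this fix in BlendableDataset.__getitem__, based on the attribute names used in the snippet above (the exact code in Megatron may differ):

def __getitem__(self, idx):
    dataset_idx = self.dataset_index[idx]
    sample_idx = self.dataset_sample_index[idx]
    # Wrap around instead of indexing past the end of the chosen dataset.
    sample_idx = sample_idx % len(self.datasets[dataset_idx])
    return self.datasets[dataset_idx][sample_idx]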