Describe the bug
The provided example script run_simple_mcore_train_loop.py throws errors in Step 3: GPT Mock dataset setup utility.
To Reproduce
For simplicity, the example is run with a single GPU with tensor_model_parallel_size=1 and pipeline_model_parallel_size=1.
srun python -u run_simple_mcore_train_loop.py
Stack trace/logs
[rank0]: Traceback (most recent call last):
[rank0]: File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 115, in <module>
[rank0]: train_iterator = get_train_data_iterator()
[rank0]: File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 55, in get_train_data_iterator
[rank0]: config = GPTDatasetConfig(
[rank0]: File "<string>", line 18, in __init__
[rank0]: File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 52, in __post_init__
[rank0]: super().__post_init__()
[rank0]: File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_config.py", line 87, in __post_init__
[rank0]: assert self.split is not None, "split must be provided in absence of blend_per_split"
[rank0]: AssertionError: split must be provided in absence of blend_per_split
Environment (please complete the following information):
Megatron-LM commit ID: a5534c8
PyTorch version: 2.4.0a0+gitd957c2d
CUDA version: 12.2
NCCL version: 2.19.4
Proposed fix
N/A
Additional context
After applying a temporary fix in get_train_data_iterator that sets split='1' on the GPT config, further errors are thrown when constructing a MockGPTDataset object. Additionally, this GPT config refers to a dummy tokenizer which is missing.
Any assistance with resolving this issue would be appreciated, thank you!
6c7bec6 partially fixes this issue. For the example to work, a split argument must be passed to GPTDatasetConfig (inherited from BlendedMegatronDatasetConfig) whenever blend is None.
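For reference, a minimal sketch of the workaround described above. The split field is the one the failing assertion checks; the other field names and values are assumptions based on the example script and may differ from the actual GPTDatasetConfig signature at this commit:

```python
# Hypothetical sketch of the workaround, not the upstream fix itself.
# Only `split` is confirmed by the assertion in blended_megatron_dataset_config.py;
# the remaining fields are assumptions modeled on the example script.
from megatron.core.datasets.gpt_dataset import GPTDatasetConfig

config = GPTDatasetConfig(
    random_seed=0,
    sequence_length=64,
    blend=None,          # no real data blend -> mock dataset path
    split="990,8,2",     # required when neither blend nor blend_per_split is given
    reset_position_ids=False,
    reset_attention_mask=False,
    eod_mask_loss=False,
)
```

With split provided, __post_init__ should pass the "split must be provided in absence of blend_per_split" assertion, though the subsequent MockGPTDataset and tokenizer errors noted above remain.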