NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Megatron Core example not working #855

Open schheda1 opened 3 weeks ago

schheda1 commented 3 weeks ago

Describe the bug

The provided example script run_simple_mcore_train_loop.py raises an AssertionError in Step 3 (GPT mock dataset setup utility).

To Reproduce

For simplicity, the example is run on a single GPU with tensor_model_parallel_size=1 and pipeline_model_parallel_size=1:

srun python -u run_simple_mcore_train_loop.py

Stack trace/logs

[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 115, in <module>
[rank0]:     train_iterator = get_train_data_iterator()
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 55, in get_train_data_iterator
[rank0]:     config = GPTDatasetConfig(
[rank0]:   File "<string>", line 18, in __init__
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 52, in __post_init__
[rank0]:     super().__post_init__()
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_config.py", line 87, in __post_init__
[rank0]:     assert self.split is not None, "split must be provided in absence of blend_per_split"
[rank0]: AssertionError: split must be provided in absence of blend_per_split

Environment (please complete the following information):

Proposed fix

N/A

Additional context

After applying a temporary fix in get_train_data_iterator (passing split='1' to the GPT dataset config), other errors are thrown when creating a MockGPTDataset object. In addition, this GPT config refers to a dummy tokenizer that is missing.

Any assistance with resolving this issue would be appreciated, thank you!

Qinghao-Hu commented 3 weeks ago

I am running into this problem as well.

windprak commented 2 weeks ago

same

schheda1 commented 1 week ago

Commit 6c7bec6 partially fixes this issue. To make the example work, a split argument must still be passed to GPTDatasetConfig (the field comes from BlendedMegatronDatasetConfig) when blend is None.
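
For anyone who wants to see the failure mode without a Megatron install, here is a minimal sketch. StandInDatasetConfig and try_build are hypothetical names, not the real Megatron classes. The only part taken from Megatron is the assertion and its message, copied from the traceback above. The field list is an assumption and will not match the real config in every version.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StandInDatasetConfig:
    """Hypothetical stand-in for the dataset config; NOT the Megatron class."""

    sequence_length: int
    blend: Optional[List[str]] = None
    blend_per_split: Optional[List[Optional[List[str]]]] = None
    split: Optional[str] = None

    def __post_init__(self) -> None:
        # Same check and message as the assertion shown in the traceback.
        if self.blend_per_split is None:
            assert self.split is not None, (
                "split must be provided in absence of blend_per_split"
            )


def try_build(**kwargs) -> str:
    """Build the stand-in config and report whether validation passed."""
    try:
        StandInDatasetConfig(**kwargs)
    except AssertionError as err:
        return f"AssertionError: {err}"
    return "ok"


# No split and no blend_per_split: reproduces the reported failure.
without_split = try_build(sequence_length=64)
# Passing split explicitly (the temporary fix from the report) passes the check.
with_split = try_build(sequence_length=64, split="1")

print(without_split)
print(with_split)
```

In the real script, the equivalent change would be to add a split keyword to the GPTDatasetConfig(...) call in get_train_data_iterator. The other arguments that call needs depend on the Megatron-LM version checked out.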