Open 1195343015 opened 3 months ago
Do you have any solution? I got the same error.
I think this bug is due to an inappropriate default `max_sequence_length` in `MockGPTLowLevelDataset`, which is used to generate the mock dataset.
https://github.com/NVIDIA/Megatron-LM/blob/732a689606810c02d0dc260a163c9ebac099c044/megatron/core/datasets/gpt_dataset.py#L693-L697
The default `max_sequence_length` is 4096. You can change it to 64, which matches the sequence length used in run_simple_mcore_train_loop.py:
https://github.com/NVIDIA/Megatron-LM/blob/732a689606810c02d0dc260a163c9ebac099c044/examples/run_simple_mcore_train_loop.py#L21
Hope this helps.
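A minimal sketch of the mismatch being described, using a stand-in class rather than the real `MockGPTLowLevelDataset` (the class name, `sample` method, and default value here are illustrative; only the default of 4096 and the example's sequence length of 64 come from the thread):

```python
_SEQUENCE_LENGTH = 64  # value used by run_simple_mcore_train_loop.py

class MockLowLevelDataset:
    """Illustrative stand-in for MockGPTLowLevelDataset, not the real API."""

    def __init__(self, max_sequence_length: int = 4096):
        # 4096 is the default reported in gpt_dataset.py L693-L697.
        self.max_sequence_length = max_sequence_length

    def sample(self) -> list:
        # Each mock sample is max_sequence_length tokens long.
        return list(range(self.max_sequence_length))

# With the default, sample length (4096) disagrees with the training
# loop's sequence length (64), so batch shapes do not line up.
default_ds = MockLowLevelDataset()
assert len(default_ds.sample()) != _SEQUENCE_LENGTH

# The workaround from the thread: make the two lengths agree.
fixed_ds = MockLowLevelDataset(max_sequence_length=_SEQUENCE_LENGTH)
assert len(fixed_ds.sample()) == _SEQUENCE_LENGTH
```

The point is only that the dataset's sequence length must match the one the training loop configures; hard-coding two different values in two files is why the error appears.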
@1195343015 Thanks a lot. It works for me. I hope this can be fixed more robustly.
Marking as stale. No activity in 60 days.
Describe the bug
https://github.com/NVIDIA/Megatron-LM/blob/01ca03f11e89f4f85682dcac647c2b913b25fcee/examples/run_simple_mcore_train_loop.py#L118
When I modified `tensor_model_parallel_size` in run_simple_mcore_train_loop.py from 2 to 1, the script failed with an error.

Stack trace/logs
Environment (please complete the following information):
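For context on why changing `tensor_model_parallel_size` can surface shape errors, here is a small illustrative sketch (plain Python, not Megatron-LM code; the helper name is hypothetical): under tensor parallelism each rank holds a slice of a weight matrix, so the tensor-parallel degree directly changes per-rank shapes.

```python
def per_rank_partition(hidden_size: int, tensor_model_parallel_size: int) -> int:
    """Columns of a weight matrix held by one rank under tensor parallelism.

    Hypothetical helper for illustration: the hidden dimension must divide
    evenly across the tensor-parallel ranks.
    """
    assert hidden_size % tensor_model_parallel_size == 0
    return hidden_size // tensor_model_parallel_size

# Changing the tensor-parallel degree from 2 to 1 changes per-rank shapes,
# which is the kind of mismatch that can turn into a runtime error.
print(per_rank_partition(64, 2))  # 32 columns per rank with TP=2
print(per_rank_partition(64, 1))  # 64 columns per rank with TP=1
```

This does not pinpoint the actual failure in the example script; it only shows why a tensor-parallel-size change is shape-affecting rather than cosmetic.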