NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] error raised while converting llm to megatron #992

Open KookHoiKim opened 3 months ago

KookHoiKim commented 3 months ago

Describe the bug I followed llama_mistral.md using the Mistral 7B model (I also tried a Llama model). However, it raises the error below.

WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float32 for parameters ...
Process Process-1:
Traceback (most recent call last):
  File "/workspace/code/Megatron-LM/tools/checkpoint/saver_mcore.py", line 449, in save_checkpoint
    validate_args(margs)
  File "/workspace/code/Megatron-LM/megatron/training/arguments.py", line 559, in validate_args
    raise RuntimeError('--use-dist-ckpt is not supported in legacy models.')
RuntimeError: --use-dist-ckpt is not supported in legacy models.
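For context, this first failure is a mutual-exclusion check in `validate_args`: distributed checkpointing is rejected whenever the legacy model path is active. A minimal sketch of that guard (simplified and illustrative; only the two flags named in the traceback are modeled, the real function validates many more arguments):

```python
# Illustrative sketch of the validate_args guard from the traceback.
# Only the two relevant flags are modeled; everything else is omitted.
from types import SimpleNamespace

def validate_args(args):
    # Distributed checkpointing is incompatible with the legacy model path.
    if args.use_dist_ckpt and args.use_legacy_models:
        raise RuntimeError('--use-dist-ckpt is not supported in legacy models.')

# The saver builds margs with both flags set, which triggers the error above.
margs = SimpleNamespace(use_dist_ckpt=True, use_legacy_models=True)
try:
    validate_args(margs)
except RuntimeError as e:
    print(e)
```

This is why the error appears even though the user never passed `--use-dist-ckpt` explicitly: the flag is set by the converter's defaults before validation runs.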

If I manually set `use_legacy_models = False`, another error occurs.

Traceback (most recent call last):
  File "/workspace/code/Megatron-LM/tools/checkpoint/saver_mcore.py", line 790, in save_checkpoint
    save_checkpoint(md.iteration, [get_local_model(pp_rank, ep_rank, tp_rank)], None, None, num_floating_point_operations_so_far=0,
  File "/workspace/code/Megatron-LM/megatron/training/checkpointing.py", line 396, in save_checkpoint
    save_strategy = FullyParallelSaveStrategyWrapper(save_strategy, mpu.get_data_parallel_group(with_context_parallel=True),
  File "/workspace/code/Megatron-LM/megatron/core/parallel_state.py", line 917, in get_data_parallel_group
    _DATA_PARALLEL_GROUP_WITH_CP is not None
AssertionError: data parallel group with context parallel combined is not initialized
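The second failure is an uninitialized-state assertion: `get_data_parallel_group(with_context_parallel=True)` expects the parallel-state module to have been initialized with a context-parallel data group, and flipping `use_legacy_models` by hand reaches the save path without that setup. A hedged sketch of the pattern (module-level group handles guarded by assertions; names mirror the traceback, but the real module builds the groups via `torch.distributed`):

```python
# Illustrative sketch of the parallel-state pattern behind the AssertionError.
# The real module creates these groups with torch.distributed process groups.
_DATA_PARALLEL_GROUP = None
_DATA_PARALLEL_GROUP_WITH_CP = None

def initialize_model_parallel(dp_group, dp_group_with_cp):
    """Record the data-parallel group handles (normally done at startup)."""
    global _DATA_PARALLEL_GROUP, _DATA_PARALLEL_GROUP_WITH_CP
    _DATA_PARALLEL_GROUP = dp_group
    _DATA_PARALLEL_GROUP_WITH_CP = dp_group_with_cp

def get_data_parallel_group(with_context_parallel=False):
    """Return the requested group, asserting it was initialized first."""
    if with_context_parallel:
        assert _DATA_PARALLEL_GROUP_WITH_CP is not None, \
            'data parallel group with context parallel combined is not initialized'
        return _DATA_PARALLEL_GROUP_WITH_CP
    assert _DATA_PARALLEL_GROUP is not None, \
        'data parallel group is not initialized'
    return _DATA_PARALLEL_GROUP
```

So the manual override does not fix the root cause; it only moves the failure from argument validation to checkpoint saving, where the context-parallel group was never created.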

FYI, I remember this issue first appeared after pulling a recent commit from the main branch. On commit 0b4c4cfced47cffad4cec8c4047986bfa60e7f10, the error does not occur.

github-actions[bot] commented 1 month ago

Marking as stale. No activity in 60 days.