Describe the bug
I followed llama_mistral.md with the Mistral 7B model (I also tried a Llama model). However, it raises the error below.
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float32 for parameters ...
Process Process-1:
Traceback (most recent call last):
File "/workspace/code/Megatron-LM/tools/checkpoint/saver_mcore.py", line 449, in save_checkpoint
validate_args(margs)
File "/workspace/code/Megatron-LM/megatron/training/arguments.py", line 559, in validate_args
raise RuntimeError('--use-dist-ckpt is not supported in legacy models.')
RuntimeError: --use-dist-ckpt is not supported in legacy models.
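From the traceback, the failure comes from validate_args in megatron/training/arguments.py. As far as I can tell it is effectively the check below (a sketch reconstructed from the error message, not the verbatim Megatron-LM source):

```python
# Sketch of the failing validation (my reconstruction from the traceback and
# error message above; the actual condition in arguments.py may differ).
def validate_args(args):
    # --use-dist-ckpt is rejected when the legacy (non-mcore) model path is used.
    if args.use_legacy_models and args.use_dist_ckpt:
        raise RuntimeError('--use-dist-ckpt is not supported in legacy models.')
    # ... remaining validation ...
```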
If I manually set use_legacy_models = False, another error occurs.
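What I changed is roughly the following; the exact placement of my local edit is an assumption on my part, the point is just to force the non-legacy (mcore) model path before argument validation runs in tools/checkpoint/saver_mcore.py:

```python
# Rough illustration of my manual workaround (exact edit location is an
# assumption; the call below mirrors the first traceback).
margs.use_legacy_models = False   # skip the legacy-model path
validate_args(margs)              # the --use-dist-ckpt error no longer fires
```

With that change, the save then fails here instead: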
Traceback (most recent call last):
File "/workspace/code/Megatron-LM/tools/checkpoint/saver_mcore.py", line 790, in save_checkpoint
save_checkpoint(md.iteration, [get_local_model(pp_rank, ep_rank, tp_rank)], None, None, num_floating_point_operations_so_far=0,
File "/workspace/code/Megatron-LM/megatron/training/checkpointing.py", line 396, in save_checkpoint
save_strategy = FullyParallelSaveStrategyWrapper(save_strategy, mpu.get_data_parallel_group(with_context_parallel=True),
File "/workspace/code/Megatron-LM/megatron/core/parallel_state.py", line 917, in get_data_parallel_group
_DATA_PARALLEL_GROUP_WITH_CP is not None
AssertionError: data parallel group with context parallel combined is not initialized
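For reference, the assertion that fails lives in megatron/core/parallel_state.py::get_data_parallel_group. Below is a minimal sketch of how I understand that path (an approximation, not the verbatim source): the module-level group is apparently never initialized in the converter process, so the with_context_parallel=True call coming from checkpointing.py trips the assert.

```python
# Sketch (approximation) of the failing path in parallel_state.py.
# _DATA_PARALLEL_GROUP_WITH_CP is a module-level global that, as far as I can
# tell, is only populated by initialize_model_parallel(); in the checkpoint
# converter it is still None, so the assert below fires.
_DATA_PARALLEL_GROUP_WITH_CP = None

def get_data_parallel_group(with_context_parallel=False):
    if with_context_parallel:
        assert _DATA_PARALLEL_GROUP_WITH_CP is not None, \
            'data parallel group with context parallel combined is not initialized'
        return _DATA_PARALLEL_GROUP_WITH_CP
    ...  # non-CP path omitted in this sketch
```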
FYI
I believe this issue started occurring after I pulled a recent commit of the main branch. On commit 0b4c4cfced47cffad4cec8c4047986bfa60e7f10 the error does not occur.