Open · LouChao98 opened this issue 2 months ago
Same here. This issue can be pretty serious, and needs to be fixed very soon.
I confirm this is a bug and your fix looks relevant, thanks :+1:
Please note the Zarr format is being deprecated and, in particular, does not play well with the DistributedOptimizer, so I suggest updating the checkpoint format to --dist-ckpt-format torch_dist.
Does this problem occur only when Tensor Parallelism (TP) > 1 and Data Parallelism (DP) > 1? Currently, I am using DistributedOptimizer with TP = 1 and DP > 1. Will storing checkpoints in Zarr format cause an issue?
With TP = 1 it might be an issue as well; please use --dist-ckpt-format torch_dist.
It is okay with TP = 1 and DP > 1 in my environment.
Marking as stale. No activity in 60 days.
Same issue with "--use-distributed-optimizer --ckpt-format torch". Megatron: core_r0.9.0, commit 1afee592e85ac7994887eb5f4ef3998f76384333.
@Jayoprell this one is not expected, can you elaborate on the symptoms? With --ckpt-format torch we don't do any file-based synchronization.
When using --use-distributed-optimizer with --ckpt-format torch, save_checkpoint has only DP rank 0 gather the optimizer state from all other ranks and save it to a file. So it should not have any synchronization problem, right?
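For reference, here is a rough sketch of that gather-to-DP-rank-0 save path (illustrative only, not the actual distrib_optimizer.py code; it assumes a single data-parallel process group and a backend that supports gather, e.g. gloo for CPU tensors):

```python
import torch
import torch.distributed as dist


def save_param_state_dp_zero(local_shard: torch.Tensor, filename: str) -> None:
    """Gather every DP rank's optimizer shard on rank 0 and write one file."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    gathered = (
        [torch.empty_like(local_shard) for _ in range(world_size)] if rank == 0 else None
    )
    dist.gather(local_shard, gather_list=gathered, dst=0)
    if rank == 0:
        # Only rank 0 ever touches the filesystem, so no file-level
        # synchronization between ranks is needed for this format.
        torch.save(torch.cat(gathered), filename)
```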
The error info:
================== tensor keys: dict_keys(['param']), dp rank: 0, optim_state:{}, main_param:tensor([0., 0., 0., ..., 0., 0., 0.], ) ================
Traceback (most recent call last):
  File "/workspace/Megatron-LM/pretrain_gpt.py", line 264, in <module>
    pretrain(
  File "/workspace/Megatron-LM/megatron/training/training.py", line 349, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/workspace/Megatron-LM/megatron/training/training.py", line 1366, in train
    save_checkpoint_and_time(iteration, model, optimizer,
  File "/workspace/Megatron-LM/megatron/training/training.py", line 1070, in save_checkpoint_and_time
    save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
  File "/workspace/Megatron-LM/megatron/training/checkpointing.py", line 380, in save_checkpoint
    optimizer.save_parameter_state(optim_checkpoint_name)
  File "/workspace/Megatron-LM/megatron/core/optimizer/distrib_optimizer.py", line 902, in save_parameter_state
    state_dict = self.get_parameter_state_dp_zero()
  File "/workspace/Megatron-LM/megatron/core/optimizer/distrib_optimizer.py", line 852, in get_parameter_state_dp_zero
    tensors[key].detach().cpu()
KeyError: 'exp_avg'
The tensors above hold the optimizer parameters, and it seems that the optimizer state is empty.
Describe the bug
When using a Zarr distributed checkpoint and a distributed optimizer, each rank writes optimizer states according to ShardedTensor's flattened_range. The Zarr strategy uses synchronizers to ensure the correctness of parallel writing. However, synchronizers are not set for ranks that create Zarr arrays. The current implementation only adds synchronizers on ranks that open existing Zarr arrays. Consequently, the writing on the creating ranks may be lost, resulting in all zeros at the corresponding slices in the file.
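For illustration, here is a toy sketch of that asymmetry (zarr v2 API; the paths, shapes, and chunking below are made up and this is not the Megatron-LM strategy code):

```python
import numpy as np
import zarr

sync = zarr.ProcessSynchronizer("param.sync")

# Creating process: in the buggy path, no synchronizer is attached here.
creator = zarr.open_array(
    "param.zarr", mode="w", shape=(1000,), chunks=(600,), dtype="float32", fill_value=0
)

# Processes that open the existing array DO get a synchronizer.
writer = zarr.open_array("param.zarr", mode="r+", synchronizer=sync)

# In the real run each of these writes happens on a different rank. Both
# slices touch chunk 0 (which spans [0, 600)), so each write is a
# read-modify-write of that chunk. Without a lock on the creating process,
# one of the two partial writes can be lost, leaving zeros in the file.
creator[0:500] = np.ones(500, dtype="float32")
writer[500:1000] = np.ones(500, dtype="float32")
```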
To Reproduce
Run pretrain_gpt.py with DP > 1, TP > 1, and arguments enabling the Zarr distributed checkpoint format together with the distributed optimizer. Then a toy round-trip test inserted after dist_checkpointing.save in the following block may not pass (a hedged sketch of such a test is given below): https://github.com/NVIDIA/Megatron-LM/blob/86e2927edaa977f3e859d6f4b6d38a236114fd38/megatron/training/checkpointing.py#L405-L407
Adding a barrier under the following line and using a larger DP size increases the probability of reproducing the failure: https://github.com/NVIDIA/Megatron-LM/blob/86e2927edaa977f3e859d6f4b6d38a236114fd38/megatron/core/dist_checkpointing/strategies/zarr.py#L64
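A hedged sketch of such a toy test, to be pasted right after the existing dist_checkpointing.save(...) call; it assumes the local variables state_dict and checkpoint_name match the surrounding code:

```python
import copy

import torch
from megatron.core import dist_checkpointing

torch.distributed.barrier()  # make sure every rank has finished writing


def _assert_round_trip(expected, loaded, path=""):
    """Recursively compare what this rank wrote with what it reads back."""
    if isinstance(expected, dict):
        for k in expected:
            _assert_round_trip(expected[k], loaded[k], f"{path}/{k}")
    elif torch.is_tensor(loaded):
        ref = expected.data if hasattr(expected, "data") else expected
        assert torch.equal(loaded.cpu(), ref.cpu()), f"checkpoint mismatch at {path}"


# load() consumes a sharded state dict describing what to read back, so pass a copy.
reloaded = dist_checkpointing.load(copy.deepcopy(state_dict), checkpoint_name)
_assert_round_trip(state_dict, reloaded)
```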
Expected behavior
All tensors should be written to the disk.
Proposed fix
Set synchronizers when creating Zarr arrays, mirroring the logic used when opening existing Zarr arrays; a hedged sketch of the change is given below.
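A minimal sketch of what that could look like (zarr v2 API; the function and attribute names below are illustrative assumptions, not the exact code in zarr.py):

```python
from pathlib import Path

import zarr


def _open_or_create_array(checkpoint_dir: Path, key: str, sharded_tensor, create: bool):
    synchronizer = None
    if sharded_tensor.flattened_range is not None:
        # Partial (flattened_range) writes from several ranks can touch the
        # same chunk, so every writer needs file-based locking.
        synchronizer = zarr.ProcessSynchronizer(str(checkpoint_dir / f"{key}.sync"))

    store = str(checkpoint_dir / key)
    if create:
        return zarr.create(
            shape=sharded_tensor.global_shape,
            chunks=sharded_tensor.data.shape,
            dtype=sharded_tensor.dtype,
            store=store,
            fill_value=None,
            write_empty_chunks=True,
            synchronizer=synchronizer,  # previously only set on the open path below
        )
    return zarr.open(store, mode="r+", synchronizer=synchronizer)
```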