THUDM / CogCoM


Gradient computed twice for this partition #26

Open · terryII opened 2 months ago

terryII commented 2 months ago

Hardware: 4×A100 (80GB)

While fine-tuning on the official com_dataset, the following error occurred:

```
Traceback (most recent call last):
  File "/home/lyk/project/CogCoM/cogcom/finetune.py", line 324, in <module>
    model = training_main(args, model_cls=model,
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
    iteration, skipped = train(model, optimizer,
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 349, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 471, in train_step
    backward_step(optimizer, model, lm_loss, args, timers)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 507, in backward_step
    model.backward(loss)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2056, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 701, in backward
    torch.autograd.backward(output_tensors, grad_tensors)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 903, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1416, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 939, in reduce_independent_p_g_buckets_and_remove_grads
    assert self.params_already_reduced[param_id] == False, \
AssertionError: The parameter 67 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6182 [0] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6169 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[2024-07-24 17:27:09,448] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5534
iZ6we1raky4t814hj7bojjZ:5535:6180 [1] NCCL INFO [Service thread] Connection closed by localRank 0
iZ6we1raky4t814hj7bojjZ:5535:6167 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[2024-07-24 17:27:12,368] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5535
iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 1
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 1
[2024-07-24 17:27:15,225] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5536
[2024-07-24 17:27:18,082] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5537
[2024-07-24 17:27:18,082] [ERROR] [launch.py:325:sigkill_handler] ['/home/lyk/anaconda3/envs/llm/bin/python', '-u', '/home/lyk/project/CogCoM/cogcom/finetune.py', '--local_rank=3', '--experiment-name', 'finetune-/data/llms/models/cogcom/cogcom-chat-17b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '8000', '--resume-dataloader', '--from_pretrained', '/data/llms/models/cogcom/cogcom-chat-17b', '--max_source_length', '1225', '--max_target_length', '823', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/data/llms/models/cogcom/vicuna-7b-v1.5', '--version', 'chat', '--train-data', '/data/llms/datasets/cogcom/processed/save/com_offical_0724#CoM', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '4000', '--eval-interval', '4000', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', '/home/lyk/project/CogCoM/cogcom/test_config_bf16_zero1off.json', '--skip-init', '--iterable-dataset', '--seed', '2024'] exits with return code = 1
```

While debugging, I found that after the `crop_and_zoomin` operation the forward pass runs twice (turn_id covers rounds 0 and 1), the two losses are summed, and the single backward then computes gradients twice for the same parameters, which trips the duplicate-reduction assertion above. How should this be resolved? A sketch of the suspected pattern and a possible workaround is below. @qijimrc
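For reference, here is a minimal sketch of what I believe is happening and one way it might be avoided. It assumes the `model` object in `backward_step` is the DeepSpeed engine (called `engine` below); `turn_batches` and `compute_loss` are hypothetical placeholders for the per-turn inputs and the existing forward/loss logic, not actual CogCoM APIs.

```python
# Suspected failing pattern (for reference): summing the per-turn losses
# and calling backward once means each shared parameter produces two
# gradients inside a single autograd graph. Under ZeRO stage 1/2 the
# per-parameter reduction hook then fires twice, tripping
# "Gradient computed twice for this partition":
#
#   loss = sum(compute_loss(engine, batch) for batch in turn_batches)
#   engine.backward(loss)
#
# Possible workaround (a sketch, not verified on CogCoM): call backward
# once per turn. Each engine.backward call is a separate autograd pass,
# and ZeRO resets its params_already_reduced bookkeeping after every
# backward, so no parameter is reduced twice within one pass. Gradients
# still accumulate before the optimizer step.
def train_step_per_turn(engine, turn_batches, compute_loss):
    total_loss = 0.0
    for batch in turn_batches:
        loss = compute_loss(engine, batch)  # forward for this turn only
        engine.backward(loss)               # reduce this turn's grads separately
        total_loss += loss.item()
    engine.step()                           # optimizer step after all turns
    return total_loss
```

If backward is called more than once per optimizer step, `gradient_accumulation_steps` in the DeepSpeed config presumably has to match the number of calls. Alternatively, setting the ZeRO stage to 0 in `test_config_bf16_zero1off.json` should sidestep the per-partition reduction check entirely, at the cost of higher memory use.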