QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by the Qwen team, Alibaba Cloud.

[Train bug] Gradient Explosion in SFT training stage with DeepSpeed ZeRO-2 #109

Closed Grey4sh closed 1 month ago

Grey4sh commented 2 months ago

Gradient explosion

I fine-tuned on a self-built FIM SFT dataset and saw abnormal loss when training with DeepSpeed ZeRO-2. The same dataset did not cause this issue on CodeQwen1.5, and after switching to ZeRO-3 the training proceeded normally. Is this a problem with the model architecture or an incompatibility with the DeepSpeed version? BTW, my DeepSpeed version is 0.13.2.

cyente commented 2 months ago

Here are our recommended SFT practices, which you can use to check for configuration errors.

https://github.com/QwenLM/Qwen2.5-Coder/tree/main/sft

Note that we updated the special tokens between CodeQwen1.5 and Qwen2.5-Coder. Please confirm that your training process has no issues related to special tokens.

{
  "<|fim_prefix|>": 151659, 
  "<|fim_middle|>": 151660, 
  "<|fim_suffix|>": 151661, 
  "<|fim_pad|>": 151662, 
  "<|repo_name|>": 151663, 
  "<|file_sep|>": 151664, 
  "<|im_start|>": 151644, 
  "<|im_end|>": 151645
}
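A quick way to act on this advice is to compare the IDs your tokenizer actually produces against the mapping above. The following is a minimal sketch, not the team's own check: the helper name `find_mismatched_tokens` is hypothetical, and the commented usage assumes the Hugging Face `transformers` package and an illustrative checkpoint name.

```python
from typing import Callable, Dict, Tuple

# IDs copied from the mapping posted in this thread.
EXPECTED_SPECIAL_TOKENS: Dict[str, int] = {
    "<|fim_prefix|>": 151659,
    "<|fim_middle|>": 151660,
    "<|fim_suffix|>": 151661,
    "<|fim_pad|>": 151662,
    "<|repo_name|>": 151663,
    "<|file_sep|>": 151664,
    "<|im_start|>": 151644,
    "<|im_end|>": 151645,
}


def find_mismatched_tokens(
    token_to_id: Callable[[str], int],
    expected: Dict[str, int] = EXPECTED_SPECIAL_TOKENS,
) -> Dict[str, Tuple[int, int]]:
    """Return {token: (expected_id, actual_id)} for every token whose ID differs."""
    mismatches = {}
    for token, expected_id in expected.items():
        actual_id = token_to_id(token)
        if actual_id != expected_id:
            mismatches[token] = (expected_id, actual_id)
    return mismatches


# Usage with transformers (assumed installed; the checkpoint name is illustrative):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
#   print(find_mismatched_tokens(tok.convert_tokens_to_ids))
```

An empty result means the tokenizer agrees with the mapping; any entry points at a token that a dataset built for CodeQwen1.5 may be encoding differently.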

Grey4sh commented 2 months ago

Big shout-out to your team. I did check the new special token format, but I still hit the same problem with ZeRO-2. BTW, is there any plan to provide unsupervised-training examples?

oo0-0-0oo commented 2 months ago

> Big shout-out to your team. I did check the new special token format, but I still hit the same problem with ZeRO-2. BTW, is there any plan to provide unsupervised-training examples?

A smaller learning rate may work.
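As a rough illustration of this suggestion, the sketch below builds a DeepSpeed ZeRO-2 config fragment with a reduced learning rate. The helper name `make_zero2_config` and all numeric values are hypothetical, and gradient clipping is added here as a common companion mitigation, not something proposed in this thread. A real config would also need batch-size and precision settings.

```python
import json


def make_zero2_config(learning_rate: float, max_grad_norm: float = 1.0) -> dict:
    """Build a partial DeepSpeed config: ZeRO stage 2, AdamW, gradient clipping."""
    return {
        "zero_optimization": {"stage": 2},
        # Clip the global gradient norm to limit the size of any single update.
        "gradient_clipping": max_grad_norm,
        "optimizer": {"type": "AdamW", "params": {"lr": learning_rate}},
    }


# Hypothetical values: halve an assumed baseline learning rate of 1e-5.
baseline_lr = 1e-5
config = make_zero2_config(learning_rate=baseline_lr / 2)
print(json.dumps(config, indent=2))
```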

cyente commented 1 month ago

> Big shout-out to your team. I did check the new special token format, but I still hit the same problem with ZeRO-2. BTW, is there any plan to provide unsupervised-training examples?

Could you try to reproduce the problem with the current SFT script? If any issues remain, please provide more detailed, reproducible information so that we can assist further.

Grey4sh commented 1 month ago

Okay, I will provide further information once the script adaptation is completed.

Grey4sh commented 1 month ago

@cyente

I encountered a training error when running the training script from the official repo.

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5225f7a897 in /home/chatgpt/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f51d98d7e33 in /home/chatgpt/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f52256b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f5226f5aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f5226fec850 in /lib/x86_64-linux-gnu/libc.so.6)

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

cyente commented 1 month ago

Hey, we have fixed some tokenization errors in the SFT script; you can try again now.
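The `srcIndex < srcSelectDimSize` assertion in the log above is commonly raised when an index lookup, such as an embedding lookup, receives an index at or beyond the table size, which would be consistent with a tokenization error. A minimal sketch of a pre-training sanity check follows, assuming tokenized samples are available as lists of integer IDs; the helper name `find_out_of_range_ids` is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    samples: Iterable[List[int]], embedding_rows: int
) -> List[Tuple[int, int, int]]:
    """Return (sample_index, position, token_id) for IDs outside [0, embedding_rows)."""
    bad = []
    for sample_index, token_ids in enumerate(samples):
        for position, token_id in enumerate(token_ids):
            if token_id < 0 or token_id >= embedding_rows:
                bad.append((sample_index, position, token_id))
    return bad


# Usage with a transformers model (assumed; not taken from the official script):
#   rows = model.get_input_embeddings().num_embeddings
#   print(find_out_of_range_ids(tokenized_samples, rows))
```

Running a check like this on CPU before launching a distributed job tends to give a clearer error than the CUDA-side assertion.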