Closed Grey4sh closed 1 month ago
Here are our best SFT practices, which you can refer to in order to verify if there are any configuration errors.
https://github.com/QwenLM/Qwen2.5-Coder/tree/main/sft
Noted that, we have made an update to the special tokens from codeqwen1.5 to qwen2.5-coder. Please confirm whether there are any issues related to special tokens during the training process.
{
"<|fim_prefix|>": 151659,
"<|fim_middle|>": 151660,
"<|fim_suffix|>": 151661,
"<|fim_pad|>": 151662,
"<|repo_name|>": 151663,
"<|file_sep|>": 151664,
"<|im_start|>": 151644,
"<|im_end|>": 151645
}
Big shout to your team. I did check the new special token format , but still meet the same problem with ZeRO2. BTW, is there any plan to provide the unsupervised-training exmaples?
Big shout to your team. I did check the new special token format , but still meet the same problem with ZeRO2. BTW, is there any plan to provide the unsupervised-training exmaples?
small LR may work
Big shout to your team. I did check the new special token format , but still meet the same problem with ZeRO2. BTW, is there any plan to provide the unsupervised-training exmaples?
Could you reproduce the current SFT script's solution? If there are any issues, please provide more detailed reproducible content to assist further.
Okay, I will provide further information once the script adaptation is completed.
@cyente
Enconter train error when reproduced the train script in official repo.
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5225f7a897 in /home/chatgpt/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f51d98d7e33 in /home/chatgpt/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f52256b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f5226f5aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f5226fec850 in /lib/x86_64-linux-gnu/libc.so.6)
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
hey we have modified the phenomena of some tokenization errors in the sft script, and you can try again now.
I used a self-built FIM SFT dataset for fine-tuning, and encountered abnormal loss when training with DeepSpeed ZeRO2. However, the same dataset did not have this issue on CodeQwen1.5. After switching to ZeRO3, the training proceeded normally. Is this a problem with the model architecture or an incompatibility with the DeepSpeed version? BTW, the version of my DeepSpeed is 0.13.2