deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

Training loss extremely noisy during fine-tuning and randomly goes to 0 #106

Open zpx01 opened 5 months ago

zpx01 commented 5 months ago

I'm trying to fine-tune the 6.7B model on my own code dataset. I'm running multi-node training in fp32 precision on NVIDIA Tesla V100 GPUs with DeepSpeed ZeRO Stage 3. My training loss fluctuates randomly and occasionally drops to zero; I've attached my training loss graph below:

[Screenshot: training loss curve, 2024-01-25]
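
For context (the actual config wasn't posted), here is a minimal sketch of the kind of DeepSpeed ZeRO Stage 3 config this setup implies. Every value below is an assumption, with fp16/bf16 explicitly disabled to match the fp32 run:

```python
# Sketch of a DeepSpeed ZeRO Stage 3 config matching the setup described
# above (fp32, so fp16/bf16 stay disabled). Values are illustrative, not
# the exact config used in this run.
ds_config = {
    "zero_optimization": {
        "stage": 3,                                        # shard params, grads, and optimizer states
        "overlap_comm": True,                              # overlap communication with computation
        "stage3_gather_16bit_weights_on_model_save": True, # consolidate weights at save time
    },
    "fp16": {"enabled": False},          # fp32 run: no mixed precision
    "bf16": {"enabled": False},          # V100s do not support bf16 anyway
    "train_micro_batch_size_per_gpu": 1, # matches per-device batch size of 1
    "gradient_accumulation_steps": 1,    # no gradient accumulation
    "gradient_clipping": 1.0,
}
```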

I'm running this on 128 GPUs with a train batch size of 1 per device and no gradient accumulation. I'm not sure what could be causing this, as I haven't seen it happen with other Llama-architecture models. I'd appreciate any general direction to help debug this, thanks!
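
To make the setup concrete, here is a minimal sketch of what such a fine-tuning entry point might look like with the Hugging Face `Trainer`; the dataset file, sequence length, and output paths are placeholders I'm assuming, not details from this run:

```python
# Minimal sketch of a ZeRO-3 fp32 fine-tuning script for the 6.7B model.
# Dataset path, max_length, and output_dir are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # collator needs a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical dataset: a JSONL file with one code sample per line in a "text" field.
raw = load_dataset("json", data_files="my_code_dataset.jsonl", split="train")
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=raw.column_names,
)

args = TrainingArguments(
    output_dir="./deepseek-coder-6.7b-ft",
    per_device_train_batch_size=1,   # batch size 1 per device, as in the report
    gradient_accumulation_steps=1,   # no gradient accumulation
    fp16=False,                      # pure fp32 run
    bf16=False,
    logging_steps=1,                 # log every step to see the spikes clearly
    num_train_epochs=1,
    deepspeed="ds_zero3_fp32.json",  # e.g. the ZeRO-3 config sketched above, saved to a file
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A run like this would typically be started with the `deepspeed` launcher and a hostfile, or with `torchrun` on each node, to cover all 128 GPUs.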

zpx01 commented 4 months ago

@DejianYang @pkuzqh I'd appreciate any help on this ticket, thanks!