hahmad2008 opened this issue 9 months ago
If you're using fp16, you'll likely have to turn your learning rate way down. You're getting over/underflows of the fp16 values, leading to 0 loss.
@winglian but I fully fine-tuned the same TinyLlama model using fp16 with DeepSpeed ZeRO-2, and there was no problem with it, no NaN weights.
@winglian btw I am using the Docker version with the following package versions:
- cuda: 11.8
- pytorch: 2.0.1+cu118
- accelerate: 0.24.0.dev0
- transformers: 4.35.0.dev0
@winglian I changed the learning rate to learning_rate: 0.000002 and the loss still drops to zero.
@winglian Any idea, please?
Which GPU are you using? Are you able to use bf16? It should stabilize loss better.
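A minimal way to check whether the current GPU supports bf16 (a generic sketch, assuming a CUDA build of PyTorch):

```python
import torch

# bf16 needs Ampere-or-newer hardware (compute capability >= 8.0).
# A Tesla T4 is Turing (7.5), so this prints False there and True on an A10/A100.
print(torch.cuda.get_device_name(0))
print(torch.cuda.is_bf16_supported())
```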
@NanoCode012 I am using 2 x Tesla T4 GPUs with fp16.
@hahmad2008, it's possible that there is numerical instability. Would you be able to use a newer-generation GPU that's Ampere gen? I would recommend enabling bf16: true to prevent this issue.
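In the axolotl config that would look roughly like this (a sketch showing only the precision keys; keep the rest of your config as-is):

```yaml
bf16: true   # bfloat16 mixed precision (Ampere or newer GPUs)
fp16: false  # disable fp16, which is what over/underflows here
```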
Alternatively, can you try DeepSpeed? I believe there was also some issue with FSDP a while back.
@NanoCode012 Thanks, I will give it a try and will come back to you.
@NanoCode012 @winglian I tried with bf16 on an A10 GPU and the training loss was stable, but with fp16 it was not: the loss jumped to zero and the weights of the generated model were NaN.
btw, when training with fp16, the trainer should use a GradScaler. I am wondering whether axolotl with FSDP uses a grad scaler with mixed-precision fp16 or not?
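For reference, a minimal sketch (not axolotl's actual training loop) of what fp16 mixed precision under FSDP paired with PyTorch's ShardedGradScaler looks like; the tiny linear model, loss, and hyperparameters are placeholders:

```python
# Run with e.g. `torchrun --nproc_per_node=2 fsdp_fp16_sketch.py`
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

dist.init_process_group("nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Placeholder module standing in for the real model.
model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)
scaler = ShardedGradScaler()  # FSDP-aware replacement for torch.cuda.amp.GradScaler

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).float().pow(2).mean()
    # Scale the loss so small fp16 gradients don't underflow to zero;
    # scaler.step() skips the update when inf/NaN gradients are detected.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

dist.destroy_process_group()
```

Without a grad scaler (or with one that isn't FSDP-aware), small fp16 gradients underflow and the loss can collapse to zero exactly as described above.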
Please check that this issue hasn't been reported before.
Expected Behavior
I am fully fine-tuning TinyLlama (4 GB) with FSDP on a single machine with two GPUs. The run completes as expected and the saved model is 2 GB, as expected for float16. The problem is that the model params and weights are NaN.
```python
weights = torch.load("model-out/pytorch_model.bin")
```
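Loading the checkpoint on CPU and checking each tensor confirms this (a sketch, assuming pytorch_model.bin is a flat state dict of tensors):

```python
import torch

# Load the saved checkpoint and report how many tensors contain NaNs.
weights = torch.load("model-out/pytorch_model.bin", map_location="cpu")
nan_params = [name for name, t in weights.items() if torch.isnan(t.float()).any()]
print(f"{len(nan_params)}/{len(weights)} tensors contain NaN values")
```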
Log
Current behaviour
The model should not have all-NaN values for its params and weights.
Steps to reproduce
Same as in Expected Behavior.
Config yaml
Possible solution
-
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
a045db02146751548fec57a5d3f31382ce4e5959
Acknowledgements