bhattg opened this issue 1 year ago
Hello, which versions of Python and CUDA are you using? Thank you.
This is a very interesting observation, and I believe it may be related to the learning-rate schedule and warm-up settings, although other factors could be worth exploring.
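For illustration, here is a minimal sketch of how a linear-warmup plus cosine-decay schedule keeps the effective learning rate tiny for the first steps (the function and schedule shape below are my own illustration, not necessarily what this repo implements):

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.1):
    """Linear warmup followed by cosine decay (a common schedule shape)."""
    warmup_steps = int(warmup_frac * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # During warmup the LR ramps up from ~0, so early updates are tiny
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr toward 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for s in [0, 10, 100, 1000, 5000]:
    print(s, lr_at_step(s, total_steps=10_000))
```

With a schedule like this, the loss can look flat until warmup ends simply because the updates are too small to move the weights noticeably.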
Hello, sorry for the delayed reply. The Python version is 3.10.13 and CUDA is 11.7.
The experiment was run with torch 1.13.0.
Regarding the training setup, I am using the following flags:
`--fix_rate 0.7 --lr 1e-05 --lr-decay-style cosine --warmup 0.0 --batch_size 32 --accumulation_steps 1 --epochs 50`
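One way to narrow this down is to log the global gradient norm every step and check whether it is genuinely flat or just small. A minimal sketch (the helper below is my own illustration, not part of the training script):

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients, computed after backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total ** 0.5

# Toy usage with a dummy model as a stand-in for the reward model:
model = torch.nn.Linear(8, 1)
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
print(f"loss={loss.item():.4f} grad_norm={global_grad_norm(model):.6f}")
```

If the norm is small but nonzero and the loss only moves after many steps, that points toward the schedule or frozen parameters (e.g., if `--fix_rate 0.7` freezes a fraction of the model) rather than a bug in the loss itself.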
Hi! I am trying to train a reward model, and I am confused about why, during the initial iterations of training, neither the gradients nor the loss change. Only after some steps do they suddenly change, and then learning completes.
The learning dynamics are attached below.
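For context, reward models are often trained with a pairwise ranking objective; the sketch below is an assumption about the loss (the actual implementation here may differ), but it shows why a loss that sits flat near log(2) ≈ 0.693 early in training would mean the model is not yet separating chosen from rejected responses:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# With near-zero reward margins, the loss is pinned at log(2) ~= 0.6931,
# which looks like a plateau until the margins start to grow.
chosen = torch.zeros(4)
rejected = torch.zeros(4)
print(pairwise_reward_loss(chosen, rejected))  # tensor(0.6931)
```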