ccdv-ai opened this issue 4 weeks ago
Hey, just putting some quick thoughts before I look into this in more detail tomorrow.
When I was comparing results before and after this PR, I noticed that the results after are better. Do you have wandb charts to compare?
`*-pre`: before. Note: The results above are from completion. I didn't compare sft.
I think the training is fine, but the logged values are somehow wrong for SFT. I don't have a chart, but I have some values. I'm using packed training with Qwen 2.5 and a batch of 262,144 tokens:
If I divide the logged values of the GA=8 run by 4, I get something very close to the GA=2 run, both at the start and at the end of training.
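To make that arithmetic concrete, here is a toy sketch with made-up per-micro-batch losses (plain Python, not axolotl's actual logging code) showing why summing per-micro-batch losses makes the GA=8 logs line up with the GA=2 logs after dividing by 4:

```python
# Toy per-micro-batch mean losses; the "true" mean loss is ~0.99 in both runs.
micro_losses_ga8 = [0.99, 0.97, 1.01, 0.98, 1.00, 0.99, 0.98, 0.99]  # GA = 8
micro_losses_ga2 = [0.99, 0.98]                                      # GA = 2

# If the logger sums over accumulation steps instead of averaging,
# the logged value scales with GA:
logged_ga8 = sum(micro_losses_ga8)   # ~7.91
logged_ga2 = sum(micro_losses_ga2)   # ~1.97

# Dividing the GA=8 value by 4 (= 8 / 2) matches the GA=2 value,
# and dividing by GA recovers the true mean loss in either case.
print(logged_ga8 / 4, logged_ga2)        # ~1.98 vs ~1.97
print(logged_ga8 / 8, logged_ga2 / 2)    # ~0.99 vs ~0.985
```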
Can confirm this appears to be a strictly visual issue: eval (and testing afterwards) shows the model is learning as expected. I was using a GA of 4 and started each run with loss values in the 5-6 range, which, when divided by 4, matches my usual training runs. (SFT, Llama 8B)
Can confirm this. The actual loss should be divided by the number of GA steps.
I ran some non-packing tests and couldn't see this. Can someone provide an example config?
`*-pre` runs are from commit 1d6a5e2bd638778a42d757ff0cb600f918eb1c31: https://github.com/axolotl-ai-cloud/axolotl/commit/1d6a5e2bd638778a42d757ff0cb600f918eb1c31
Edit: Added packing tests.
@NanoCode012 I think it's more about comparing between different tuners. For example, if I use another package such as Unsloth, the loss is actually the axolotl loss divided by the number of GA steps, despite everything else being identical. As such, like others have mentioned, the loss in axolotl is not correct.
I have a feeling that the loss in axolotl is not divided by the number of GA steps.
Updating transformers to 4.46.2 and liger to 0.4.0 fixed it for me.
@ccdv-ai, could you share how the logs look?
@jackswl, I'm running a few sft trl tests for comparison, but would you perhaps have a comparison against unsloth?
@jackswl @ccdv-ai
Sorry this took a while. This is the comparison between trl and axolotl sft (trl runs have `*-trl` in their name). I tried to keep as many hyperparameters the same as possible, but there are still some differences in the handling of prompt masking etc.
However, you can see that increasing the GA does not increase the loss several-fold in axolotl. Trl's loss also stays in roughly the same range when varying mbs and GA.
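As a sanity check on that comparison, here is a hedged toy sketch (made-up numbers, not trl or axolotl code) of why a correctly averaged loss should stay in the same range regardless of how a fixed batch is split across mbs and GA:

```python
# Toy per-sample losses for one effective batch of 8 samples.
per_sample_losses = [1.02, 0.95, 1.00, 0.99, 1.03, 0.97, 1.01, 0.98]

def logged_mean(losses, micro_batch_size):
    """Average each micro-batch, then average the micro-batch means."""
    chunks = [losses[i:i + micro_batch_size]
              for i in range(0, len(losses), micro_batch_size)]
    return sum(sum(c) / len(c) for c in chunks) / len(chunks)

# mbs=1/GA=8, mbs=2/GA=4 and mbs=4/GA=2 all give the same mean here,
# since the micro-batches are equally sized.
print(logged_mean(per_sample_losses, 1),
      logged_mean(per_sample_losses, 2),
      logged_mean(per_sample_losses, 4))
```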
Please check that this issue hasn't been reported before.
Expected Behavior
Since the GA fix (#1980), logging does not average loss and grad norm values over accumulation steps; they are summed instead, which makes comparing runs with different GA values difficult. E.g., for 8 accumulation steps,
{'loss': 7.9071, 'grad_norm': 6.211667537689209, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
should be {'loss': 0.988, 'grad_norm': 0.776, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
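For illustration, a minimal sketch of the expected logging behavior, assuming hypothetical `accumulated_loss` and `accumulated_grad_norm` values summed over the GA micro-batches (not axolotl's actual trainer code):

```python
gradient_accumulation_steps = 8

# Values as currently logged (summed over the 8 accumulation steps).
accumulated_loss = 7.9071
accumulated_grad_norm = 6.211667537689209

# Expected log entry: divide by the number of accumulation steps before logging.
log_entry = {
    "loss": accumulated_loss / gradient_accumulation_steps,            # ~0.988
    "grad_norm": accumulated_grad_norm / gradient_accumulation_steps,  # ~0.776
    "learning_rate": 4.524625433624047e-05,
    "epoch": 0.52,
}
print(log_entry)
```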
Current behaviour
Loss and grad norm are summed over accumulation steps instead of averaged.
Steps to reproduce
Any training run with gradient_accumulation_steps > 1
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main
Acknowledgements