axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Logging behavior since GA fix #2004

Open ccdv-ai opened 4 weeks ago

ccdv-ai commented 4 weeks ago

Please check that this issue hasn't been reported before.

Expected Behavior

Since the GA fix (#1980), logging no longer averages the loss and grad norm values over accumulation steps; they are summed instead, which makes comparing runs with different GA values difficult. E.g. for 8 accumulation steps, {'loss': 7.9071, 'grad_norm': 6.211667537689209, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52} should be {'loss': 0.988, 'grad_norm': 0.776, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}.
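For illustration only (this is not axolotl's actual logging code, and the per-micro-batch values are made up), a minimal sketch of the two conventions for GA = 8:

```python
# With gradient accumulation, the logged per-step loss is conventionally the
# mean over the accumulation micro-batches, not their sum.
micro_batch_losses = [0.98, 1.01, 0.97, 1.00, 0.99, 0.98, 1.02, 0.96]  # GA = 8, illustrative values
ga_steps = len(micro_batch_losses)

summed_loss = sum(micro_batch_losses)     # ~7.91, what is currently logged
averaged_loss = summed_loss / ga_steps    # ~0.99, what I would expect to see

print(f"summed: {summed_loss:.4f}  averaged: {averaged_loss:.4f}")
```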

Current behaviour

Loss and grad norm are summed over accumulation steps instead of averaged.

Steps to reproduce

Any training process

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

NanoCode012 commented 4 weeks ago

Hey, just putting some quick thoughts before I look into this in more detail tomorrow.

When I was comparing results before and after this PR, I noticed that the results after were better. Do you have wandb charts to compare?

*-pre: runs from before the fix.

[wandb charts: loss comparison, before vs. after the GA fix]

Note: The results above are from completion. I didn't compare sft.

ccdv-ai commented 4 weeks ago

I think the training is fine, but the logged values are somehow wrong for SFT. I don't have a chart, but I have some values. I'm using packed training with Qwen 2.5 and a batch of 262,144 tokens:

If I divide the GA=8 values by 4, I get something very close to the GA=2 values, both at the start and at the end of training.
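As a toy illustration of that ratio (hypothetical numbers, using the expected per-step loss from above): if the logger sums over accumulation steps, the reported value scales with GA, so the GA=8 logs come out roughly 4x the GA=2 logs:

```python
# Toy check: summing the per-micro-batch loss over accumulation steps makes
# the logged value scale with GA, so GA=8 reports ~4x the GA=2 value.
true_loss = 0.988               # per-step loss, independent of GA (from the expected values above)

logged_ga2 = 2 * true_loss      # summed over 2 accumulation steps
logged_ga8 = 8 * true_loss      # summed over 8 accumulation steps

print(logged_ga8 / logged_ga2)  # -> 4.0, i.e. divide the GA=8 logs by 4 to compare with GA=2
```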

Gryphe commented 4 weeks ago

Can confirm this appears to be a strictly visual issue - eval (and testing afterwards) shows the model is learning as expected. I was using a GA of 4 and started each run with loss values in the 5-6 range, which, when divided by 4, matches my usual training runs. (SFT, Llama 8B)

jackswl commented 3 weeks ago

Can confirm this. The actual loss should be divided by GA.

NanoCode012 commented 3 weeks ago

I ran some non-packing tests and couldn't reproduce this. Can someone provide an example config?

*-pre runs are from commit 1d6a5e2bd638778a42d757ff0cb600f918eb1c31: https://github.com/axolotl-ai-cloud/axolotl/commit/1d6a5e2bd638778a42d757ff0cb600f918eb1c31

[wandb charts: non-packing test comparison]

Edit: Added packing tests.

[wandb charts: packing test comparison]

jackswl commented 3 weeks ago

@NanoCode012 I think it's more about comparing across different fine-tuning frameworks. For example, if I use another package such as Unsloth, the loss there is the axolotl loss divided by the number of GA steps, despite everything else being identical. As such, like what others have mentioned, the loss logged by axolotl is not correct.

I have a feeling that the loss in axolotl is not being divided by the number of GA steps.

ccdv-ai commented 2 weeks ago

Updating transformers to 4.46.2 and liger to 0.4.0 fixed it for me.
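For anyone checking their environment, a quick way to confirm the installed versions (assuming the usual PyPI package names, transformers and liger-kernel):

```python
# Minimal sketch: print the installed versions and compare against the
# combination reported to fix the logging (transformers 4.46.2, liger-kernel 0.4.0).
from importlib.metadata import version

print(version("transformers"))   # expected: 4.46.2
print(version("liger-kernel"))   # expected: 0.4.0
```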

NanoCode012 commented 2 weeks ago

@ccdv-ai , could you share how the logs look?

NanoCode012 commented 2 weeks ago

@jackswl, I'm running a few SFT TRL tests for comparison, but would you perhaps have a comparison against Unsloth?

NanoCode012 commented 1 week ago

@jackswl @ccdv-ai

Sorry this took a while. This is the comparison between TRL and axolotl SFT (TRL runs have *-trl in their names). I tried to keep as many hyperparameters the same as possible, but there are still some differences in the handling of prompt masking, etc.

However, you can see that increasing the GA does not multiply the loss in axolotl. TRL's loss also stays in roughly the same range when varying micro batch size and GA.

[wandb chart: TRL vs. axolotl SFT loss comparison]