Open jinwonkim93 opened 9 months ago
I think the total_num_steps accounts for the gradient accumulation steps (GAS) somewhere non-obvious (I can't track it down atm). I tried a test training with GAS=1 and it had 2401 steps, and then I increased it to GAS=4 leaving everything else the same and it had 600 steps.
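For context, here is a minimal sketch (the sample counts are illustrative, not taken from the report above) of how the number of optimizer updates scales with GAS, which matches the 2401 -> ~600 observation:

```python
import math

def expected_update_steps(num_samples, epochs, micro_batch_size,
                          gradient_accumulation_steps, world_size=1):
    # Each optimizer update consumes micro_batch_size * GAS * world_size samples.
    samples_per_update = micro_batch_size * gradient_accumulation_steps * world_size
    return math.ceil(num_samples * epochs / samples_per_update)

print(expected_update_steps(2401, 1, 1, 1))  # 2401
print(expected_update_steps(2401, 1, 1, 4))  # 601, i.e. roughly 2401 / 4
```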
It does account for it internally in the Trainer, but the custom scheduler you made does not, which makes a difference in how the learning rate is updated.
For example:
GAS=1: the cosine schedule decays the learning rate on every step.
GAS=4: the cosine schedule decays the learning rate only every 4 steps.
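To illustrate the mismatch, here is a hedged sketch using the cosine scheduler from transformers (the step counts are made up; this is not axolotl's actual scheduler code):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

micro_steps = 2400  # steps counted in micro-batches (illustrative)
gas = 4             # gradient_accumulation_steps

def make_optimizer():
    return torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)

# Buggy: schedule sized in micro-batch steps. The Trainer calls
# scheduler.step() once per optimizer update (every `gas` micro-batches),
# so the cosine curve is traversed `gas` times too slowly and never
# reaches its end.
buggy = get_cosine_schedule_with_warmup(
    make_optimizer(), num_warmup_steps=0, num_training_steps=micro_steps)

# Fixed: size the schedule in optimizer updates instead.
fixed = get_cosine_schedule_with_warmup(
    make_optimizer(), num_warmup_steps=0,
    num_training_steps=micro_steps // gas)
```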
This may explain: https://github.com/OpenAccess-AI-Collective/axolotl/issues/1100
Please check that this issue hasn't been reported before.
Expected Behavior
total_num_steps should be calculated with the gradient accumulation steps taken into account, per the transformers documentation:
https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps
Current behaviour
https://github.com/OpenAccess-AI-Collective/axolotl/blob/0f77b8d7986c2b5d7773771fabcbe8bc8640cbe4/src/axolotl/utils/trainer.py#L243
total_num_steps does not include the accumulation steps in its computation, but according to the transformers documentation, logging, evaluation, and saving are conducted every gradient_accumulation_steps * step.
The problem is that the scheduler is affected by this max step value.
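A sketch of what the corrected computation could look like (variable names are hypothetical, mirroring the linked trainer.py code rather than quoting it; the key change is dividing by gradient_accumulation_steps so total_num_steps counts optimizer updates):

```python
import math

def compute_total_num_steps(num_samples, num_epochs, micro_batch_size,
                            gradient_accumulation_steps, world_size):
    # Count optimizer updates, not micro-batches, so the value matches
    # how the HF Trainer steps the LR scheduler.
    return math.floor(
        num_samples * num_epochs
        / (micro_batch_size * gradient_accumulation_steps * world_size)
    )
```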
Steps to reproduce
Run preprocessing and compare the computed total_num_steps for different gradient_accumulation_steps values.
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements