Open jinwonkim93 opened 9 months ago
I think the total_num_steps accounts for the gradient accumulation steps (GAS) somewhere non-obvious (I can't track it down atm). I tried a test training with GAS=1 and it had 2401 steps, and then I increased it to GAS=4 leaving everything else the same and it had 600 steps.
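For context, here is a minimal sketch (the sample counts are illustrative, not taken from the report above) of how the number of optimizer updates scales with GAS, which matches the 2401 -> ~600 observation:

```python
import math

def expected_update_steps(num_samples, epochs, micro_batch_size,
                          gradient_accumulation_steps, world_size=1):
    # Each optimizer update consumes micro_batch_size * GAS * world_size samples.
    samples_per_update = micro_batch_size * gradient_accumulation_steps * world_size
    return math.ceil(num_samples * epochs / samples_per_update)

print(expected_update_steps(2401, 1, 1, 1))  # 2401
print(expected_update_steps(2401, 1, 1, 4))  # 601, i.e. roughly 2401 / 4
```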
It does account for it internally in the Trainer, but the custom scheduler you made does not, which makes a difference in how the learning rate is updated.
For example:
GAS=1: the cosine schedule decays the learning rate on every step.
GAS=4: the cosine schedule decays the learning rate only every 4 steps.
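To illustrate the mismatch, here is a hedged sketch using the cosine scheduler from transformers (the step counts are made up; this is not axolotl's actual scheduler code):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

micro_steps = 2400  # steps counted in micro-batches (illustrative)
gas = 4             # gradient_accumulation_steps

def make_optimizer():
    return torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)

# Buggy: schedule sized in micro-batch steps. The Trainer calls
# scheduler.step() once per optimizer update (every `gas` micro-batches),
# so the cosine curve is traversed `gas` times too slowly and never
# reaches its end.
buggy = get_cosine_schedule_with_warmup(
    make_optimizer(), num_warmup_steps=0, num_training_steps=micro_steps)

# Fixed: size the schedule in optimizer updates instead.
fixed = get_cosine_schedule_with_warmup(
    make_optimizer(), num_warmup_steps=0,
    num_training_steps=micro_steps // gas)
```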
This may explain: https://github.com/OpenAccess-AI-Collective/axolotl/issues/1100
Please check that this issue hasn't been reported before.
Expected Behavior
total_num_steps should be calculated with the gradient accumulation steps taken into account, per the transformers documentation:
https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps
Current behaviour
https://github.com/OpenAccess-AI-Collective/axolotl/blob/0f77b8d7986c2b5d7773771fabcbe8bc8640cbe4/src/axolotl/utils/trainer.py#L243
total_num_steps does not include the accumulation steps in its computation, but according to the transformers documentation, logging, evaluation, and saving are conducted every gradient_accumulation_steps * step.
The problem is that the scheduler is affected by this max step value.
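A sketch of what the corrected computation could look like (variable names are hypothetical, mirroring the linked trainer.py code rather than quoting it; the key change is dividing by gradient_accumulation_steps so total_num_steps counts optimizer updates):

```python
import math

def compute_total_num_steps(num_samples, num_epochs, micro_batch_size,
                            gradient_accumulation_steps, world_size):
    # Count optimizer updates, not micro-batches, so the value matches
    # how the HF Trainer steps the LR scheduler.
    return math.floor(
        num_samples * num_epochs
        / (micro_batch_size * gradient_accumulation_steps * world_size)
    )
```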
Steps to reproduce
Run preprocessing and compare the computed total_num_steps for different gradient_accumulation_steps values.
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements