Closed: Nero10578 closed this issue 5 months ago
hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!
It saved on these checkpoints:
There should be 1578 total steps, as can be seen here:
{'train_runtime': 23470.402, 'train_samples_per_second': 16.721, 'train_steps_per_second': 0.067, 'train_loss': 0.09891688005930269, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:31:06<00:00, 14.87s/it]
[2024-05-13 22:54:29,715] [INFO] [axolotl.train.log:61] [PID:802] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
wandb:
wandb: Run history:
wandb: eval/loss ▁█
wandb: eval/runtime ▁█
wandb: eval/samples_per_second █▁
wandb: eval/steps_per_second █▁
wandb: train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb: train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: train/grad_norm ▁▅▃█▃▃▄▂▂▃▃▂▃▆▄▃▂▅▃▃▃▂▄█▅▄▄▃▁▂▄▂▄▃▃▃▃▄▂▂
wandb: train/learning_rate ██▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb: train/loss ▇▃▆▅▄▄▅▄▅▄▂▅▄▆▃▅▄▅▅▅▅▃▄▄█▃▄▃▃▃▃▅▁▄▄▆▇▄▆▄
wandb:
wandb: Run summary:
wandb: eval/loss 0.56355
wandb: eval/runtime 563.5022
wandb: eval/samples_per_second 3.519
wandb: eval/steps_per_second 1.76
wandb: total_flos 1.01735945815311e+19
wandb: train/epoch 2.0
wandb: train/global_step 1578
wandb: train/grad_norm 0.26172
wandb: train/learning_rate 0.0
wandb: train/loss 0.3887
wandb: train_loss 0.09892
wandb: train_runtime 23470.402
wandb: train_samples_per_second 16.721
wandb: train_steps_per_second 0.067
I've tried running it again to see if it's a fluke, and no, it's still failing to save at the end. I've tried with a super short test dataset and it saves fine in that case. Is there something wrong with my set number of steps? It says this when resuming:
[2024-05-13 23:20:40,426] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
[2024-05-13 23:20:40,583] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
Warning: The training argument 'eval_steps' value (0.125) does not match the trainer state 'eval_steps' value (198). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
Warning: The training argument 'save_steps' value (0.25) does not match the trainer state 'save_steps' value (395). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
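For reference, here is a minimal sketch of how those fractional values appear to map onto the absolute numbers in the warnings, assuming they are treated as ratios of the total step count and rounded up (illustrative only, not the actual transformers code):

```python
import math

# Illustrative only: map a fractional eval_steps/save_steps value onto an
# absolute step interval, assuming values in (0, 1) mean a fraction of
# the total number of training steps, rounded up.
max_steps = 1578  # total steps reported for this run

def resolve_interval(value, max_steps):
    if 0 < value < 1:
        return math.ceil(max_steps * value)
    return int(value)

print(resolve_interval(0.125, max_steps))  # 198, matching the eval_steps warning
print(resolve_interval(0.25, max_steps))   # 395, matching the save_steps warning
```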
It looks like the reason it didn't save the last step is that it is saving every 395 steps, so the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.
@Nero10578 Fixed in #1615
Awesome fix! Thank you for all your work on this! So essentially this was just a problem with odd saving steps? Explains why it only happens sometimes.
yeah, what's happening is you have 1578 steps and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps and misses the last step. It might be worth raising an upstream issue with HF transformers for this to use math.floor instead.
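A quick sketch of that arithmetic, using the values from this run (the helper below is purely illustrative, not the transformers implementation):

```python
import math

total_steps = 1578  # from this run
save_ratio = 0.25   # save_steps expressed as a fraction of total steps

def checkpoint_steps(interval, total):
    # Steps at which a save would trigger for a fixed interval.
    return list(range(interval, total + 1, interval))

ceil_interval = math.ceil(total_steps * save_ratio)    # ceil(394.5) = 395
floor_interval = math.floor(total_steps * save_ratio)  # floor(394.5) = 394

print(checkpoint_steps(ceil_interval, total_steps))
# [395, 790, 1185] -> the fourth save would land on step 1580, past the end of training
print(checkpoint_steps(floor_interval, total_steps))
# [394, 788, 1182, 1576] -> all four saves land within the 1578-step run
```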
Ah I see okay. Thanks for explaining that.
Please check that this issue hasn't been reported before.
Expected Behavior
Expected behavior is to save the last checkpoint just like the previous intermediate checkpoints. It has now failed to save the final checkpoint multiple times. I am running this on Ubuntu under WSL2 on Windows 11.
Current behaviour
At the end of a training run, it will not save the last checkpoint.
Nothing appears to go wrong in the logs, as shown.
Steps to reproduce
Just run any training run; both the SFT and DPO runs I've tried failed to save the last checkpoint. Not sure if there is something wrong in my config yaml for the training or a bug in Axolotl.
I've tried both enabling and disabling wandb, since that caused this issue a few months ago as well. This time it made no difference.
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11.9
axolotl branch-commit
2147cf6837e2b90a2ea7045262083cbb0da03858
Acknowledgements