gothaleshubham opened this issue 2 months ago
A couple of things: I would recommend disabling sample packing with the gpt2 model, since it doesn't really support it without flash attention. Also, the maximum sequence length for gpt2 is 1024; when I tried it with the 2048 setting you are using, it ran into a cuda/nccl issue, which is likely what you were seeing.
I think once you make these changes it should be fine, as it saved properly for me after that.
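For reference, a minimal sketch of the relevant config changes (using the stock axolotl option names `base_model`, `sample_packing`, and `sequence_len`; everything else in your YAML stays as-is):

```yaml
base_model: gpt2

# gpt2 has no flash attention support, so sample packing
# does not work reliably with it -- disable it.
sample_packing: false

# gpt2's maximum context length is 1024; the 2048 setting
# is the likely cause of the cuda/nccl error.
sequence_len: 1024
```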
Please check that this issue hasn't been reported before.
Expected Behavior
Expected behaviour is that the model weights are saved in the output directory.
Current behaviour
Only the tokenizer files are saved in the output directory; the model weights are not saved even after training completes. I am using 2 GPUs for training. During saving I can see utilization on GPU 2 as shown below, but the weights are never written and the process hangs. error.log
Steps to reproduce
Data: train_data_1K.json

```bash
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
cd ..
accelerate launch -m axolotl.cli.train model.yaml
```
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements