Open · hahmad2008 opened this issue 9 months ago
Was there any stack trace error? Did you run out of space? Did the run abruptly quit?
@NanoCode012 Not at all.
Also, fp16 doesn't seem to be applied here: the checkpoint is 4G, not 2G.
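For reference, a rough size estimate (assuming TinyLlama's ~1.1B parameters, which may be slightly off) shows why ~4G looks like fp32 weights rather than fp16:

```python
# Back-of-the-envelope checkpoint sizes for a ~1.1B-parameter model (TinyLlama).
params = 1.1e9
print(f"fp16/bf16: ~{params * 2 / 1e9:.1f} GB")  # ~2.2 GB -> the expected ~2G
print(f"fp32:      ~{params * 4 / 1e9:.1f} GB")  # ~4.4 GB -> matches the observed ~4G
```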
Not sure how axolotl should behave, but from other experience the optimizer state can be saved alongside the weights, ballooning the checkpoint size. We always do a load-then-resave trick with weights_only=True to keep just the weights we actually need.
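A minimal sketch of that trick, assuming a plain single-file PyTorch checkpoint (the path and keys here are illustrative, not necessarily the exact layout axolotl writes):

```python
import torch

# Load with weights_only=True (tensors and basic containers only, no pickled extras),
# then re-save just the model state dict so optimizer state doesn't inflate the file.
ckpt = torch.load("model-finetuned/checkpoint-4130/pytorch_model.bin",
                  map_location="cpu", weights_only=True)
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
torch.save(state_dict, "model-finetuned/pytorch_model.weights-only.bin")
```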
As for the model folder being small, we ran into a similar issue: we had to launch the inference playground from the checkpoint folder instead of the model folder, like so
python -m axolotl.cli.inference config.yaml --base_model='model-finetuned/checkpoint-4130/' --gradio
@hahmad2008, sorry I didn't get to follow up. I recently used DeepSpeed ZeRO-3 for training and it worked. Perhaps the issue is solved now?
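If the tiny model folder comes from ZeRO-3 sharding the weights inside the checkpoint, DeepSpeed's stock zero_to_fp32 utility can consolidate them into a single state dict. A sketch, assuming a standard DeepSpeed checkpoint layout under the checkpoint folder shown above:

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Gather the ZeRO-3 sharded parameters from the checkpoint folder into one
# fp32 state dict, then save it as a regular PyTorch weights file.
state_dict = get_fp32_state_dict_from_zero_checkpoint("model-finetuned/checkpoint-4130")
torch.save(state_dict, "model-finetuned/pytorch_model.bin")
```

Setting "stage3_gather_16bit_weights_on_model_save": true in the ZeRO-3 JSON has a similar effect at save time, making DeepSpeed write consolidated 16-bit weights when the model is saved.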
Please check that this issue hasn't been reported before.
Expected Behavior
With a full finetune, I expect a model of about 2G in the output directory; however, the model directory is only 1M! The checkpoint should be ~2G since fp16 is enabled.
Current behaviour
For TinyLlama with a full finetune, the model is not saved in the model directory.
Config
accelerate-config.yaml
config.yaml
deepspeed/zero3.json
Final Model
ls -lh model-finetuned
ls -lh model-finetuned/checkpoint-4130/
Steps to reproduce
Run a full finetune of TinyLlama with the command and configs below; the model is not saved in the model directory.
Command
accelerate launch --config_file accelerate-config.yaml scripts/finetune.py axolotl/config.yaml
Config
accelerate-config.yaml
config.yaml
deepspeed/zero3.json
Final Model
ls -lh model-finetuned
ls -lh model-finetuned/checkpoint-4130/
Config yaml
config.yaml
Possible solution
I checked this issue; however, the latest branch doesn't solve the problem.
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main
Acknowledgements