artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Model finished training, but adapter_model.bin is empty? #69

Open · disarmyouwitha opened this issue 1 year ago

disarmyouwitha commented 1 year ago

I started the training using:

python qlora.py \
    --model_name_or_path /home/nap/llm_models/llamaOG-65B-hf/ \
    --output_dir ./output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval False \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit \
    --learning_rate 0.0001

Training took 2.5 days and completed successfully. I checked the ./output folder and found all of the checkpoint folders, but I don't think I have the final output (an adapter_model.bin of around ~3 GB).

[Screenshots: contents of the ./output directory showing the checkpoint folders]

Am I just being dumb? Thanks!
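
For anyone else checking this: a quick way to tell whether a saved adapter_model.bin actually contains weights is to load it with plain torch and inspect the state dict. A minimal sketch (the checkpoint path here is just an example; adjust it to your run):

import os
import torch

adapter_path = "./output/checkpoint-1000/adapter_model.bin"  # example path

# A real 65B LoRA adapter should be gigabytes, not a few hundred bytes.
print(f"file size: {os.path.getsize(adapter_path)} bytes")

# Load on CPU and inspect the tensors directly.
state_dict = torch.load(adapter_path, map_location="cpu")
print(f"tensors: {len(state_dict)}")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))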

KKcorps commented 1 year ago

No, it's a bug.

See https://github.com/artidoro/qlora/pull/44
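
The underlying problem is that the Trainer's default checkpointing saves the (near-empty) full-model state dict instead of the PEFT adapter. The fix is to save the adapter explicitly at each checkpoint. A minimal sketch of the idea using a transformers callback (illustrative only; see the PR above for the actual change):

import os
from transformers import TrainerCallback

class SavePeftModelCallback(TrainerCallback):
    """Write the LoRA adapter weights out at every checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # kwargs["model"] is the PeftModel; save_pretrained writes
        # adapter_model.bin and adapter_config.json into the folder.
        kwargs["model"].save_pretrained(os.path.join(checkpoint_dir, "adapter_model"))
        # Drop the near-empty full-model file so only the adapter is kept.
        full_ckpt = os.path.join(checkpoint_dir, "pytorch_model.bin")
        if os.path.exists(full_ckpt):
            os.remove(full_ckpt)
        return control

Register it with trainer.add_callback(SavePeftModelCallback()) before calling trainer.train().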

disarmyouwitha commented 1 year ago

@KKcorps I see, thank you...

Since the adapter files weren't written properly at the checkpoints, I'm guessing that would require retraining after the fix? =x

KKcorps commented 1 year ago

If you have pytorch_model.bin files in the checkpoint directories, it won't; otherwise it might.
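
If the checkpoints do contain pytorch_model.bin files, the adapter can in principle be recovered from them without retraining by filtering the state dict for the LoRA tensors. An untested sketch; it assumes the adapter parameters are the keys containing "lora_", which may vary with your peft version:

import torch

# Load the full checkpoint state dict on CPU.
full_state = torch.load("./output/checkpoint-10000/pytorch_model.bin", map_location="cpu")

# Keep only the LoRA adapter tensors.
adapter_state = {k: v for k, v in full_state.items() if "lora_" in k}
print(f"recovered {len(adapter_state)} adapter tensors")

# Save in the format peft expects, next to adapter_config.json.
torch.save(adapter_state, "./output/checkpoint-10000/adapter_model.bin")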

vihangd commented 1 year ago

I had some luck with my port of alpaca-lora to QLoRA. You can try it at https://github.com/vihangd/alpaca-qlora, though I have only tested it on the OpenLLaMA 3B model.