Hi friends,
I was trying to test the finetune/finetune.py script. It seems that state.best_model_checkpoint always returns None, leading to a failure at the end of the program. Does this mean the program never saved a "best model" during training? I am fairly new to this; could anyone explain what is happening and offer some hints on solving it? Thanks a lot!
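From reading the transformers docs, my guess is that trainer.state.best_model_checkpoint is only populated when the Trainer actually runs an evaluation and saves a checkpoint with load_best_model_at_end=True, so with --max_steps 2 the run may end before any evaluation happens and the field stays None. Here is a minimal sketch of the arguments I believe would be needed (assuming finetune.py builds a standard transformers.Trainer; the exact values are illustrative):

```python
from transformers import TrainingArguments

# Illustrative settings only; the point is that an evaluation and a
# checkpoint save must both happen before training ends, otherwise
# trainer.state.best_model_checkpoint is never set.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    max_steps=2,
    evaluation_strategy="steps",  # run evaluation during training
    eval_steps=1,                 # evaluate at least once before max_steps
    save_strategy="steps",        # must match the evaluation strategy
    save_steps=1,                 # save a checkpoint at the same cadence
    load_best_model_at_end=True,  # required for best_model_checkpoint
    metric_for_best_model="loss",
    greater_is_better=False,      # lower loss is better
)
```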
Command (single GPU):

```bash
python finetune/finetune.py \
    --model_path="../../models/starcoder/" \
    --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" \
    --subset="data/finetune" \
    --split="train" \
    --size_valid_set 10 \
    --streaming \
    --seq_length 2048 \
    --max_steps 2 \
    --batch_size 1 \
    --input_column_name="question" \
    --output_column_name="response" \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-4 \
    --lr_scheduler_type="cosine" \
    --num_warmup_steps 1 \
    --weight_decay 0.05 \
    --output_dir="./checkpoints"
```
Error image (single GPU):
Command (multi-GPU):

```bash
python -m torch.distributed.launch --nproc_per_node 4 finetune/finetune.py \
    --model_path="../../models/starcoder/" \
    --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" \
    --subset="data/finetune" \
    --split="train" \
    --size_valid_set 10000 \
    --streaming \
    --seq_length 2048 \
    --max_steps 2 \
    --batch_size 1 \
    --input_column_name="question" \
    --output_column_name="response" \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --lr_scheduler_type="cosine" \
    --num_warmup_steps 100 \
    --weight_decay 0.05 \
    --output_dir="./checkpoints"
```
Error image (multi-GPU):
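In case it helps, the workaround I am considering is simply guarding against None before the final use of the field (again just a sketch; I am not sure where exactly the script reads it, and the paths here are made up):

```python
# Defensive fallback: use the best checkpoint if one was recorded,
# otherwise save the final model instead of crashing at the end.
best_ckpt = trainer.state.best_model_checkpoint
if best_ckpt is not None:
    print(f"Best checkpoint found at: {best_ckpt}")
else:
    # No evaluation/save cycle ran, so no "best" model exists;
    # keep the last model weights instead.
    trainer.save_model("./checkpoints/final")
```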