I'm having trouble resuming from a checkpoint. What I did:
1) python qlora.py --model_name_or_path huggyllama/llama-7b
2) abort the run once a checkpoint has been written
3) python qlora.py --model_name_or_path huggyllama/llama-7b
I expected fine-tuning to pick up where I aborted it, but instead I get the following error message:
...
torch.uint8 3238002688 0.8846206784649213
Traceback (most recent call last):
  File "/workspace/qlora/qlora.py", line 758, in <module>
    train()
  File "/workspace/qlora/qlora.py", line 720, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
  File "/workspace/anaconda3/envs/qlora310/lib/python3.10/site-packages/transformers/trainer.py", line 1685, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/workspace/anaconda3/envs/qlora310/lib/python3.10/site-packages/transformers/trainer.py", line 2159, in _load_from_checkpoint
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at ./output/checkpoint-500
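In case it helps with debugging, here is a minimal sketch that checks what ./output/checkpoint-500 actually contains. As far as I can tell, trainer.py raises this ValueError when it can't find a full-model weights file such as pytorch_model.bin in the directory, while a QLoRA run saves PEFT adapter files instead; that is my reading of _load_from_checkpoint, not something confirmed by the log:

import os

ckpt_dir = "./output/checkpoint-500"  # the path from the error message above

# List what the checkpoint directory actually contains.
print(sorted(os.listdir(ckpt_dir)))

# Check for the full-model weights file that Trainer appears to expect
# (my assumption; adapter_model.bin / adapter_config.json would indicate
# that only PEFT adapter weights were saved).
print("pytorch_model.bin present:",
      os.path.isfile(os.path.join(ckpt_dir, "pytorch_model.bin")))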