Closed. shlyahin closed this issue 2 months ago.
Sorry for the issue. I've fixed some typos and other bugs for QLoRA.
It looks like your error is caused by using 1 GPU with DeepSpeed.
You can fix it like this, using torchrun:
torchrun --nproc_per_node 1 \
src/training/train.py \
--lora_enable True \
--vision_lora True \
--lora_namespan_exclude "['lm_head']" \
--lora_rank 32 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--num_lora_modules -1 \
--deepspeed scripts/zero2.json \
--model_id /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi-3-vision-128k-instruct \
--data_path /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/data/train.json \
--image_folder /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/data/images \
--tune_img_projector True \
--freeze_vision_tower False \
--bf16 False \
--output_dir output/lora_vision_test \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--gradient_checkpointing True \
--report_to wandb \
--lazy_preprocess True \
--dataloader_num_workers 4 \
--disable_flash_attn2 True \
--bits 4
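A side note (not from the repo): torchrun sets the distributed environment variables that the DeepSpeed integration reads, which is most likely why --deepspeed fails on a single GPU when the script is launched with plain python. A quick sanity check:

import os

# These should be set by torchrun (or another distributed launcher) before
# DeepSpeed initializes; with plain `python` they stay unset.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")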
Thank you.
I was starting it with plain python:
python src/training/train.py \
--lora_enable True \
...
Hello,
when I try to run via torchrun or python from the root folder, I get an error that
from phi3_vision import Phi3VForCausalLM, Phi3VConfig, Phi3VProcessor
from training.trainer import Phi3VTrainer
from training.data import make_supervised_data_module
from training.params import DataArguments, ModelArguments, TrainingArguments
from training.train_utils import get_peft_state_maybe_zero_3, get_peft_state_non_lora_maybe_zero_3, safe_save_model_for_hf_trainer
the above imports cannot be found. So I had to add:
export PYTHONPATH=/path/to/src:$PYTHONPATH
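An alternative (just a sketch, assuming train.py lives at src/training/train.py) is to put the src directory on sys.path at the top of the script instead of exporting PYTHONPATH:

import sys
from pathlib import Path

# Make the `src` directory importable regardless of the working directory.
# Assumes this file is src/training/train.py, so parents[1] resolves to `src`.
SRC_DIR = Path(__file__).resolve().parents[1]
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

With that in place, the imports above (from phi3_vision ..., from training.trainer ...) resolve without touching the environment.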
To start training, I added:
sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))
from src.phi3_vision import Phi3VForCausalLM, Phi3VConfig, Phi3VProcessor
from src.training.trainer import Phi3VTrainer
from src.training.data import make_supervised_data_module
from src.training.params import DataArguments, ModelArguments, TrainingArguments
from src.training.train_utils import get_peft_state_maybe_zero_3, get_peft_state_non_lora_maybe_zero_3, safe_save_model_for_hf_trainer
And ran it as a PyCharm run script, but the issue regarding 'save_checkpoint' still occurs.
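For what it's worth, save_checkpoint is a method of the DeepSpeed engine rather than of the PEFT model, so this AttributeError usually means the model was never wrapped by DeepSpeed, which matches running the script without a distributed launcher. A quick check before trainer.train() (debugging sketch; the trainer variable name is assumed from train.py):

def check_deepspeed_wrapping(trainer):
    # When DeepSpeed is initialized correctly, the Trainer's wrapped model is
    # a DeepSpeed engine that exposes save_checkpoint; otherwise it is still
    # the bare PEFT-wrapped model, as in the reported error.
    print(type(trainer.model_wrapped))
    print("has save_checkpoint:", hasattr(trainer.model_wrapped, "save_checkpoint"))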
@pretbc Are you using one GPU?
Yes, I have one GPU.
Seems the issue is fixed with:
torchrun --nproc_per_node=1 \
src/training/train.py \
--lora_enable True \
--vision_lora True \
--lora_namespan_exclude "['lm_head']" \
--lora_rank 32 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--num_lora_modules -1 \
--deepspeed scripts/zero2.json \
--model_id microsoft/Phi-3-vision-128k-instruct \
--data_path llava_format.json \
--image_folder /cherry \
--tune_img_projector True \
--freeze_vision_tower False \
--bf16 False \
--output_dir output/lora_vision_test \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--gradient_checkpointing True \
--report_to wandb \
--lazy_preprocess True \
--dataloader_num_workers 4 \
--disable_flash_attn2 True \
--bits 4
Hi,
I've got the following error during fine-tuning:
Traceback (most recent call last):
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 619, in __getattr__
    return super().__getattr__(name)  # defer to nn.Module's logic
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'PeftModelForCausalLM' object has no attribute 'save_checkpoint'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 330, in __getattr__
    return super().__getattr__(name)  # defer to nn.Module's logic
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LoraModel' object has no attribute 'save_checkpoint'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi3-Vision-Finetune/src/training/train.py", line 220, in <module>
    train()
  File "/home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi3-Vision-Finetune/src/training/train.py", line 195, in train
    trainer.train()
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2673, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi3-Vision-Finetune/src/training/trainer.py", line 137, in _save_checkpoint
    super(Phi3VTrainer, self)._save_checkpoint(model, trial, metrics)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2756, in _save_checkpoint
    self._save_optimizer_and_scheduler(output_dir)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2848, in _save_optimizer_and_scheduler
    inspect.signature(self.model_wrapped.save_checkpoint).parameters.keys()
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 621, in __getattr__
    return getattr(self.base_model, name)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 332, in __getattr__
    return getattr(self.model, name)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Phi3VForCausalLM' object has no attribute 'save_checkpoint'
My train config: