2U1 / Phi3-Vision-Finetune

An open-source implementation for fine-tuning Phi3-Vision and Phi3.5-Vision by Microsoft.
Apache License 2.0

'Phi3VForCausalLM' object has no attribute 'save_checkpoint' #17

Closed: shlyahin closed this issue 3 days ago

shlyahin commented 3 weeks ago

Hi,

I got the following error during fine-tuning:

Traceback (most recent call last):
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 619, in __getattr__
    return super().__getattr__(name)  # defer to nn.Module's logic
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'PeftModelForCausalLM' object has no attribute 'save_checkpoint'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 330, in __getattr__
    return super().__getattr__(name)  # defer to nn.Module's logic
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LoraModel' object has no attribute 'save_checkpoint'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi3-Vision-Finetune/src/training/train.py", line 220, in <module>
    train()
  File "/home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi3-Vision-Finetune/src/training/train.py", line 195, in train
    trainer.train()
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2673, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi3-Vision-Finetune/src/training/trainer.py", line 137, in _save_checkpoint
    super(Phi3VTrainer, self)._save_checkpoint(model, trial, metrics)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2756, in _save_checkpoint
    self._save_optimizer_and_scheduler(output_dir)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2848, in _save_optimizer_and_scheduler
    inspect.signature(self.model_wrapped.save_checkpoint).parameters.keys()
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 621, in __getattr__
    return getattr(self.base_model, name)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/peft/tuners/lora/model.py", line 332, in __getattr__
    return getattr(self.model, name)
  File "/home/kiosk1/.cache/pypoetry/virtualenvs/phi3-ivOQmoER-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Phi3VForCausalLM' object has no attribute 'save_checkpoint'
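
The three chained tracebacks are the same lookup falling through each wrapper. save_checkpoint is a method of DeepSpeed's engine wrapper, not of the model itself, so when the Trainer calls it on a model that DeepSpeed never wrapped, PEFT forwards the lookup down to the base model, which doesn't have it either. Here is a minimal sketch of that forwarding chain; the class names mirror the traceback, but the bodies are simplified stand-ins, not the real peft code:

import torch.nn as nn

class Phi3VForCausalLM(nn.Module):
    """Stand-in for the base model: it has no save_checkpoint method."""

class LoraModel(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def __getattr__(self, name):
        try:
            return super().__getattr__(name)   # defer to nn.Module's logic
        except AttributeError:
            return getattr(self.model, name)   # then forward to the wrapped model

class PeftModelForCausalLM(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model

    def __getattr__(self, name):
        try:
            return super().__getattr__(name)   # defer to nn.Module's logic
        except AttributeError:
            return getattr(self.base_model, name)

model = PeftModelForCausalLM(LoraModel(Phi3VForCausalLM()))
model.save_checkpoint  # raises the same chained AttributeErrors as above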

My train config:

--lora_enable True \
--vision_lora True \
--lora_namespan_exclude "['lm_head']" \
--lora_rank 32 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--num_lora_modules -1 \
--deepspeed scripts/zero2.json \
--model_id /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi-3-vision-128k-instruct \
--data_path /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/data/train.json \
--image_folder /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/data/images \
--tune_img_projector True \
--freeze_vision_tower False \
--bf16 False \
--output_dir output/lora_vision_test \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--gradient_checkpointing True \
--report_to wandb \
--lazy_preprocess True \
--dataloader_num_workers 4 \
--disable_flash_attn2 True \
--bits 4

2U1 commented 3 weeks ago

Sorry for the issue. I've fixed some typos and other bugs in the QLoRA path.

It looks like your error is caused by running DeepSpeed on a single GPU without a distributed launcher.

You can fix it by launching through torchrun like this:

torchrun --nproc_per_node 1 \
    src/training/train.py \
    --lora_enable True \
    --vision_lora True \
    --lora_namespan_exclude "['lm_head']" \
    --lora_rank 32 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --deepspeed scripts/zero2.json \
    --model_id /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/Phi-3-vision-128k-instruct \
    --data_path /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/data/train.json \
    --image_folder /home/kiosk1/victor/innoseti_ocr_notebooks/cv/ocr/ocr/phi3/data/images \
    --tune_img_projector True \
    --freeze_vision_tower False \
    --bf16 False \
    --output_dir output/lora_vision_test \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --gradient_checkpointing True \
    --report_to wandb \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --disable_flash_attn2 True \
    --bits 4
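
For context on why the launcher matters: per the traceback, trainer.py calls inspect.signature(self.model_wrapped.save_checkpoint) when DeepSpeed is enabled, and save_checkpoint only exists once DeepSpeed has wrapped the model in its DeepSpeedEngine, which happens in a distributed run. torchrun, even with --nproc_per_node 1, exports the rendezvous environment variables that a plain python launch does not. A quick sketch to verify this from inside train.py:

import os

# Under torchrun these are set, even for a single process;
# under a plain `python src/training/train.py` launch they are unset.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var}={os.environ.get(var, '<unset>')}")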

shlyahin commented 3 weeks ago

Thank you.

I started it with plain python:

python src/training/train.py \
    --lora_enable True \
    ...

pretbc commented 3 weeks ago

Hello,

When I try to run via torchrun or python from the root folder, the following imports cannot be found:

from phi3_vision import Phi3VForCausalLM, Phi3VConfig, Phi3VProcessor
from training.trainer import Phi3VTrainer
from training.data import make_supervised_data_module
from training.params import DataArguments, ModelArguments, TrainingArguments
from training.train_utils import get_peft_state_maybe_zero_3, get_peft_state_non_lora_maybe_zero_3, safe_save_model_for_hf_trainer

so I had to add:

export PYTHONPATH=/path/to/src:$PYTHONPATH


To start training, I added:

sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))

from src.phi3_vision import Phi3VForCausalLM, Phi3VConfig, Phi3VProcessor
from src.training.trainer import Phi3VTrainer
from src.training.data import make_supervised_data_module
from src.training.params import DataArguments, ModelArguments, TrainingArguments
from src.training.train_utils import get_peft_state_maybe_zero_3, get_peft_state_non_lora_maybe_zero_3, safe_save_model_for_hf_trainer
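
A slightly more robust variant of the same idea, as a sketch (the ROOT name is mine; it assumes the snippet lives in a script at the repo root):

import sys
from pathlib import Path

# Resolve the repo root from this file's own location, so the imports
# work no matter which directory the run is launched from.
ROOT = Path(__file__).resolve().parent
sys.path.insert(0, str(ROOT / "src"))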

and started it as a PyCharm run script.

But the 'save_checkpoint' issue still occurs.

2U1 commented 3 weeks ago

@pretbc Are you using one GPU?

pretbc commented 3 weeks ago

Yes, I have one GPU.

It seems the issue is fixed with:

torchrun --nproc_per_node=1 \
    src/training/train.py \
    --lora_enable True \
    --vision_lora True \
    --lora_namespan_exclude "['lm_head']" \
    --lora_rank 32 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --deepspeed scripts/zero2.json \
    --model_id microsoft/Phi-3-vision-128k-instruct \
    --data_path llava_format.json \
    --image_folder /cherry \
    --tune_img_projector True \
    --freeze_vision_tower False \
    --bf16 False \
    --output_dir output/lora_vision_test \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --gradient_checkpointing True \
    --report_to wandb \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --disable_flash_attn2 True \
    --bits 4