2U1 / Phi3-Vision-Finetune

An open-source implementation for fine-tuning Phi3-Vision and Phi3.5-Vision by Microsoft.
Apache License 2.0

Error when calling inference after fine-tuning with full training #8

Closed WYY220062 closed 3 months ago

WYY220062 commented 3 months ago

Hi! I ran fine-tuning with full training and got config.json, model.safetensors, special_tokens_map.json, tokenizer.json, training_args.bin, generation_config.json, preprocessor_config.json, tokenizer_config.json, and trainer_state.json under ./output/test_train. But when I call python -m src.serve.cli --model-path --image-file, I get this error:

    ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32064, 3072])), this looks incorrect.

When I set --model-path to the original Phi3-Vision, it works, and the fine-tuning process itself seems to finish fine. Do you have an idea?
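(Note: a tensor of shape torch.Size([0]) at load time usually means the checkpoint on disk holds empty placeholders rather than real weights; one common cause is saving from a DeepSpeed ZeRO-3 run without gathering the partitioned parameters first. One way to confirm is to inspect the shapes stored in the saved file. A minimal sketch, assuming the output path from above:)

    from safetensors import safe_open

    # List every tensor and its shape in the saved checkpoint; empty ([0])
    # shapes mean the weights were never gathered before saving.
    with safe_open("./output/test_train/model.safetensors", framework="pt") as f:
        for name in f.keys():
            print(name, f.get_slice(name).get_shape())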

2U1 commented 3 months ago

Sorry, I haven't fully debugged the full fine-tuning code. Can you uncomment this part and try again? The issue may have been caused when saving the model. I'll look into it.

https://github.com/2U1/Phi3-Vision-ft/blob/e90cf1c4c74e895b7f80b46c8a05ae9fa27bc5d5/src/training/train_utils.py#L63-L72
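(For context, based on the traceback quoted in the next comment, the commented-out block ends in an Accelerator-based save roughly along these lines. This is a reconstructed sketch, not the exact file contents; see the link above for those.)

    from accelerate import Accelerator

    def safe_save_model_for_hf_trainer(trainer, output_dir: str):
        # Assumption: how `accelerator` is obtained is not shown in the thread.
        accelerator = Accelerator()
        # Bug reported below: Accelerator.save() serializes a single object and
        # accepts no max_shard_size argument; that parameter belongs to
        # Accelerator.save_model().
        accelerator.save(trainer.model, output_dir, max_shard_size='5GB')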

WYY220062 commented 3 months ago

Thank you for your reply!!! I uncommented these lines and got the following error:

    File "./Phi3-Vision-ft/src/training/train_utils.py", line 70, in safe_save_model_for_hf_trainer
      accelerator.save(trainer.model, output_dir, max_shard_size = '5GB')
    TypeError: Accelerator.save() got an unexpected keyword argument 'max_shard_size'

Did I uncomment the code the way you suggested?

[Screenshot: the uncommented block in train_utils.py]

When I remove max_shard_size = '5GB', training runs smoothly, but there are only 6 files in the test_train folder: config.json, preprocessor_config.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and trainer_state.json. It seems the parameter files have not been saved.

2U1 commented 3 months ago

Sorry for the issue; the code should be changed to accelerator.save_model(...). I've changed it in the LoRA merging script, but I forgot to change it in this file.
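(A minimal sketch of the corrected line, assuming the same safe_save_model_for_hf_trainer function as above. Unlike Accelerator.save(), Accelerator.save_model() does accept max_shard_size:)

    # Shards the state dict into files of at most 5GB and writes them to
    # output_dir (safetensors format by default).
    accelerator.save_model(trainer.model, output_dir, max_shard_size="5GB")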

WYY220062 commented 3 months ago

I changed the code to save_model as you suggested, accelerator.save_model(trainer.model, output_dir, max_shard_size='5GB'), but got the following error:

File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 226, in <module>
    train()
    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'  File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 222, in train
    safe_save_model_for_hf_trainer(trainer, output_dir=training_args.output_dir)
    if self.deepspeed_config["zero_optimization"]["stage"] == 3:

  File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train_utils.py", line 71, in safe_save_model_for_hf_trainer
    accelerator.save_model(trainer.model, output_dir, max_shard_size = '5GB')
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/accelerate/accelerator.py", line 2626, in save_model
    state_dict = self.get_state_dict(model)
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'        if self.deepspeed_config["zero_optimization"]["stage"] == 3:    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
if self.deepspeed_config["zero_optimization"]["stage"] == 3:

  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/accelerate/accelerator.py", line 3102, in get_state_dict
    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeErrorAttributeError: : 'Accelerator' object has no attribute 'deepspeed_config''Accelerator' object has no attribute 'deepspeed_config'
AttributeErrorAttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
: 

'Accelerator' object has no attribute 'deepspeed_config'    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'

But finetune_lora.sh and merge_lora.sh work well. Do you have any idea about this?

2U1 commented 3 months ago

I think it's because I changed the hyperparameter names in the trainer for the DeepSpeed config. I'll fix it soon. Thanks for reporting it!

2U1 commented 3 months ago

@WYY220062 Hmm... can you just use trainer.save_model(...)? Like this:

    if trainer.deepspeed:
        # Make sure all pending CUDA ops finish before saving under DeepSpeed.
        torch.cuda.synchronize()
        trainer.save_model(output_dir)
        return

Can you test the full fine-tuning with a very small subset of the data?
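(A quick way to carve out such a subset for a smoke test, assuming the training data is a single JSON list of samples; the file names here are placeholders:)

    import json

    # Load the full training set and keep only the first 100 samples.
    with open("train.json") as f:
        data = json.load(f)

    with open("train_tiny.json", "w") as f:
        json.dump(data[:100], f)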

WYY220062 commented 3 months ago

Sure, I'm ready to test with 5K query-response pairs. I changed the code like this:

[Screenshot: the modified saving code, using trainer.save_model]

But I'm having the following weird error:

    wandb: ERROR It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)
    Traceback (most recent call last):
      File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 226, in <module>
        train()
      File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 202, in train
        trainer.train()
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
        return inner_training_loop(
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer.py", line 2147, in _inner_training_loop
        self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer_callback.py", line 454, in on_train_begin
        return self.call_event("on_train_begin", args, state, control)
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer_callback.py", line 498, in call_event
        result = getattr(callback, event)(
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 773, in on_train_begin
        self.setup(args, state, model, **kwargs)
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 746, in setup
        self._wandb.init(
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
        raise e
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1181, in init
        run = wi.init()
      File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 780, in init
        raise error
    wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)

Do you want me to reduce the data size?

2U1 commented 3 months ago

@WYY220062 That's a wandb error. I think you need to log in to wandb again.
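(If a plain relogin doesn't help, the same thing can be forced from Python; wandb.login(relogin=True) is equivalent to running wandb login --relogin:)

    import wandb

    # Force a fresh login even if a cached API key exists.
    wandb.login(relogin=True)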

WYY220062 commented 3 months ago

Yes, I did the relogin, but it doesn't seem to work:

    wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
    wandb: Currently logged in as: yuying_wyy. Use `wandb login --relogin` to force relogin
    wandb: ERROR Error while calling W&B API: run "8pxjit6x" not found while updating run (<Response [404]>)
    Problem at: /home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/integrations/integration_utils.py 746 setup
    wandb: ERROR It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)

2U1 commented 3 months ago

@WYY220062 You should change the output_dir name or delete it. It's just saying there is a duplicated name. Also, I'm not sure, but you might need to delete the run in wandb.

WYY220062 commented 3 months ago

Sorry, do you mean the output_dir name in finetune.sh? I have no idea where the duplicated name is. The login worked fine before, and everything ran well; I only got this error today.

2U1 commented 3 months ago

@WYY220062 OK, you can just set it to none.

2U1 commented 3 months ago

@WYY220062 Sorry, I meant setting report_to to none in the bash file.
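(For reference, the equivalent setting if you construct the arguments in Python rather than passing the flag from the shell script; a sketch, where only report_to matters here:)

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./output/test_train",
        report_to="none",  # disable wandb and all other logging integrations
    )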

WYY220062 commented 3 months ago

I changed the code as suggested above, but I'm getting the following error:

[Screenshot: error traceback from saving]

It seems you set safe_serialization=False in merge_lora_weights.py.

2U1 commented 3 months ago

@WYY220062 Sorry, I was doing a little bit of testing.

    else:
        # safe_save_model_for_hf_trainer(trainer, output_dir=training_args.output_dir)
        # Gather the full (non-LoRA) state dict, handling ZeRO-3 partitioning.
        state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters(), require_grad_only=False)
        model.config.save_pretrained(training_args.output_dir)
        processor.save_pretrained(training_args.output_dir)
        # Write the weights as a plain PyTorch checkpoint.
        torch.save(state_dict, os.path.join(training_args.output_dir, "pytorch_model.bin"))

You can use this for saving in train.py. I'm sorry for the issue.

    else:
        state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters(), require_grad_only=False)
        model.config.save_pretrained(training_args.output_dir)
        processor.save_pretrained(training_args.output_dir)
        # Trainer._save writes the weights in safetensors format by default.
        trainer._save(output_dir=training_args.output_dir, state_dict=state_dict)

Or you could use this one to save in safetensors format. It worked for me.

Let me know the results of using either of these. Thanks for your help!
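(Aside: a quick sanity check after saving is to reload the checkpoint and confirm the parameter count is non-zero, which is exactly what failed in the original report. A sketch, assuming the output path from above; Phi-3-Vision requires trust_remote_code=True:)

    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        "./output/test_train", trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(
        "./output/test_train", trust_remote_code=True
    )
    # A healthy full-model checkpoint should report billions of parameters.
    print(sum(p.numel() for p in model.parameters()))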

WYY220062 commented 3 months ago

Thank you for the reply. This code works!