Closed: WYY220062 closed this issue 3 months ago.
Sorry, I couldn't fully debug the full fine-tuning code. Can you uncomment this part and try again? It might have been caused by the model-saving step. I'll look into it.
Thank you for your reply! I uncommented these lines and got the following error:

File "./Phi3-Vision-ft/src/training/train_utils.py", line 70, in safe_save_model_for_hf_trainer
    accelerator.save(trainer.model, output_dir, max_shard_size='5GB')
TypeError: Accelerator.save() got an unexpected keyword argument 'max_shard_size'
Did I uncomment the code correctly, the way you suggested?
I removed max_shard_size='5GB' and the training process ran smoothly, but there are only 6 files in the test_train folder: config.json, preprocessor_config.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and trainer_state.json. It seems the parameter files have not been saved.
Sorry for the issue; the code should be changed to accelerator.save_model(...).
I had changed it in the LoRA merging script but forgot to change it in this file.
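For reference, a minimal sketch of that fix in train_utils.py (assuming the function creates its own Accelerator; the surrounding code isn't shown in this thread, and as the follow-up below shows, this still breaks under DeepSpeed):

from accelerate import Accelerator

def safe_save_model_for_hf_trainer(trainer, output_dir: str):
    # Hypothetical sketch: Accelerator.save_model shards the weights and
    # writes them to output_dir. Accelerator.save, the previous call,
    # takes no max_shard_size keyword, which caused the TypeError above.
    accelerator = Accelerator()
    accelerator.save_model(trainer.model, output_dir, max_shard_size="5GB")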
I changed the code to save_model as you suggested, accelerator.save_model(trainer.model, output_dir, max_shard_size='5GB'), but got the following error:
File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 226, in <module>
train()
if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config' File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 222, in train
safe_save_model_for_hf_trainer(trainer, output_dir=training_args.output_dir)
if self.deepspeed_config["zero_optimization"]["stage"] == 3:
File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train_utils.py", line 71, in safe_save_model_for_hf_trainer
accelerator.save_model(trainer.model, output_dir, max_shard_size = '5GB')
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config' if self.deepspeed_config["zero_optimization"]["stage"] == 3:
File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/accelerate/accelerator.py", line 2626, in save_model
state_dict = self.get_state_dict(model)
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config' if self.deepspeed_config["zero_optimization"]["stage"] == 3: if self.deepspeed_config["zero_optimization"]["stage"] == 3:
if self.deepspeed_config["zero_optimization"]["stage"] == 3:
File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/accelerate/accelerator.py", line 3102, in get_state_dict
if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeErrorAttributeError: : 'Accelerator' object has no attribute 'deepspeed_config''Accelerator' object has no attribute 'deepspeed_config'
AttributeErrorAttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
:
'Accelerator' object has no attribute 'deepspeed_config' if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
But finetune_lora.sh and merge_lora.sh work well. Do you have any idea about this?
I think it's because I changed the hyperparameter names in the trainer for the DeepSpeed config. I'll fix it soon. Thanks for reporting it!
@WYY220062 Hmm, can you just use trainer.save_model(...)? Like this:
if trainer.deepspeed:
    torch.cuda.synchronize()
    trainer.save_model(output_dir)
    return
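(trainer.save_model should handle the DeepSpeed case itself, gathering the ZeRO-partitioned weights before writing, so it avoids building a bare Accelerator the way the failing call above did.)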
Can you test the full fine-tuning with a very small subset of the data?
Sure, I'm ready to test with 5K query-response pairs. I changed the code as suggested, but got the following weird error:
wandb: ERROR It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)
Traceback (most recent call last):
  File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 226, in <module>
    train()
  File "/home/opc/yuying_phi_3_dev/Phi3-Vision-ft/src/training/train.py", line 202, in train
    trainer.train()
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer.py", line 2147, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer_callback.py", line 454, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/trainer_callback.py", line 498, in call_event
    result = getattr(callback, event)(
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 773, in on_train_begin
    self.setup(args, state, model, **kwargs)
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 746, in setup
    self._wandb.init(
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1181, in init
    run = wi.init()
  File "/home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 780, in init
    raise error
wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)
Do you want me to reduce the data size?
@WYY220062 That's a wandb error. I think you need to log in to wandb again (wandb login --relogin).
Yes, I re-logged in. It doesn't seem to work.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Currently logged in as: yuying_wyy. Use `wandb login --relogin` to force relogin.
wandb: ERROR Error while calling W&B API: run "8pxjit6x" not found while updating run (<Response [404]>)
Problem at: /home/opc/miniconda3/envs/phi3v/lib/python3.10/site-packages/transformers/integrations/integration_utils.py 746 setup
wandb: ERROR It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)
@WYY220062 You should change the output_dir name or delete the existing directory; it's just saying there is a duplicate name. Also, I'm not sure, but you might need to delete the run in wandb as well.
Sorry, do you mean the output_dir name in finetune.sh? I have no idea where the duplicate name is. I hadn't done the login before and everything worked well; I only got this error today.
@WYY220062 OK, you can set it to none.
@WYY220062 Sorry, I meant setting report_to to none in the bash file.
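For illustration only, the same switch expressed in Python (assuming finetune.sh forwards a --report_to flag into train.py's TrainingArguments; the script itself isn't shown in this thread):

from transformers import TrainingArguments

# report_to="none" disables all logging integrations, including the
# wandb callback that raised the 404 errors above.
args = TrainingArguments(output_dir="output/test_train", report_to="none")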
I changed the code as suggested above, but I get the following error:
It seems you set safe_serialization=False in merge_lora_weights.py.
@WYY220062 Sorry, I was doing a little bit of testing.
else:
    # safe_save_model_for_hf_trainer(trainer, output_dir=training_args.output_dir)
    # Gather the non-LoRA weights (handling possible ZeRO-3 partitioning)
    # and save them alongside the config and processor.
    state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters(), require_grad_only=False)
    model.config.save_pretrained(training_args.output_dir)
    processor.save_pretrained(training_args.output_dir)
    torch.save(state_dict, os.path.join(training_args.output_dir, "pytorch_model.bin"))
You can use this for saving in train.py. I'm sorry for the issue.
else:
    state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters(), require_grad_only=False)
    model.config.save_pretrained(training_args.output_dir)
    processor.save_pretrained(training_args.output_dir)
    # Let the Trainer write the checkpoint itself so it uses its own
    # serialization settings.
    trainer._save(output_dir=training_args.output_dir, state_dict=state_dict)
Or you could use this one to save in safetensors format. It worked for me.
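(The practical difference: torch.save in the first snippet writes a pickled pytorch_model.bin, while trainer._save follows TrainingArguments.save_safetensors, which defaults to True in recent transformers versions, so it writes model.safetensors.)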
Let me know the results of using either of these. Thanks for your help!
Thank you for the reply. This code works!
Hi! I did the fine-tuning with full training and got config.json, model.safetensors, special_tokens_map.json, tokenizer.json, training_args.bin, generation_config.json, preprocessor_config.json, tokenizer_config.json, and trainer_state.json under ./output/test_train. But when I call python -m src.serve.cli --model-path --image-file, I get the error ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32064, 3072])), this look incorrect. When I set --model-path to the original Phi-3 Vision, it works. The fine-tuning process itself looks good, though. Do you have any idea?
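In case it helps narrow this down: a torch.Size([0]) tensor at load time is often a symptom of ZeRO-3-partitioned weights being saved without being gathered first. A hypothetical sanity check (the checkpoint path is assumed from the file listing above) that scans the saved file for empty tensors:

from safetensors import safe_open

# Hypothetical check: list any zero-sized tensors in the checkpoint,
# which would explain the shape torch.Size([0]) error when loading.
with safe_open("output/test_train/model.safetensors", framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        if tensor.numel() == 0:
            print("empty tensor:", name)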