Open DingYX0731 opened 3 months ago
Thanks for the question. The weights in both adapter_model.bin and chebi.ckpt should be used during evaluation. To do this, you can use --init_checkpoint to load weights from chebi.ckpt and --peft_dir to load weights from adapter_model.bin. You can refer to the caption evaluation script below:
python stage2.py --devices '[0]' --filename chebi_evaluation --stage2_path "all_checkpoints/share/chebi.ckpt" --opt_model 'facebook/galactica-1.3b' --mode eval --prompt '[START_I_SMILES]{}[END_I_SMILES]. ' --tune_gnn --llm_tune lora --inference_batch_size 8 --root "data/ChEBI-20_data" --peft_dir "all_checkpoints/share/chebi_lora" --init_checkpoint all_checkpoints/share/chebi.ckpt;
In this script, you should replace the --peft_dir value "all_checkpoints/share/chebi_lora" with the parent folder of your adapter_model.bin.
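In case it helps, here is a minimal sketch of how a LoRA folder like this is typically consumed by the peft library. The base model name and paths are just the ones from the command above, and this is not the exact loading code inside stage2.py:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Base language model used in the evaluation command above.
base = AutoModelForCausalLM.from_pretrained("facebook/galactica-1.3b")

# --peft_dir should point at the folder that holds adapter_model.bin
# (together with its adapter_config.json).
model = PeftModel.from_pretrained(base, "all_checkpoints/share/chebi_lora")
```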
Thanks for your reply. I am still confused about how to obtain chebi.ckpt and adapter_model.bin if I train the model from scratch. It seems that when I train from scratch, the only checkpoint I obtain is last.ckpt, which I don't know how to use.
Sorry for bothering you again!
Hi! I am trying to reimplement the fine-tuning stage of MolCA by running the following code:
python stage2.py --root 'data/PubChem324kV2/' --devices '0,1' --filename "ft_pubchem324k" --stage2_path "all_checkpoints/stage2/last.ckpt" --opt_model 'facebook/galactica-1.3b' --max_epochs 100 --mode ft --prompt '[START_I_SMILES]{}[END_I_SMILES]. ' --tune_gnn --llm_tune lora --inference_batch_size 8
The training ran smoothly, but the output checkpoint confused me a lot, since its format differs from the shared checkpoint provided on Hugging Face. After training, the only checkpoint saved was last.ckpt, while the shared checkpoint has two parts: adapter_model.bin and chebi.ckpt.
At first, I thought the LoRA weights might also be saved in last.ckpt. There are indeed LoRA-related weights in that checkpoint, but it seems they have not been fine-tuned, because they are named lora.default.weight rather than lora.weight. Comparing my last.ckpt with the shared chebi.ckpt shows this difference, and in the shared adapter_model.bin the LoRA weight names do not contain "default".
Moreover, during fine-tuning the validation results look good, but the evaluation results do not. I suspect the checkpoint was not saved properly.
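For reference, this is roughly how I listed the LoRA parameter names in the two files (the checkpoint path is from my own run and may differ for you; I am assuming last.ckpt is a standard Lightning checkpoint with a state_dict entry):

```python
import torch

# LoRA keys in the checkpoint produced by my fine-tuning run.
ckpt = torch.load("all_checkpoints/ft_pubchem324k/last.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
print([k for k in state_dict if "lora" in k])

# LoRA keys in the shared adapter file from Hugging Face.
adapter = torch.load("all_checkpoints/share/chebi_lora/adapter_model.bin", map_location="cpu")
print(list(adapter.keys()))
```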
Do you have any idea how to resolve this checkpoint issue?
Thanks!