Closed: liminghao0914 closed this issue 1 year ago
Hi, thank you for your interest in our work. The following are my suggestions:
For LLaMA-Adapter:
The --llama_path argument should always point to the LLaMA-V1 weights, with which the BIAS-7B.pth checkpoint was trained.
For LLaMA2-Accessory:
The script you provided is correct (though we recommend using bf16 instead of fp16). However, it fine-tunes a model that has already been pre-trained on large-scale image-text pairs. See here: the clip_proj and visual_proj parameters are not trainable by default. The code for image-text-pair pre-training will be released soon.
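As a quick sanity check before launching fine-tuning, you can list which parameters are actually trainable. The snippet below is a generic PyTorch sketch (the toy model and the frozen layer stand in for the real Accessory model and its clip_proj/visual_proj modules):

```python
import torch.nn as nn

# Toy stand-in for the model; in practice, iterate over the loaded Accessory model.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first layer to mimic clip_proj/visual_proj being non-trainable by default.
for p in model[0].parameters():
    p.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the unfrozen layer's weight and bias remain
```

If a parameter you expect to be fine-tuned does not show up in this list, the training script will silently leave it at its initial value.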
However, your results show that your model outputs random characters, which indicates problems beyond those mentioned above. Please check these two key points that are easy to miss:
Our fine-tuning script only saves the trainable parameters. This means that if you load only the saved checkpoint before inference, most parameters will still be random. So you need to merge the saved fine-tuned parameters into the original LLaMA2 parameters before inference.
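The merge itself is just a state-dict update. Below is a minimal sketch with toy in-memory dicts; in practice the two dicts come from torch.load on the original LLaMA2 checkpoint and on the fine-tuned (trainable-params-only) checkpoint, and the key names and any "model" nesting depend on the repo, so treat them as assumptions:

```python
import torch

# Toy stand-ins for the two checkpoints (key names are illustrative).
base = {"tok_embeddings.weight": torch.zeros(4, 8),
        "layers.0.attention.wq.weight": torch.zeros(8, 8)}
finetuned = {"layers.0.attention.wq.weight": torch.ones(8, 8)}  # only trainable params were saved

# Overwrite only the fine-tuned entries; all other keys keep the base weights.
merged = dict(base)
merged.update(finetuned)
torch.save(merged, "consolidated_merged.pth")  # point --pretrained_path at the merged file
```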
Could you share the command you use for inference? I think this should work for the 7B multi-modal LLaMA2 adapter:
torchrun --nproc-per-node=1 demos/single_turn_mm.py \
--llama_config /path/to/params.json --tokenizer_path /path/to/tokenizer.model \
--llama_type llama_adapter \
--pretrained_path /path/to/llama_adapter_merged
Just make sure you did not forget to set llama_type to llama_adapter. Again, pretrained_path should point to the merged checkpoint.
We will update our code and docs to better support the PEFT+Multi-modal setting very soon; please stay tuned. If you are interested in multi-modal models, we recommend this one, which generally performs better.
Thanks for your kind help. I found that the above bug was caused by distributed training mode, which automatically adds a module. prefix to each parameter name. Looking forward to the PEFT+Multi-modal setting and your future work.
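For anyone hitting the same issue: the workaround amounts to stripping the DistributedDataParallel prefix from the checkpoint keys before load_state_dict. A generic PyTorch sketch (key names are illustrative):

```python
import torch

# Keys as saved by a DistributedDataParallel-wrapped model (illustrative).
state = {"module.layers.0.wq.weight": torch.zeros(2),
         "module.norm.weight": torch.zeros(2)}

# Strip the "module." prefix so keys match the unwrapped model,
# then pass the result to model.load_state_dict(state).
state = {k.removeprefix("module."): v for k, v in state.items()}
print(list(state))  # ['layers.0.wq.weight', 'norm.weight']
```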
I've tried both llama-adapter and llama2-accessory, but failed to replicate the multi-modal fine-tuning results. For llama-adapter, I used BIAS-7B.pth as the pre-trained checkpoint. For llama2-accessory, since there's no adapter checkpoint released, I trained it myself. Eventually, I got the following results.
For llama-adapter, it is like
For llama2-accessory, the result seems even weirder.
I did change the code for some compatibility issues since I can't use flash-attention or apex. I am sure the EOS token is added at the end.
Did I use the wrong hyperparameters? Attached are the corresponding shell scripts.
For llama-adapter,
For llama-accessory,
Appreciate any suggestions and advice. Thanks! 🙏