Mikael17125 closed this issue 3 months ago
@Mikael17125 I've looked at the code and there does seem to be an issue: it doesn't consider the case where the ViT (Vision Transformer) doesn't need training. You can address this by simply forcing the `need_visual_encoder` flag to `True`:
https://github.com/InternLM/xtuner/blob/main/xtuner/model/llava.py#L574
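For reference, the workaround amounts to something like the sketch below. The flag name mirrors the comment above, but the class and surrounding logic are illustrative assumptions, not the actual code at the linked line:

```python
class LLaVAModel:
    """Illustrative stand-in for the model in xtuner/model/llava.py
    (an assumption for demonstration, not the real class)."""

    def __init__(self, freeze_visual_encoder: bool = True):
        # Presumably the flag is derived from whether the ViT is trained,
        # so a frozen ViT would be skipped during checkpoint handling.
        need_visual_encoder = not freeze_visual_encoder
        # Workaround from the comment above: force the flag to True so the
        # finetuned ViT weights are always included.
        need_visual_encoder = True
        self.need_visual_encoder = need_visual_encoder


model = LLaVAModel(freeze_visual_encoder=True)
print(model.need_visual_encoder)  # True regardless of the freeze setting
```

This is only a sketch of the idea; in practice you would edit the condition at the linked line in `llava.py` directly.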
That works. I tried to finetune the ViT, and it works well.
After finetuning, I can convert the .pth checkpoint to the `official` and `xtuner` formats; however, I cannot convert it to the `huggingface` format because of some errors. Please help me. Here is my finetuned config: