It won't work.
Well, the `hidden_size` of CLIP and SigLIP are different, and the shape of the `mm_projector` depends on `hidden_size`.
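You can see the mismatch with a quick config check (a minimal sketch; it assumes Bunny's default tower is `google/siglip-so400m-patch14-384`, and the printed sizes are what those configs report):

```python
from transformers import AutoConfig

# Compare the vision hidden sizes of the two towers.
# Assumption: Bunny's default vision tower is google/siglip-so400m-patch14-384.
clip_cfg = AutoConfig.from_pretrained("openai/clip-vit-large-patch14-336")
siglip_cfg = AutoConfig.from_pretrained("google/siglip-so400m-patch14-384")

print(clip_cfg.vision_config.hidden_size)    # 1024 (ViT-L/14-336)
print(siglip_cfg.vision_config.hidden_size)  # 1152 (SoViT-400m/14-384)

# The mm_projector's first linear layer takes the vision hidden_size as
# its input dimension, so weights trained against 1152-d SigLIP features
# cannot be loaded once the tower produces 1024-d CLIP features.
```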
Actually, even if the shapes were all the same, the model still wouldn't work well: it was trained to map one model's vision features into the LLM, so at inference it would be fed a different kind of vision feature than it ever saw during training.
I read the README file and it seems to support CLIP.
It means that this codebase supports training a Bunny with CLIP as the vision backbone.
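In other words, switching towers requires (re)training the projector against the new feature dimension. A minimal sketch of the idea follows; it is illustrative rather than Bunny's actual code, and it assumes a LLaVA-style two-layer MLP projector and phi-2 as the LLM:

```python
import torch.nn as nn
from transformers import CLIPVisionModel

# Illustrative sketch only -- not Bunny's actual module names or layout.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vision_hidden = vision_tower.config.hidden_size  # 1024 for ViT-L/14-336
llm_hidden = 2560  # phi-2's hidden size; depends on the chosen LLM

# A fresh projector sized to the new tower; it must be trained, since the
# released weights expect 1152-d SigLIP features as input.
mm_projector = nn.Sequential(
    nn.Linear(vision_hidden, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)
```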
I want to change the image encoder to OpenAI's clip-vit-large-patch14-336 (https://huggingface.co/openai/clip-vit-large-patch14-336/tree/main). I directly replaced `mm_vision_tower` in the `config.json` of Bunny's model with the path to clip-vit-large-patch14-336, but that did not work.