BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0
799 stars 61 forks

Image analysis model change #71

Closed chenzhu005774 closed 2 months ago

chenzhu005774 commented 2 months ago

I want to change the vision encoder to OpenAI's clip-vit-large-patch14-336 (https://huggingface.co/openai/clip-vit-large-patch14-336/tree/main). I replaced mm_vision_tower in the config.json of the Bunny model with the path to clip-vit-large-patch14-336, but that did not work.

Isaachhh commented 2 months ago

It won't work.

The hidden_size of CLIP and SigLIP are different, and the shape of mm_projector depends on hidden_size.

Even if the shapes were all the same, the model wouldn't work well: it was trained to map the vision features of one model to the LLM, but at inference it would be mapping a different kind of vision feature.
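To make the shape point concrete, here is a minimal sketch of the mismatch. The hidden sizes are the values published in each encoder's Hugging Face config. The specific SigLIP variant, the projector weight shape, and the helper name `projector_accepts` are assumptions for illustration, not read from a Bunny checkpoint.

```python
# Hidden sizes copied from each encoder's published Hugging Face config;
# hard-coded so the sketch runs offline.
VISION_HIDDEN_SIZE = {
    "google/siglip-so400m-patch14-384": 1152,  # assumed SigLIP variant
    "openai/clip-vit-large-patch14-336": 1024,
}

# Assumption: the first projector layer was trained on SigLIP features and
# stores its weight as (out_features, in_features), as torch.nn.Linear does.
# 2560 is an illustrative LLM width, not read from a Bunny checkpoint.
TRAINED_PROJECTOR_WEIGHT_SHAPE = (2560, 1152)


def projector_accepts(tower_name, weight_shape=TRAINED_PROJECTOR_WEIGHT_SHAPE):
    """Return True if the tower's feature width matches the projector input."""
    in_features = weight_shape[1]
    return VISION_HIDDEN_SIZE[tower_name] == in_features


for name in VISION_HIDDEN_SIZE:
    status = "ok" if projector_accepts(name) else "shape mismatch"
    print(f"{name}: {status}")
```

A passing shape check is only a necessary condition. As noted above, a projector trained on one encoder's features would still receive features from a different distribution, so the swap requires retraining.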

chenzhu005774 commented 2 months ago

I read the README and it seems to support CLIP. [image]

Isaachhh commented 2 months ago

It means that this codebase supports training a Bunny with CLIP as the vision backbone.