cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0
1.74k stars, 113 forks

HF transformers support #17

Open Iven2132 opened 4 months ago

Iven2132 commented 4 months ago

It would be cool if this model had HF transformers support. Also, what is this model's specialty? What is it particularly good at?

ellisbrown commented 4 months ago

@Iven2132 do you mean integration into HF transformers itself?

this may be a bit challenging to support both GPU & TPU, but is something we may investigate later.

our current code depends on HF transformers and should be able to be used with it.

we specifically target vision-centric capabilities, but our model is general-purpose. see more info on our site or in the paper https://cambrian-mllm.github.io/

Iven2132 commented 4 months ago

@ellisbrown Yes, I mean in HF transformers itself. What are the targeted vision-centric capabilities? Can it write code from a given UI, etc.?

ellisbrown commented 4 months ago

We didn't target generating code from a UI specifically. You can certainly try, but no guarantees there.

As for vision-centric capabilities: have a look at the benchmarks that we classify as "vision-centric" for a better idea—MMVP, Real World QA, and the CV-Bench we introduced.

You can read more about our CV-Bench benchmark in section 3.2. We test 4 different vision-centric capabilities.
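As a rough illustration of what "testing several capabilities" on a multiple-choice benchmark can look like, here is a minimal scoring sketch. It is not the Cambrian evaluation code: the record fields (`task`, `prediction`, `answer`) and both helper functions are hypothetical, and it simply assumes each question has a single correct option letter.

```python
from collections import defaultdict


def normalize(choice):
    """Reduce an option string such as "(a)" or " A " to a bare upper-case letter."""
    return choice.strip().strip("()").upper()


def accuracy_by_task(records):
    """Return {task: fraction of records whose prediction matches the answer}.

    Each record is a dict with hypothetical keys "task", "prediction", "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task rather than one pooled number makes it easier to see which capability a model struggles with, which a single hand-picked question cannot show.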

Iven2132 commented 4 months ago

@ellisbrown I'm still confused. I tried some visual question answering with the 34B model and it performed very badly. The only models that pass that question are Gemini 1.5 Pro/Flash, GPT-4o, and Claude.

Then how was Cambrian evaluated?