Open nicolas-dufour opened 3 days ago
Which base model are you referring to?
Most multimodal reward models are CLIP-based, so they generally support a maximum of roughly 60–77 text tokens.
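For context, the limit comes from the fixed text context of CLIP-style encoders: the prompt is capped and anything beyond the window is silently dropped. A minimal sketch of that truncation, assuming a CLIP-style 77-position context and the OpenAI CLIP BOS/EOS token ids (the function name is mine, for illustration only):

```python
def clip_truncate(token_ids, max_len=77, bos=49406, eos=49407):
    """Cap a CLIP-style text sequence at max_len total positions:
    BOS + at most (max_len - 2) body tokens + EOS. Tokens past the
    window are silently discarded, which is why long prompts lose
    information with these backbones."""
    body = list(token_ids)[: max_len - 2]
    return [bos] + body + [eos]
```

So a 200-token prompt keeps only its first 75 body tokens; with a 64-position window (SigLIP-style), even less survives.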
The ones I've trained, SiglipPreferenceScorer and CLIPPreferenceScorer, should support more than 40 text tokens.
If you're interested in models with longer prompts, I can train one on top of jina-embeddings-v3, which supports up to 8192 tokens.
Also, if you have suggestions for good backbones for multimodal reward models, happy to train one and support those too.
Hey, I was thinking of ImageReward, which has a 35-token cutoff: https://github.com/THUDM/ImageReward/blob/2ca71bac4ed86b922fe53ddaec3109fe94d45fd3/ImageReward/ImageReward.py#L110
It would indeed be super useful to have a long-context version of this model, but I think the recent jina-clip-v2 is a better base model (jina-embeddings doesn't have an image tower, I believe): https://huggingface.co/jinaai/jina-clip-v2.
This would be a great resource to have, as it would help with scoring synthetic prompts!
I actually haven't added ImageReward yet. But since it's a standard model used in image generation benchmarks, I will be adding it in the coming weeks. I will probably also train a jina-clip-v2-based image preference model on pick-a-pic-v2 soon.
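Preference scorers like these are commonly trained with a pairwise Bradley-Terry (logistic) loss on the scores of the preferred vs. rejected image in each pick-a-pic pair. Whether this repo uses exactly this objective is my assumption; a minimal sketch:

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(s_w - s_l)),
    written as log1p(exp(-diff)) for numerical stability.
    The loss shrinks as the preferred image's score rises
    above the rejected image's score."""
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))
```

With equal scores the loss is log 2, and it decreases monotonically as the margin in favor of the preferred image grows.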
I will keep the issue open until I implement those.
Hey, out of these models, which ones support longer texts? Some of the base reward models support fewer than 40 text tokens!