RE-N-Y / imscore

Minimal Differentiable Image Reward Functions

Long context rewards #1

Open nicolas-dufour opened 3 days ago

nicolas-dufour commented 3 days ago

Hey, do any of these models support longer texts? Some of the base reward models support fewer than 40 text tokens!

RE-N-Y commented 3 days ago

Which base model are you referring to? Most multimodal reward models are CLIP-based, so they should generally support maximum lengths of roughly 60 to 77 tokens. The ones I've trained, SiglipPreferenceScorer and CLIPPreferenceScorer, should support more than 40 text tokens.

If you're interested in models that support longer prompts, I can train one on top of jina-embeddings-v3, which supports up to 8192 tokens.

Also, if you have suggestions for good backbones for multimodal reward models, I'm happy to train on them and support those too.

nicolas-dufour commented 3 days ago

Hey, I was thinking of ImageReward, which has a 35-token cutoff: https://github.com/THUDM/ImageReward/blob/2ca71bac4ed86b922fe53ddaec3109fe94d45fd3/ImageReward/ImageReward.py#L110
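
To make the concern concrete, here is a minimal sketch of what a fixed-length cutoff does to a long prompt. It is only an illustration: whitespace splitting stands in for the real subword tokenizer (which typically yields more tokens per word and reserves slots for special tokens), and `truncate_prompt` is a hypothetical helper, not part of ImageReward or imscore.

```python
def truncate_prompt(prompt: str, max_tokens: int = 35) -> tuple[list[str], list[str]]:
    """Split a prompt into the tokens a fixed-length cutoff keeps and drops.

    Assumption: whitespace splitting is a stand-in for a real subword
    tokenizer, so actual token counts will differ.
    """
    tokens = prompt.split()
    return tokens[:max_tokens], tokens[max_tokens:]


# A synthetic 50-word prompt, longer than the 35-token limit.
prompt = " ".join(f"word{i}" for i in range(50))
kept, dropped = truncate_prompt(prompt, max_tokens=35)

# Everything in `dropped` never reaches the reward model.
print(f"kept {len(kept)} tokens, dropped {len(dropped)} tokens")
```

With a cutoff like this, any detail that appears late in a long synthetic prompt has no influence on the score.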

It would indeed be super useful to have a long-context version of these models, but I think the recent jina-clip-v2 is a better base model (jina-embeddings doesn't have an image tower, I believe): https://huggingface.co/jinaai/jina-clip-v2

This would be a great resource to have, as it would help with scoring synthetic prompts!

RE-N-Y commented 2 days ago

I actually haven't added ImageReward yet. But since it's a standard model used in image generation benchmarks, I will be adding it in the coming weeks. I will probably train a jina-clip-v2-based image preference model on pick-a-pic-v2 soon. I will keep the issue open until I implement those.