TencentARC / CustomNet

Apache License 2.0
258 stars · 9 forks

CLIP-T scores are much lower than what other papers reported #2

Closed askerlee closed 4 months ago

askerlee commented 9 months ago

Nice work and impressive example images. But when I came across the quantitative results, I saw that the reported CLIP-T scores of the baseline methods are generally far off from what was reported in the original papers. For example, DreamBooth reports a CLIP-T of 0.305, and BLIP-Diffusion reports 0.32.

May I know what's the reason behind this? Thanks.

jiangyzy commented 9 months ago

We use only the background texts to compute CLIP-T, which don't contain the object category texts, while the images do contain the objects, so our CLIP-T scores differ from other reports.
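The protocol described above can be sketched as follows. This is a minimal illustration, not CustomNet's actual evaluation code: the helper names (`strip_subject`, `clip_t`) and the prompt format are assumptions, and the embeddings are assumed to come from a CLIP image/text encoder.

```python
# Hedged sketch of the background-only CLIP-T protocol described above.
# `strip_subject` and `clip_t` are illustrative names, not CustomNet's code.
import numpy as np

def strip_subject(prompt: str, subject: str) -> str:
    """Remove the object-category phrase so only the background text remains,
    e.g. "a dog in front of a blue house" -> "in front of a blue house"."""
    return " ".join(prompt.replace(subject, "").split())

def clip_t(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """CLIP-T: mean cosine similarity between paired image and text
    embeddings (shape [N, D] each), assumed to come from a CLIP model."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(img * txt, axis=1)))

print(strip_subject("a dog in front of a blue house", "a dog"))
# -> "in front of a blue house"
```

Scoring generated images against the background-only text (rather than the full prompt) is what makes these numbers incomparable with CLIP-T values reported elsewhere, which use the full prompt including the subject.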

askerlee commented 9 months ago

I see. Thanks for the explanation. But what if the subject is not rendered and only the background is? For example, with "a dog in front of a blue house", would an image containing only a blue house (and no dog) also receive a high CLIP-T score under your scheme?

jiangyzy commented 9 months ago

Yes, but this case was never observed among these models during our comparison. That is also why DINO-I and CLIP-I are important complementary metrics.

askerlee commented 9 months ago

I see. Thanks. When I evaluate other models, such cases do sometimes happen.

jiangyzy commented 9 months ago

Thank you for your valuable suggestions.