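A question: I ran the example code from Hugging Face with the same test material, but my results differ greatly from those of the Hugging Face online demo. What could be the cause?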

qengli commented 1 year ago

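This is the result I got:

This is the result from the Hugging Face demo:

The Hugging Face result is clearly more accurate. I tried many other samples and the Hugging Face result was better every time.

The code I used is also the example from Hugging Face: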

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0219, 0.0316, 0.0043, 0.9423]]

Both runs use the same model, chinese-clip-vit-large-patch14-336px. The result in the code comment matches what I get locally, [[0.0219, 0.0316, 0.0043, 0.9423]], but the result on Hugging Face is better. Why is that?

DtYXs commented 1 year ago

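Hello, when you ran it in the Hugging Face Space, did you change the prompt template? The default is 一张{}的图片。 (roughly "a picture of {}."). If you change it to {}, the text input becomes the same as the texts in your code, and in my test the results were then more consistent. Alternatively, in your local code you can change texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"] to texts = ["一张杰尼龟的图片。", "一张妙蛙种子的图片。", "一张小火龙的图片。", "一张皮卡丘的图片。"] and try again.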
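To avoid editing every label by hand, the template can be applied programmatically. This is only a minimal sketch: `PROMPT_TEMPLATE` and `build_prompts` are hypothetical names introduced here for illustration, and the template string is the default quoted above, not something read from the Space's source code.

```python
# Hypothetical helper: fill each label into a prompt template before
# handing the texts to the processor.
# Assumption: the demo's default template is the one quoted in the reply above.
PROMPT_TEMPLATE = "一张{}的图片。"


def build_prompts(labels, template=PROMPT_TEMPLATE):
    """Return one prompt per label; a template of "{}" leaves labels unchanged."""
    return [template.format(label) for label in labels]


labels = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
texts = build_prompts(labels)
print(texts)

# `texts` can then replace the bare labels in the original snippet, e.g.:
# inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
```

Passing `template="{}"` reproduces the bare-label input from the original code, which makes it easy to compare both settings on the same image.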

qengli commented 1 year ago

That was indeed the cause, thank you! Great project, keep it up!

OFA-Sys / Chinese-CLIP

Issue #119: results from the Hugging Face example code differ greatly from the Hugging Face online demo