Lednik7 / CLIP-ONNX

A simple library to speed up CLIP inference by up to 3x (K80 GPU)
MIT License

Performance is inconsistent with the original model #3

Closed Cestlaviez closed 2 years ago

Cestlaviez commented 2 years ago

Hi, thanks for providing this useful tool! However, I found that the result produced by the generated ONNX model is inconsistent with the original CLIP model. Here is the code I used to test the original model:

import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu", jit=False)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).cpu()  # [1, 3, 224, 224]
text = clip.tokenize(["a diagram", "a dog", "a cat"]).cpu()  # [3, 77]

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)

probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)

The result is: Label probs: [[0.9927937 0.00421069 0.00299573]]

However, when using the onnx model, the result is: Label probs: [[0.41456965 0.29270944 0.29272085]].

Could you help me with this? Thanks!
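For context on why the two outputs look so different: CLIP's probabilities are a softmax over cosine similarities multiplied by a large learned logit scale (about 100). A minimal numpy sketch with toy embeddings (illustrative values, not from the actual model) shows how this works, and why a near-uniform output like the ONNX one suggests the embeddings themselves are wrong rather than slightly off numerically:

```python
import numpy as np

def clip_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, as in CLIP's forward pass."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * text_embs @ image_emb  # one logit per text
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
# Toy text embeddings: the first is close to the image, the other two are unrelated.
text_embs = np.stack([image_emb + 0.1 * rng.normal(size=512),
                      rng.normal(size=512),
                      rng.normal(size=512)])

probs = clip_probs(image_emb, text_embs)
print(probs)  # sharply peaked on the first label
```

Because the logit scale is so large, even modest errors in the encoder outputs flatten or scramble the distribution, consistent with the [0.41, 0.29, 0.29] result above.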

Lednik7 commented 2 years ago

Hey @Cestlaviez! Yes, I know about this problem, but I do not know how to solve it yet. I am convinced it is caused by the following lines:

from clip_onnx import clip_onnx, attention
clip.model.ResidualAttentionBlock.attention = attention

The problem is that ONNX does not export the multi-head attention layer correctly, which is why the attention is monkey-patched above. In most cases, however, the original model's highest-probability label still matches the ONNX model's.
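One way to quantify "the highest probability usually matches" is to compare the two models' outputs directly, checking both top-label agreement and numerical closeness. A small hypothetical helper (the function name is illustrative, not part of clip_onnx), applied to the values reported in this thread:

```python
import numpy as np

def compare_probs(ref_probs, onnx_probs, atol=1e-3):
    """Report whether two probability vectors agree on the top label
    and whether they are numerically close, plus the largest deviation."""
    ref = np.asarray(ref_probs, dtype=np.float64)
    out = np.asarray(onnx_probs, dtype=np.float64)
    same_top = int(ref.argmax()) == int(out.argmax())
    close = np.allclose(ref, out, atol=atol)
    max_diff = float(np.abs(ref - out).max())
    return same_top, close, max_diff

# Values reported in this issue:
same_top, close, max_diff = compare_probs(
    [0.9927937, 0.00421069, 0.00299573],
    [0.41456965, 0.29270944, 0.29272085])
print(same_top, close, max_diff)  # True False 0.578...
```

Here the argmax agrees (both pick "a diagram"), but the distributions are far from numerically close, which matches the observation that the top label usually survives even though the probabilities do not.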

zhangnju commented 2 years ago

If the line "clip.model.ResidualAttentionBlock.attention = attention" is commented out, it seems that the ONNX file can be exported successfully @Lednik7

Lednik7 commented 2 years ago

@zhangnju can you give an example code with the model?

Lednik7 commented 2 years ago

@Cestlaviez I updated the information in the README; it should help

Lednik7 commented 2 years ago

As of CLIP-ONNX version 1.2, the results are the same as the original model