RyanWangZf / MedCLIP

EMNLP'22 | MedCLIP: Contrastive Learning from Unpaired Medical Images and Texts

Problems when running the demo code #12

Closed lln556 closed 1 year ago

lln556 commented 1 year ago

Thanks for sharing your work.

After downloading the pre-trained model, I ran the sample code "As simple as using CLIP" without any modifications. The following error appears:

Traceback (most recent call last):
  File "D:/Projects/MedCLIP/11.py", line 20, in <module>
    outputs = model(**inputs)
  File "D:\Programming environment\Python--3.8.9\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Projects\MedCLIP\medclip\modeling_medclip.py", line 216, in forward
    logits_per_image = self.compute_logits(img_embeds, text_embeds)
  File "D:\Projects\MedCLIP\medclip\modeling_medclip.py", line 230, in compute_logits
    logits_per_text = torch.matmul(text_emb, img_emb.t()) * logit_scale
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x768 and 512x1)
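For reference, my 11.py is essentially the README demo copied as-is; it is roughly equivalent to the sketch below (the image path and the report sentences are placeholders, and I am paraphrasing the README's API as I understand it):

```python
from PIL import Image
from medclip import MedCLIPModel, MedCLIPVisionModelViT, MedCLIPProcessor

# Build the processor and the model, then load the released pre-trained weights.
processor = MedCLIPProcessor()
model = MedCLIPModel(vision_cls=MedCLIPVisionModelViT)
model.from_pretrained()

# Placeholder image path and report sentences standing in for the README example.
image = Image.open('./example_data/view1_frontal.jpg')
inputs = processor(
    text=["lungs remain severely hyperinflated with upper lobe emphysema",
          "opacity left costophrenic angle is new since prior exam"],
    images=image,
    return_tensors="pt",
    padding=True,
)

# Line 20 of my script, where the RuntimeError above is raised.
outputs = model(**inputs)
print(outputs)
```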

I haven't used Hugging Face before. Is there a problem with the BERT model, or is it something else?

Waiting for your reply. Thanks, sincerely.

lln556 commented 1 year ago

There is a commented-out line at line 41 of modeling_medclip.py. After I uncommented that line, the program runs normally. Am I doing the right thing? Sincerely waiting for your answer.

RyanWangZf commented 1 year ago

Thanks for noticing! You can fix it by uncommenting that line. You can use MedCLIPVisionModelViT normally. The problem comes from the different pretraining strategies of the ViT-based and ResNet-based models. I will fix the ResNet version later.
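In the meantime, constructing the model with the ViT vision class is the simplest workaround. A minimal sketch, following the same README-style API as the demo above (treat the exact names as approximate if your local copy differs):

```python
from medclip import MedCLIPModel, MedCLIPVisionModelViT

# Per the note above, the ViT backbone works without editing modeling_medclip.py;
# the shape mismatch only shows up with the ResNet-based pre-trained weights.
model = MedCLIPModel(vision_cls=MedCLIPVisionModelViT)
model.from_pretrained()
```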

QtacierP commented 1 year ago

It seems that the logits from the resnet50 pre-trained weights do not work well. I get extremely low accuracy for prompt classification when I use resnet50, but it works pretty well with the ViT backbone.
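For reference, this is roughly how I run the prompt classification check, following the repo's prompt-based classification example as I remember it (the helper names and arguments below may be slightly off, and the image path is a placeholder):

```python
from PIL import Image
from medclip import MedCLIPModel, MedCLIPVisionModelViT, MedCLIPProcessor, PromptClassifier
from medclip.prompts import generate_chexpert_class_prompts, process_class_prompts

# Build the model with the ViT backbone; swapping in the ResNet-based vision
# class reproduces the low-accuracy behaviour I described above.
model = MedCLIPModel(vision_cls=MedCLIPVisionModelViT)
model.from_pretrained()
clf = PromptClassifier(model, ensemble=True)

# Encode one image.
processor = MedCLIPProcessor()
image = Image.open('./example_data/view1_frontal.jpg')
inputs = processor(images=image, return_tensors="pt")

# Generate CheXpert class prompts and attach them to the inputs.
cls_prompts = process_class_prompts(generate_chexpert_class_prompts(n=10))
inputs['prompt_inputs'] = cls_prompts

output = clf(**inputs)
print(output)
```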