How can the cosine similarity between these two projections be calculated when the image embedding is [1, 1024] and the text embedding is [1, 512]? These shapes aren't compatible. The original CLIP has both image and text at 512. I'm trying to test the CLIP-large model (ViT-L-16) to start, but I'm not having any luck getting it running.
I figured it out. I was accidentally calling model.visual(image) (which works in vanilla CLIP) instead of encode_image(image) -- visual() returns the pre-projection features, while encode_image() applies the projection head. When I corrected this to use encode_image, it outputs [1, 512] :)
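For anyone hitting the same shape mismatch, here's a minimal sketch of the similarity computation once both projections are the same size. The model calls are stubbed with random NumPy arrays standing in for encode_image(image) and encode_text(text), so this runs standalone; the 512 dimension matches the projected embedding size mentioned above.

```python
import numpy as np

# Stand-ins for model.encode_image(image) and model.encode_text(text).
# Both must come from the projection heads, so they share one dimension.
image_features = np.random.randn(1, 512)
text_features = np.random.randn(1, 512)

# L2-normalize each embedding; cosine similarity is then a dot product.
image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

cosine_sim = image_features @ text_features.T  # shape (1, 1), values in [-1, 1]
print(cosine_sim.shape)
```

If you accidentally use the pre-projection [1, 1024] visual features here, the matrix product fails with a dimension mismatch, which is exactly the symptom in the original question.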