How can the cosine similarity between these two projections be calculated when the image embedding is [1, 1024] and the text embedding is [1, 512]? These shapes aren't compatible. The original CLIP has both image and text at 512. I'm trying to test the CLIP-large model (ViT-L-16) to start, but I'm not having any luck getting it running.
I figured it out. I was accidentally calling model.visual(image) (which works in vanilla CLIP) instead of encode_image(image) -- visual() returns the pre-projection features, while encode_image() applies the projection head. When I corrected this to use encode_image, it outputs [1, 512] :)
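For anyone hitting the same shape mismatch, here's a minimal sketch of the similarity computation once both projections are the same size. The model calls are stubbed with random NumPy arrays standing in for encode_image(image) and encode_text(text), so this runs standalone; the 512 dimension matches the projected embedding size mentioned above.

```python
import numpy as np

# Stand-ins for model.encode_image(image) and model.encode_text(text).
# Both must come from the projection heads, so they share one dimension.
image_features = np.random.randn(1, 512)
text_features = np.random.randn(1, 512)

# L2-normalize each embedding; cosine similarity is then a dot product.
image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

cosine_sim = image_features @ text_features.T  # shape (1, 1), values in [-1, 1]
print(cosine_sim.shape)
```

If you accidentally use the pre-projection [1, 1024] visual features here, the matrix product fails with a dimension mismatch, which is exactly the symptom in the original question.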