jspsiy opened this issue 2 weeks ago
Hi, this work was done a long time ago and I can hardly remember what we explored. I do remember that we tried a ViT-based CLIP, but I have forgotten the results. If you could share some details (including implementation details and results), I would be happy to analyze them and provide some intuition.
I'm curious whether I can use this CLIP to replace other CLIP models in a network. I'd also like to replace the Transformer with ViT-H. I've tried this so far without much luck, so I wanted to ask: does this only work with RN50?