sohananisetty opened 1 year ago
That's weird. Your code looks good to me, but we do know that the cosine similarity should work to some extent according to the action classification experiment. Did you try using it as a reference?
I ran the script using the general model. I was getting:
Top-5 Acc. : 29.86% (637/2133)
Top-1 Acc. : 13.41% (286/2133)
Using the finetuned model:
Top-5 Acc. : 63.72% (1354/2125)
Top-1 Acc. : 44.99% (956/2125)
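For context, the Top-k numbers above can be computed from a motion-to-text similarity matrix roughly like this (a sketch of the standard retrieval metric, not necessarily the repo's exact evaluation script; it assumes the ground-truth text for query i sits at column i):

```python
import numpy as np

def top_k_accuracy(sims, k):
    """sims: (n_queries, n_candidates) similarity matrix.

    Assumes the correct candidate for query i is at index i
    (the diagonal), which is the usual retrieval-eval setup.
    """
    ranked = np.argsort(-sims, axis=-1)  # candidate indices, best first
    gt = np.arange(sims.shape[0])[:, None]
    hits = (ranked[:, :k] == gt).any(axis=-1)
    return hits.mean(), int(hits.sum())

# Sanity check: an identity similarity matrix gives perfect Top-1
sims = np.eye(4)
acc, n_correct = top_k_accuracy(sims, 1)
```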
I assumed the zero-shot nature of CLIP would at least provide some generalizability. But that does not seem to be the case.
Following CLIP, where we compute the image and text embeddings and use their similarities to retrieve the best-matching text, I tried the same with motion and text embeddings, but it does not work.
E.g., using the AMASS dataset with bs = 2 and texts 'jump', 'dancing':
The expected output for similarity[0] is a high "jump" probability, but I get a high "dancing" probability instead. I have tested this with multiple batches, and the correct text does not get the highest similarity most of the time. Am I running inference incorrectly?
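To make the setup concrete, here is a minimal sketch of the CLIP-style retrieval I'm describing. The random `motion_emb` and `text_emb` arrays are hypothetical stand-ins for the actual encoder outputs, and the temperature value is just an illustrative choice:

```python
import numpy as np

def retrieve(motion_emb, text_emb, temperature=0.07):
    # L2-normalize both sets of embeddings so the dot product
    # below is a cosine similarity
    m = motion_emb / np.linalg.norm(motion_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sims = m @ t.T  # (n_motions, n_texts) cosine similarities
    # Softmax over the text axis turns similarities into
    # per-motion "probabilities", as in CLIP's zero-shot setup
    logits = sims / temperature
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return sims, probs

# Toy example mirroring bs = 2 with texts ['jump', 'dancing'];
# random embeddings stand in for the real encoder outputs
rng = np.random.default_rng(0)
motion_emb = rng.normal(size=(2, 512))
text_emb = rng.normal(size=(2, 512))
sims, probs = retrieve(motion_emb, text_emb)
best = probs.argmax(axis=-1)  # best-matching text index per motion
```

In this setup the correct text for `similarity[0]` should receive the highest probability in `probs[0]`; that is the behavior I am not seeing with the real motion embeddings.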