sohananisetty opened 1 year ago
That's weird. Your code looks good to me, but we do know that the cosine similarity should work to some extent according to the action classification experiment. Did you try using it as a reference?
I ran the script using the general model. I was getting:
Top-5 Acc. : 29.86% (637/2133)
Top-1 Acc. : 13.41% (286/2133)
Using the finetuned model:
Top-5 Acc. : 63.72% (1354/2125)
Top-1 Acc. : 44.99% (956/2125)
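For context, the Top-k numbers above can be computed from a motion-to-text similarity matrix roughly like this (a sketch of the standard retrieval metric, not necessarily the repo's exact evaluation script; it assumes the ground-truth text for query i sits at column i):

```python
import numpy as np

def top_k_accuracy(sims, k):
    """sims: (n_queries, n_candidates) similarity matrix.

    Assumes the correct candidate for query i is at index i
    (the diagonal), which is the usual retrieval-eval setup.
    """
    ranked = np.argsort(-sims, axis=-1)  # candidate indices, best first
    gt = np.arange(sims.shape[0])[:, None]
    hits = (ranked[:, :k] == gt).any(axis=-1)
    return hits.mean(), int(hits.sum())

# Sanity check: an identity similarity matrix gives perfect Top-1
sims = np.eye(4)
acc, n_correct = top_k_accuracy(sims, 1)
```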
I assumed the zero-shot nature of CLIP would at least provide some generalizability. But that does not seem to be the case.
Following CLIP, where we compute the image and text embeddings and use their similarities to retrieve the best-matching text, I tried the same with motion and text embeddings, but it does not work.
E.g., using the AMASS dataset with bs = 2 and texts 'jump', 'dancing':
The expected output for similarity[0] is a high "jump" probability, but I get a high "dancing" probability instead. I have tested this with multiple batches, and the correct text does not get the highest similarity most of the time. Am I running inference incorrectly?
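To make the setup concrete, here is a minimal sketch of the CLIP-style retrieval I'm describing. The random `motion_emb` and `text_emb` arrays are hypothetical stand-ins for the actual encoder outputs, and the temperature value is just an illustrative choice:

```python
import numpy as np

def retrieve(motion_emb, text_emb, temperature=0.07):
    # L2-normalize both sets of embeddings so the dot product
    # below is a cosine similarity
    m = motion_emb / np.linalg.norm(motion_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sims = m @ t.T  # (n_motions, n_texts) cosine similarities
    # Softmax over the text axis turns similarities into
    # per-motion "probabilities", as in CLIP's zero-shot setup
    logits = sims / temperature
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return sims, probs

# Toy example mirroring bs = 2 with texts ['jump', 'dancing'];
# random embeddings stand in for the real encoder outputs
rng = np.random.default_rng(0)
motion_emb = rng.normal(size=(2, 512))
text_emb = rng.normal(size=(2, 512))
sims, probs = retrieve(motion_emb, text_emb)
best = probs.argmax(axis=-1)  # best-matching text index per motion
```

In this setup the correct text for `similarity[0]` should receive the highest probability in `probs[0]`; that is the behavior I am not seeing with the real motion embeddings.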