Closed: tqvinhcs closed this issue 4 years ago
The input shapes are:
video_embd: batch x D
text_embd: (batch * number_of_positive_candidates) x D
where D = 512 and number_of_positive_candidates = 5.
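To make those shapes concrete, here is a small dependency-free sketch (plain Python lists rather than the actual tensors). The sizes are shrunk for readability, the embedding values are made up, and the row layout of text_embd (candidates of the same clip stored contiguously) is an assumption, not something stated in this thread.

```python
# Illustrative sizes only; the thread states D = 512 and 5 candidates per clip.
B, K, D = 2, 5, 4  # batch size, positive candidates per clip, embedding dim

# video_embd has B rows of length D; text_embd has B * K rows of length D.
# The values are arbitrary placeholders.
video_embd = [[float(i + d) for d in range(D)] for i in range(B)]
text_embd = [[float(r * d) for d in range(D)] for r in range(B * K)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Assumption: rows of text_embd are grouped by clip, so row j * K + k
# holds candidate k of clip j.
# x[i][j][k] = similarity between video i and candidate k of clip j.
x = [[[dot(video_embd[i], text_embd[j * K + k]) for k in range(K)]
      for j in range(B)]
     for i in range(B)]

print(len(x), len(x[0]), len(x[0][0]))  # 2 2 5
```

So the similarity scores form a batch x batch x number_of_positive_candidates array, which is the 3-D x that the permute(1,0,2) call operates on.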
The transpose is there so that the denominator contains negatives for both the video and the text. If you do not concatenate the transpose, you only get the negatives for the video or the negatives for the text, not both.
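A rough plain-Python sketch of what the concatenation does, for anyone reading later. It is not the repository code: the helper names are hypothetical, x is assumed to be the batch x batch x candidates similarity array, and the numerator (log-sum-exp over the positives of pair i) is my assumption about the rest of the loss, since only the denominator line is quoted in this thread.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a flat list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def denominator_terms(x, i):
    """Scores entering the denominator for pair i (hypothetical helper).

    x[i][j][k] is the similarity of video i with candidate k of clip j.
    """
    B = len(x)
    # First half (x itself): video i against the texts of every clip.
    row = [s for j in range(B) for s in x[i][j]]
    # Second half (the transpose): every video against the texts of clip i.
    col = [s for j in range(B) for s in x[j][i]]
    # Note: the positives x[i][i] show up in both halves.
    return row + col


def mil_nce_sketch(x):
    """Mean over i of logsumexp(denominator) - logsumexp(positives)."""
    B = len(x)
    losses = []
    for i in range(B):
        numerator = logsumexp(x[i][i])  # assumed: positives of pair i
        denominator = logsumexp(denominator_terms(x, i))
        losses.append(denominator - numerator)
    return sum(losses) / B


# Tiny example with B = 2 and one candidate per clip.
x = [[[2.0], [1.0]],
     [[-1.0], [2.0]]]
print(denominator_terms(x, 0))  # [2.0, 1.0, 2.0, -1.0]
```

For pair 0, the score 1.0 (video 0 with the text of clip 1) comes from x, while -1.0 (video 1 with the text of clip 0) only appears because of the transpose. A logsumexp over x alone along dim=1 would miss that second kind of negative.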
I see. Thanks a lot for the clarification.
Hi,
1) What is the input to the MILNCELoss function? Is it video_embd: batch x D and text_embd: batch x D, where D = 512?
2) Why do you concatenate x and its transpose here?
denominator = th.cat((x, x.permute(1,0,2)), dim=1).view(x.shape[0], -1)
Doesn't th.logsumexp(x, dim=1) already compute the log-sum for the denominator?
Thanks