TIBHannover / MSVA

Deep learning model for supervised video summarization called Multi Source Visual Attention (MSVA)

Issue regarding the last step of self attention (weighted sum step) #3

Open pangzss opened 3 years ago

pangzss commented 3 years ago

Hi, I noticed that the last step of the self-attention calculation doesn't seem right:

att_weights_ = nn.functional.softmax(logits, dim=-1)       
weights = self.dropout(att_weights_)     
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)

So here the softmax probabilities are computed along dim=-1, i.e., the column direction, so each row of the weights sums to 1. But then the weighted sum is taken along the row direction, according to this line:

y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)

I think we should instead do something like this:

y = torch.matmul(weights, V)
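For illustration, here is a minimal standalone sketch with random tensors (not the repository code) that makes the mismatch visible:

import torch
import torch.nn as nn

n, d = 5, 8
V = torch.randn(n, d)          # frame features
logits = torch.randn(n, n)     # raw attention scores

weights = nn.functional.softmax(logits, dim=-1)   # each row sums to 1

y_current = torch.matmul(V.transpose(1, 0), weights).transpose(1, 0)   # current code
y_proposed = torch.matmul(weights, V)                                  # proposed fix

# (V^T W)^T = W^T V, so the current code mixes the frames with the columns
# of the softmax output, and those columns do not sum to 1.
print(torch.allclose(y_current, torch.matmul(weights.transpose(1, 0), V)))   # True
print(torch.allclose(y_current, y_proposed))                                 # False in general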

What do you think? I hope I'm the one who needs to be corrected.

noirsora1605 commented 3 years ago

Hi, could you please guide me on how to summarize my own video?

Junaid112 commented 3 years ago

> Hi, could you please guide me on how to summarize my own video?

You have to first extract features for the frames of the video, and then, based on the trained model, you can predict each frame's probability of being included in the summary. For object features, refer to https://github.com/VideoAnalysis/EDUVSUM/tree/master/src

I will upload the motion feature code after refactoring.
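As a rough starting point, a feature extraction sketch could look like the one below; it assumes a GoogLeNet backbone from torchvision and sampling every 15th frame, and the helper name extract_frame_features is a placeholder, not the repository API:

import cv2
import torch
from torchvision import models, transforms

# Hypothetical sketch, not the MSVA repository API: sample frames, extract
# per-frame CNN features, then score the frames with a trained model.
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

cnn = models.googlenet(pretrained=True)
cnn.fc = torch.nn.Identity()   # keep the 1024-d pooled features
cnn.eval()

def extract_frame_features(video_path, every_n=15):
    cap = cv2.VideoCapture(video_path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                feats.append(cnn(preprocess(rgb).unsqueeze(0)).squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(feats)   # (num_sampled_frames, 1024)

# With a trained MSVA checkpoint, the per-frame summary probabilities would then
# come from a forward pass over these (and the motion) features; the exact
# loading/inference calls depend on the repository code.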

mpalaourg commented 3 years ago

I fully agree with @pangzss. If my calculations are right, the formula/command used here,

y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)

would be correct only if the weights array were symmetric, but this isn't the case. Oddly enough, the produced results don't change much when the corrected formula/command is used.
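A quick standalone check of the symmetry point (random tensors, not the repository code):

import torch

n, d = 6, 4
V = torch.randn(n, d)

# Generic (non-symmetric) softmax weights: the two formulas disagree.
W = torch.softmax(torch.randn(n, n), dim=-1)
print(torch.allclose(torch.matmul(V.transpose(1, 0), W).transpose(1, 0),
                     torch.matmul(W, V)))          # False

# Symmetric weights: (V^T W)^T = W^T V = W V, so the two coincide.
W_sym = (W + W.transpose(1, 0)) / 2
print(torch.allclose(torch.matmul(V.transpose(1, 0), W_sym).transpose(1, 0),
                     torch.matmul(W_sym, V)))      # True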