Open sanketsans opened 1 year ago
Hello, I was wondering whether you were able to run this code? I found that the code has some bugs.
Me too. When I tried to run the training code on the AVA dataset with the MVIT_16X4.yaml config file, I got an error about an unexpected keyword argument 'drop_rate'. I am also having trouble downloading the Something-Something V2 and SomethingElse datasets, because the download page returns a 503 error. Is there any way to solve these issues?
Hello. The paper states that, in the ORViT block, the object-region attention uses different inputs for q, k and v: q is computed from the patch tokens, while k and v are computed from the concatenation of the patch tokens and the object-region tokens.
X has shape THW × d and C has shape T(HW+O) × d.
So, according to the paper, the object-region attention should be: Q = X Wq, K = C Wk, V = C Wv.
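For concreteness, here is a minimal NumPy sketch of that formulation as I read it from the paper. It uses plain softmax attention, not the repository's trajectory attention, and every name in it (`object_region_attention`, `patches`, `objects`, the toy sizes) is illustrative rather than taken from the ORViT code.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def object_region_attention(X, C, Wq, Wk, Wv):
    # Queries come from the patch tokens only.
    Q = X @ Wq
    # Keys and values come from the concatenated patch + object tokens.
    K = C @ Wk
    V = C @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

# Toy sizes (assumed for illustration only).
T, HW, O, d = 2, 4, 3, 8
rng = np.random.default_rng(0)
patches = rng.standard_normal((T, HW, d))   # patch tokens per frame
objects = rng.standard_normal((T, O, d))    # object-region tokens per frame

X = patches.reshape(T * HW, d)                                       # THW x d
C = np.concatenate([patches, objects], axis=1).reshape(-1, d)        # T(HW+O) x d
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = object_region_attention(X, C, Wq, Wk, Wv)
print(out.shape)  # (8, 8), i.e. one output row per patch token
```

With this formulation the output has one row per patch token (THW rows), even though the keys and values span all T(HW+O) tokens.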
However, in the code, the concatenated tokens are passed to the trajectory attention module: https://github.com/eladb3/ORViT/blob/3bfd2c707293f3187337cacdcf0ce538986627d8/slowfast/models/ORViT/orvit.py#L149
Also, in the trajectory attention module, https://github.com/eladb3/ORViT/blob/3bfd2c707293f3187337cacdcf0ce538986627d8/slowfast/models/attention.py#L479 , q, k and v are all computed from the same concatenated tokens.
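One thing worth noting: for plain softmax attention, running self-attention over the concatenated tokens and then keeping only the patch rows gives the same patch outputs as using the patch tokens as queries, because each output row depends only on its own query. The sketch below checks this numerically. It is only an illustration under that assumption; it does not use trajectory attention, and I have not confirmed that the repository slices the output this way. All names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # Plain scaled dot-product attention (NOT trajectory attention).
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Toy sizes (assumed for illustration only).
T, HW, O, d = 2, 4, 3, 8
rng = np.random.default_rng(0)
patches = rng.standard_normal((T, HW, d))
objects = rng.standard_normal((T, O, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

X = patches.reshape(-1, d)                                      # THW x d
C = np.concatenate([patches, objects], axis=1).reshape(-1, d)   # T(HW+O) x d

# Paper-style: queries from X, keys/values from C.
cross = attention(X, C, Wq, Wk, Wv)

# Code-style (hypothetical): q, k, v all from C, then keep the patch rows.
self_out = attention(C, C, Wq, Wk, Wv).reshape(T, HW + O, d)
patch_rows = self_out[:, :HW].reshape(-1, d)

print(np.allclose(cross, patch_rows))
```

If the object rows are dropped somewhere after the attention call, the two versions would agree for this simple case; whether the same holds for trajectory attention as implemented is exactly what I am unsure about.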
Could you please help me understand this? I can't find where the original patch tokens are used as q in the trajectory attention mechanism.
Thanks :)