eladb3 / ORViT

"Object-Region Video Transformers", Herzig et al., CVPR 2022
Apache License 2.0

Object Region Attention #12

Open sanketsans opened 1 year ago

sanketsans commented 1 year ago

Hello. The paper says that in the ORViT block, object-region attention uses different inputs for q, k and v: q is computed from the patch tokens, while k and v are computed from the concatenation of the patch tokens and the object-region tokens.

X has shape THW × d, C has shape T(HW+O) × d

So, according to the paper, object-region attention should be: Q = XWq, K = CWk, V = CWv
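To make the shapes concrete, here is a minimal NumPy sketch of the formulation as I read it from the paper. It uses plain single-head softmax attention, not the trajectory attention from the repository, and every name in it (`object_region_attention`, `patches`, `objects`, the toy sizes) is hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def object_region_attention(X, C, Wq, Wk, Wv):
    # Queries come from the patch tokens only;
    # keys and values come from the concatenated patch + object tokens.
    Q = X @ Wq                      # (T*HW, d)
    K = C @ Wk                      # (T*(HW+O), d)
    V = C @ Wv                      # (T*(HW+O), d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V      # (T*HW, d)

# Toy sizes (assumptions, not values from the paper or the repo).
T, HW, O, d = 2, 4, 3, 5
rng = np.random.default_rng(0)
patches = rng.standard_normal((T, HW, d))
objects = rng.standard_normal((T, O, d))

X = patches.reshape(T * HW, d)
# Concatenate per frame, then flatten: T*(HW+O) tokens in total.
C = np.concatenate([patches, objects], axis=1).reshape(T * (HW + O), d)

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = object_region_attention(X, C, Wq, Wk, Wv)
print(X.shape, C.shape, out.shape)
```

The point is only that the output has one row per patch token (THW rows), even though the keys and values cover T(HW+O) tokens.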

However, in the code, I see that the concatenated tokens are passed to the trajectory attention module: https://github.com/eladb3/ORViT/blob/3bfd2c707293f3187337cacdcf0ce538986627d8/slowfast/models/ORViT/orvit.py#L149

Also, in the trajectory attention module (https://github.com/eladb3/ORViT/blob/3bfd2c707293f3187337cacdcf0ce538986627d8/slowfast/models/attention.py#L479), q, k and v are all computed from the same concatenated tokens.
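One thing that may be relevant when comparing the two: for plain softmax attention, each output row depends only on its own query, so running self-attention over the concatenated tokens and keeping only the patch rows gives the same result as using the patch tokens as queries. The sketch below checks this numerically. It is an assumption-laden illustration in NumPy with hypothetical names; I have not verified that the repository slices the output this way, and trajectory attention has additional steps for which this equivalence may not hold.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    # Plain single-head softmax attention (not trajectory attention).
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Toy sizes (assumptions for illustration only).
T, HW, O, d = 2, 4, 3, 5
rng = np.random.default_rng(0)
patches = rng.standard_normal((T, HW, d))
objects = rng.standard_normal((T, O, d))

X = patches.reshape(-1, d)                                   # patch tokens
C = np.concatenate([patches, objects], axis=1).reshape(-1, d)  # all tokens

# Boolean mask marking which rows of C are patch tokens (per-frame layout).
is_patch = np.tile(np.arange(HW + O) < HW, T)

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

cross = attention(X, C, Wq, Wk, Wv)                      # Q from X, K/V from C
self_then_slice = attention(C, C, Wq, Wk, Wv)[is_patch]  # Q/K/V from C, keep patch rows

same = np.allclose(cross, self_then_slice)
print(same)  # True
```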

Can you please help me understand this? I can't seem to find where the original patch tokens are used as q in the trajectory attention mechanism.

Thanks :)

malei207 commented 1 year ago

Hello, I was wondering whether you were able to run this code? I find that the code has some bugs.

deschanel11 commented 1 year ago

Me too. When I tried to run the training code on the AVA dataset using the MVIT_16X4.yaml file, I got an error about an unexpected keyword argument 'drop_rate'. I am also having trouble downloading the Something-Something V2 and SomethingElse datasets, because the download page returns a 503 error. Is there any way to solve these issues?