facebookresearch / pytorchvideo

A deep learning library for video understanding research.
https://pytorchvideo.org/
Apache License 2.0

RoI head for Vision Transformer #202

Open yuxin212 opened 2 years ago

yuxin212 commented 2 years ago

🚀 Feature

Please consider adding an RoI head for Vision Transformer so that Vision Transformer backbones can be used for action detection.

Motivation

MViT performs better on the AVA dataset than conv-net-based methods such as Slow/ResNet. However, RoI heads are currently implemented only for Slow and SlowFast.

Pitch

A function or class, similar to the ResNet RoI head, that creates an RoI head for Vision Transformer.

vkrishnamurthy11 commented 1 year ago

Took a brief look at this.

I think we could use the RoI code found in here

There are some differences but I think it's a good starting point. Curious to know your thoughts!