🚀 Feature
Please consider adding an RoI head for Vision Transformer models, so that they can be used for action detection.
Motivation
MViT performs better on the AVA dataset than conv-net-based methods such as Slow/ResNet. However, RoI heads are currently implemented only for Slow and SlowFast.
Pitch
A function or class, similar to the ResNet RoI head, that creates an RoI head for Vision Transformer.