Epiphqny / VisTR

[CVPR2021 Oral] End-to-End Video Instance Segmentation with Transformers
https://arxiv.org/abs/2011.14503
Apache License 2.0
739 stars 95 forks

Sequential processing of video frames in backbone #73

Open xserban opened 2 years ago

xserban commented 2 years ago

Hi @Epiphqny,

I have a question regarding your implementation, specifically the way you pass raw data to the backbone. If the input is a video with 36 frames, how are they being processed by a ResNet-50/101?

In the forward method of the BackboneBase class, you pass the tensor list to the backbone, but the backbone seems to expect an input of size [64, 3, 7, 7]. As I understand from the paper, the videos are reshaped to [36x300x540], so there should be one more preprocessing step between the videos and the backbone.
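To make the shapes concrete, here is a minimal sketch of the arithmetic in plain Python. It assumes the standard ResNet stem (a 7x7 convolution with 64 output channels, stride 2, padding 3), for which [64, 3, 7, 7] would be the layout of the first convolution's weight rather than of its input. It also assumes, without confirmation from the repository, that the 36 RGB frames are stacked along the batch axis. The helper name `conv2d_output_shape` is hypothetical and not part of VisTR.

```python
def conv2d_output_shape(input_shape, weight_shape, stride, padding):
    """Return the [N, C_out, H_out, W_out] shape of a 2D convolution.

    input_shape:  [N, C_in, H, W]
    weight_shape: [C_out, C_in, kH, kW]
    """
    n, c_in, h, w = input_shape
    c_out, weight_c_in, k_h, k_w = weight_shape
    if c_in != weight_c_in:
        raise ValueError(
            f"channel mismatch: input has {c_in}, weight expects {weight_c_in}"
        )
    # Standard convolution output-size formula (dilation 1).
    h_out = (h + 2 * padding - k_h) // stride + 1
    w_out = (w + 2 * padding - k_w) // stride + 1
    return [n, c_out, h_out, w_out]


# Assumption: 36 RGB frames of 300x540 stacked along the batch axis.
frames_shape = [36, 3, 300, 540]
# Standard ResNet stem weight layout: 64 filters, 3 input channels, 7x7 kernel.
stem_weight_shape = [64, 3, 7, 7]

stem_output_shape = conv2d_output_shape(
    frames_shape, stem_weight_shape, stride=2, padding=3
)
print(stem_output_shape)  # [36, 64, 150, 270]
```

Under these assumptions each frame would pass through the backbone like one image in a batch of 36, with no extra reshaping step. Whether VisTR actually does this is exactly what I would like to confirm.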

Can you shed some light on this extra step?

Thanks!