Sequential processing of video frames in backbone

Hi @Epiphqny,

I have a question regarding your implementation, specifically the way you pass raw data to the backbone. If the input is a video with 36 frames, how are they being processed by a ResNet-50/101?

In the forward method of the BackboneBase class, you pass the tensor list to the backbone, but the backbone appears to expect an input of size [64, 3, 7, 7]. As I understand from the paper, the videos are reshaped to [36x300x540], so there should be one more preprocessing step between the videos and the backbone.
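To make the shapes concrete, here is a small sketch of one possible reading. It assumes that [64, 3, 7, 7] is the weight shape of the ResNet's first convolution (out_channels, in_channels, kernel_h, kernel_w) rather than an input shape, and that the 36 frames are stacked along the batch dimension as [36, 3, 300, 540]. Both are my assumptions and are not confirmed from the repo. The helper conv2d_output_shape is hypothetical and only does shape arithmetic; stride 2 and padding 3 are the usual values for a ResNet stem.

```python
def conv2d_output_shape(input_shape, weight_shape, stride, padding):
    """Shape arithmetic for a plain 2D convolution (no dilation, no groups)."""
    n, c_in, h, w = input_shape
    c_out, c_in_w, k_h, k_w = weight_shape
    if c_in != c_in_w:
        raise ValueError(
            f"input has {c_in} channels but the weight expects {c_in_w}"
        )
    h_out = (h + 2 * padding - k_h) // stride + 1
    w_out = (w + 2 * padding - k_w) // stride + 1
    return (n, c_out, h_out, w_out)


# Assumption: the 36 frames are treated as a batch of 36 RGB images.
frames = (36, 3, 300, 540)
# Assumption: [64, 3, 7, 7] is the first conv layer's weight, not an input.
conv1_weight = (64, 3, 7, 7)

out_shape = conv2d_output_shape(frames, conv1_weight, stride=2, padding=3)
print(out_shape)  # (36, 64, 150, 270)
```

If this reading is right, no extra reshaping would be needed beyond stacking the frames, but I would like to confirm that this is what the code actually does.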

Can you shed some light on this extra step?

Thanks!
