I was just wondering how you implement the spatiotemporal filters in the paper. It seems to me that all the convolution modules are Conv2d; based on my limited knowledge of PyTorch, the spatiotemporal filters should be Conv3d. Is there anything I'm misunderstanding or missing?
The images are all greyscale rather than RGB. Otherwise you would be correct. In our case, the input channel dimension (traditionally the RGB dimension) is the time dimension.
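To illustrate the idea, here is a minimal sketch (not the paper's actual code; the frame count and kernel size are made up): because the frames are greyscale, the channel axis of `nn.Conv2d` is free to hold time, so a single 2D convolution mixes every time step at each spatial location, acting as a spatiotemporal filter.

```python
import torch
import torch.nn as nn

# Hypothetical example: 8 greyscale frames stacked along the channel axis.
num_frames = 8
clip = torch.randn(1, num_frames, 64, 64)  # (batch, time, height, width)

# Conv2d over the stacked frames: each output channel is a learned
# filter over all 8 time steps and a 3x3 spatial neighbourhood,
# i.e. a spatiotemporal filter without any Conv3d.
spatiotemporal = nn.Conv2d(in_channels=num_frames, out_channels=16,
                           kernel_size=3, padding=1)

out = spatiotemporal(clip)
print(out.shape)  # -> torch.Size([1, 16, 64, 64])
```

With RGB input this trick would not work directly, since the channel axis would already be occupied by colour; that is why the greyscale assumption matters here.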