The question is more about the paper content. Am I right that in the TD4 version you are literally using four different PSPNets with a ResNet-34 backbone? So for every new frame of the video stream you have to re-process the three previous frames together with the current one, instead of just reusing the features computed in the previous iteration? Is it crucial to use four separate instances of the network? Have you tried using a single instance of the encoder (at least, or of the whole segmentation model) and, for each new frame, reusing the already computed features of the previous frames, only reassigning them to the different attention module instances?
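To make the suggestion concrete, here is a minimal PyTorch sketch of the caching scheme I have in mind. This is not from the TDNet code; `CachedEncoderWrapper`, the `encoder` argument, and the window size of 4 are hypothetical names and assumptions for illustration only:

```python
import collections
import torch
import torch.nn as nn

class CachedEncoderWrapper(nn.Module):
    """Hypothetical sketch: one shared encoder plus a rolling feature cache,
    so each incoming frame is encoded exactly once instead of four times."""

    def __init__(self, encoder: nn.Module, num_frames: int = 4):
        super().__init__()
        self.encoder = encoder  # a single shared encoder instance
        self.num_frames = num_frames
        # Rolling cache of per-frame feature maps, oldest -> newest
        self.cache = collections.deque(maxlen=num_frames)

    @torch.no_grad()
    def forward(self, frame: torch.Tensor):
        feat = self.encoder(frame)  # encode only the newest frame
        self.cache.append(feat)
        # At stream start, pad by repeating the oldest available features
        while len(self.cache) < self.num_frames:
            self.cache.appendleft(self.cache[0])
        # Return the window so position i can be routed to the i-th
        # attention module instance downstream
        return list(self.cache)
```

Under this scheme, each call costs one encoder pass instead of `num_frames` passes, and the per-position attention modules would simply consume shifted slots of the cache. Is there a reason (e.g., the sub-networks being intentionally different, or training-time constraints) why this kind of feature reuse would not work with your architecture?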