feinanshan / TDNet

Temporally Distributed Networks for Fast Video Semantic Segmentation
http://cs-people.bu.edu/pinghu/TDNet
MIT License

Do we actually need attention downsampling? #10

Closed theevann closed 4 years ago

theevann commented 4 years ago

Hello!

If I understood the paper correctly, the attention maps of previous frames are supposed to be propagated (and not recomputed) at each time step. Therefore, in practice, you should not need to recompute attention maps for previous frames, and you could always use full-resolution attention maps, couldn't you? If so, I don't see the point of the downsampling.

In the code this is not clear, because you actually compute (downsampled) attention maps for the previous frames. But again, the idea is not to recompute these in practice...

Could you explain?

feinanshan commented 4 years ago

Hi! As shown in Fig. 3 of the paper, different sub-networks are applied alternately over time, so at each frame a full feature representation needs to be recomposed from the previous several frames. To achieve this, full-resolution Q/K/V maps are computed (and reused) for each frame. The downsampling step is applied to reduce the computation in the propagation phase.
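For intuition, here is a minimal sketch of what "downsampling in the propagation phase" can look like: the current frame's queries stay at full resolution, while the cached keys/values of a previous frame are pooled spatially before the attention is computed, which shrinks the affinity matrix. This is only an illustration under assumed shapes and an assumed pooling factor, not the exact TDNet implementation.

```python
import torch
import torch.nn.functional as F

def propagate_with_downsampling(q_cur, k_prev, v_prev, ds=8):
    """Hypothetical propagation step (illustration only, not TDNet's exact code).

    q_cur:  (B, C, H, W)   current frame's queries, full resolution
    k_prev: (B, C, H, W)   cached keys of a previous frame
    v_prev: (B, Cv, H, W)  cached values of a previous frame
    ds:     assumed spatial downsampling factor for the keys/values
    """
    B, C, H, W = q_cur.shape
    Cv = v_prev.shape[1]

    # Only the attended-to keys/values are downsampled; queries stay full-res,
    # so the output keeps the full spatial resolution.
    k_ds = F.avg_pool2d(k_prev, kernel_size=ds, stride=ds)   # (B, C, H/ds, W/ds)
    v_ds = F.avg_pool2d(v_prev, kernel_size=ds, stride=ds)   # (B, Cv, H/ds, W/ds)

    q = q_cur.flatten(2).transpose(1, 2)                      # (B, H*W, C)
    k = k_ds.flatten(2)                                       # (B, C, h*w)
    v = v_ds.flatten(2).transpose(1, 2)                       # (B, h*w, Cv)

    # Affinity is (H*W) x (h*w) instead of (H*W) x (H*W), which is the saving.
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, Cv, H, W)
    return out
```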

Also note that there are differences between the training phase and the testing phase. In the demo code (under the 'testing' folder), you will see that each frame's Q/K/V maps are computed only once and then reused by several following frames.
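A rough sketch of that test-time reuse is below: frame t goes through one of the lightweight paths, its Q/K/V are computed once and pushed into a rolling cache, and the prediction for frame t is recomposed from the cached maps of the previous frames. The names `subnets`, `qkv_heads`, and `fuse` are placeholders for illustration, not the repo's actual API.

```python
from collections import deque

def segment_stream(frames, subnets, qkv_heads, fuse):
    """Hypothetical test-time loop (placeholders, not the repo's actual API).

    subnets:   list of path sub-networks applied in round-robin order
    qkv_heads: matching Q/K/V projection heads, one per path
    fuse:      attention-based recomposition (e.g. the sketch above)
    """
    n_paths = len(subnets)
    cache = deque(maxlen=n_paths - 1)        # rolling buffer of previous frames' K/V
    outputs = []
    for t, frame in enumerate(frames):
        path = t % n_paths                   # alternate sub-networks over time
        feat = subnets[path](frame)          # sub-feature from one lightweight path
        q, k, v = qkv_heads[path](feat)      # computed once per frame, then reused
        pred = fuse(q, feat, list(cache))    # recompose full representation via attention
        outputs.append(pred)
        cache.append((k, v))                 # this frame's K/V are never recomputed later
    return outputs
```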

zoucheng1991 commented 3 years ago

So, may I put it this way: if I use a shared subnet to extract features for the different paths, then I won't need to recompute the Q/K/V beyond the neighbouring frame?