Closed by theevann 4 years ago
Hi! As shown in Fig. 3 of the paper, different subnetworks are alternately applied over time, so at each frame a full feature representation has to be recomposed from the previous several frames. To achieve this, full-resolution Q/K/V maps are computed for each frame (and reused). The downsampling step is applied to reduce the computation in the propagation phase.
Also note that there are differences between the training phase and the testing phase. In the demo code (under the "testing" folder), you will see that each frame's Q/K/V maps are computed only once and then reused by several following frames.
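Just to illustrate the caching idea described above, here is a minimal NumPy sketch. The feature maps, the linear Q/K/V heads, the window size, and all function names here are toy stand-ins, not the actual modules from this repo; the only point is that each frame's Q/K/V maps are computed once, cached, and reused by the attention-based fusion of later frames:

```python
import numpy as np

CALLS = {"qkv": 0}   # counts how often the (toy) Q/K/V heads actually run
cache = {}           # frame index -> cached (Q, K, V)

def compute_qkv(t, feat):
    """Run toy Q/K/V heads for frame t, or reuse the cached result."""
    if t not in cache:
        CALLS["qkv"] += 1
        rng = np.random.default_rng(t)  # deterministic toy projection weights
        Wq, Wk, Wv = (rng.standard_normal((feat.shape[1], 4)) for _ in range(3))
        cache[t] = (feat @ Wq, feat @ Wk, feat @ Wv)
    return cache[t]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(t, frames, window=3):
    """Recompose a full representation for frame t by attending over the
    cached K/V maps of itself and the previous frames in the window."""
    Q, _, _ = compute_qkv(t, frames[t])
    out = np.zeros_like(Q)
    prev = range(max(0, t - window + 1), t + 1)
    for s in prev:
        _, K, V = compute_qkv(s, frames[s])  # cache hit after first use
        out += softmax(Q @ K.T) @ V
    return out / len(prev)

frames = [np.random.default_rng(t).standard_normal((6, 8)) for t in range(5)]
outs = [fuse(t, frames) for t in range(5)]
# Each frame's Q/K/V were computed once, even though every frame
# appears in several fusion windows:
assert CALLS["qkv"] == len(frames)
```

At training time, by contrast, each batch would recompute these maps, which is where the downsampling pays off.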
So may I put it this way: if I use a shared subnet to extract features from the different paths, then I won't need to recompute the Q/K/V beyond the neighbouring frame?
Hello!
If I understood the paper correctly, the attention maps of previous frames are supposed to be propagated (not recomputed) at each time step. Therefore, in practice, you should never need to recompute attention maps for previous frames, and you could always use full-resolution attention maps, couldn't you? If so, I don't see the point of the downsampling.
This is not clear from the code, because there you actually compute (downsampled) attention maps for the previous frames. But again, the idea is not to recompute these in practice...
Could you explain?