Closed chenrxi closed 2 years ago
I have noticed these differences. Can you share your thinking on them? @chenrxi
(1) The reason we multiply by (1-target) is to reduce training ambiguity. 'target' is produced by a Gaussian function, so the closer a pixel lies to the previous center, the closer (1-target) gets to zero. We want to penalize the pixels around the previous center less, because those nearby pixels may also belong to the corresponding object and should not simply be treated as negative samples.
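A minimal numpy sketch of this weighting idea (the grid size, center, and sigma below are made-up values, not the repo's settings):

```python
import numpy as np

def gaussian_target(h, w, cy, cx, sigma=2.0):
    """Gaussian heatmap peaked at the previous object center (cy, cx)."""
    ys, xs = np.ogrid[:h, :w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# Hypothetical 8x8 grid with the previous center at (4, 4).
target = gaussian_target(8, 8, 4, 4)
weight = 1.0 - target  # ~0 at the center, ~1 far away

# A negative-sample loss of the form `weight * per_pixel_loss` therefore
# barely penalizes pixels near the previous center, while pixels far from
# it keep (almost) their full penalty.
print(weight[4, 4], weight[0, 0])
```

So a pixel one step away from the center is only lightly penalized, while a distant pixel is penalized nearly in full.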
(2) In the paper, we state: "We optionally incorporate the residual feature as the input to provide more motion clues." The results slightly degrade after removing the residual feature.
(1) Formula (4) shows how the CVA loss is calculated, but it differs from how the loss is computed in your code (see ./lib/model/losses.py). In the code, you first max-pool the attention matrix along the H and W dimensions, then multiply by (1-target) and apply softmax, and finally index the position of the previous location in the resulting vector to compute the loss. Why did you choose the latter method? (2) In the paper, you say that you use tracking information to track, where that information is the tracking offset computed by the CVA module. Why did you also add the feature difference (feat diff)? What happens to the results if feat diff is removed?
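For reference, here is one possible reading of the pipeline described in question (1), as a numpy sketch. All shapes, the previous center (cy, cx), and sigma are illustrative assumptions, not the values from losses.py, and the map here covers a single query pixel rather than the full batched tensor:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical attention/cost map C of shape (H, W) for one query pixel.
H, W = 4, 4
rng = np.random.default_rng(0)
C = rng.standard_normal((H, W))

# Step 1: max-pool along H and along W, giving a W-vector and an H-vector.
c_w = C.max(axis=0)  # (W,)
c_h = C.max(axis=1)  # (H,)

# Step 2: weight by (1 - target) along each axis, where target is a 1D
# Gaussian around the assumed previous center (cy, cx), then softmax.
cy, cx = 2, 1
sigma = 2.0
t_h = np.exp(-((np.arange(H) - cy) ** 2) / (2 * sigma ** 2))
t_w = np.exp(-((np.arange(W) - cx) ** 2) / (2 * sigma ** 2))
p_h = softmax(c_h * (1.0 - t_h))
p_w = softmax(c_w * (1.0 - t_w))

# Step 3: index the previous-center position and take a negative
# log-likelihood style loss on those two entries.
loss = -np.log(p_h[cy]) - np.log(p_w[cx])
print(float(loss))
```

This separable (per-axis) formulation only supervises two 1D distributions instead of the full HxW map, which is presumably cheaper than evaluating formula (4) over the whole grid.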