VISION-SJTU / USOT

[ICCV2021] Learning to Track Objects from Unlabeled Videos

why detach(stop-gradient) the boxes for PrPooling #1

Closed florinshen closed 3 years ago

florinshen commented 3 years ago

The contribution of precise ROI pooling (PrPool) can be summarized as: it pools more precise ROI features, and it can backpropagate gradients to the box coordinates.

We found that you detach the boxes before pooling in the training pipeline at https://github.com/VISION-SJTU/USOT/blob/main/lib/models/models.py#L273. That is somewhat confusing for a method that uses cycle consistency to train an unsupervised tracking model. If the boxes were not detached, the model might acquire more information from video-level training and be trained better. With this part detached, the model looks like a conventional pair-wise trained Siamese tracking model. So may we ask: why is this part detached, and what happens if it is not?
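To make the question concrete, here is a minimal sketch of what detaching the boxes changes. It does not use the repository's PrPool: `pool_box` is a hypothetical stand-in built on `torch.nn.functional.grid_sample`, which is also differentiable with respect to the sampling coordinates, and the feature map and box values are made up for illustration.

```python
import torch
import torch.nn.functional as F


def pool_box(feat, box, out_size=3):
    """Bilinear box pooling, a simplified stand-in for PrPool (assumption).

    feat: (1, C, H, W) feature map.
    box:  (4,) tensor (x1, y1, x2, y2) in normalized [-1, 1] coordinates.
    """
    x1, y1, x2, y2 = box
    steps = torch.linspace(0.0, 1.0, out_size)
    xs = x1 + (x2 - x1) * steps  # sample columns inside the box
    ys = y1 + (y2 - y1) * steps  # sample rows inside the box
    grid_x = xs.unsqueeze(0).expand(out_size, out_size)
    grid_y = ys.unsqueeze(1).expand(out_size, out_size)
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, out, out, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)


torch.manual_seed(0)
feat = torch.randn(1, 8, 16, 16, requires_grad=True)  # stands in for backbone features
box = torch.tensor([-0.5, -0.4, 0.3, 0.6], requires_grad=True)  # stands in for a predicted box

# Without detach: the loss reaches both the features and the box coordinates.
pool_box(feat, box).sum().backward()
grad_attached = box.grad.clone()

# With detach: the features still receive gradient, the box coordinates do not.
box.grad = None
pool_box(feat, box.detach()).sum().backward()
grad_detached = box.grad  # stays None, the box is cut out of the graph

print("box grad without detach:", grad_attached)
print("box grad with detach:   ", grad_detached)
```

In this toy setup, detaching only removes the path from the pooled features back to the box coordinates; the feature map is still trained through the pooled output.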

zhengjilai commented 3 years ago

Thanks for your great question.

The basic inference logic of a bbox-regression-based Siamese tracker is that only one output bbox is used as the final bounding box: the box on the regression map at the position with the highest score on the classification map. So the main reason for detaching the bbox is that we think it is not proper to pass gradient from only one spatial position of the regression map when enforcing cycle consistency. Note that when the regression map is trained as in conventional Siamese networks, gradients are passed backward from all bboxes whose spatial positions lie inside the (pseudo) ground-truth bbox, so not detaching the cycle gradient for bboxes may cause a misalignment problem in this respect.
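The difference in gradient coverage can be sketched with a toy example. This is only an illustration under assumed shapes (a 25 * 25 map with 4 regression channels), random values, and a hypothetical 9 * 9 pseudo ground-truth region; it is not the USOT training code.

```python
import torch

torch.manual_seed(0)
size = 25
cls_map = torch.randn(size, size)                         # classification scores
reg_map = torch.randn(4, size, size, requires_grad=True)  # one box (4 values) per position

# Inference-style selection: keep only the box at the highest-scoring position.
best = torch.argmax(cls_map).item()
row, col = divmod(best, size)
chosen_box = reg_map[:, row, col]

# Any loss built on the chosen box alone touches a single spatial position.
chosen_box.sum().backward()
positions_cycle = int((reg_map.grad.abs().sum(dim=0) > 0).sum())

# Conventional regression loss: every position inside the ground-truth
# region is supervised (the region below is hypothetical).
reg_map.grad = None
inside = torch.zeros(size, size, dtype=torch.bool)
inside[8:17, 8:17] = True
target = torch.zeros(4)
loss = (reg_map[:, inside] - target[:, None]).abs().mean()
loss.backward()
positions_conventional = int((reg_map.grad.abs().sum(dim=0) > 0).sum())

print("positions with gradient (single chosen box):", positions_cycle)
print("positions with gradient (conventional loss):", positions_conventional)
```

In this sketch the loss on the chosen box sends gradient to one position out of 625, while the conventional loss reaches every position inside the assumed region.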

One fact I want to share is that cycle consistency can be trained in either a stop-gradient or a non-stop-gradient manner. The former is the implementation philosophy of UDT, and the latter is the training paradigm of TimeCycle. In TimeCycle, gradient is passed backward both through the three output parameters of the sampling grid (somewhat like the bbox of a tracking result) and through the deep features. To satisfy this fully end-to-end design, these three output parameters are generated from the whole affinity matrix (only 3 elements are output, and all of them pass gradient backward). This differs from the Siamese design, where 25 * 25 * 4 elements at 25 * 25 spatial positions are output for bboxes and only one of those bboxes is finally chosen. You may refer to these papers for details.

We do not know what would happen if the detachment were removed. However, we suspect that relying on the gradient from only a single position on the regression map is not enough.