Closed gongbudaizhe closed 7 years ago
First of all, I have to admit that CFNet is the first official publication (CVPR 2017) of an end-to-end learning framework for correlation filters (CF).
Compared to CFNet, which is end-to-end pre-trained on the same training dataset, DCFNet achieves a relative gain of 9.8% in AUC because it extracts features without resolution loss and carries out CF-based appearance modeling and tracking consistently in the frequency domain.
The feature extractor of our DCFNet never reduces resolution (stride = 1). On the surface this may look like a minor difference in network architecture design, but I think it is a very important factor for visual tracking: if there is no boundary effect, a DCF operation on dense features can be interpreted as an approximation of continuous convolution. Besides, I have done a lot of experiments on network architecture and resolution. From these experiments, we observe that decreasing the feature spatial resolution causes a large drop in AUC accuracy (33 < 63 < 125 < 169).
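To make the DCF step concrete, here is a minimal single-channel sketch in plain NumPy (MOSSE/KCF-style ridge regression solved in closed form in the Fourier domain; the patch size, label sigma, and regularizer `lam` are illustrative assumptions, not DCFNet's actual hyper-parameters):

```python
import numpy as np

def gaussian_label(sz, sigma=2.0):
    # Desired response: a Gaussian peak at the patch center.
    cy, cx = sz[0] // 2, sz[1] // 2
    yy, xx = np.mgrid[0:sz[0], 0:sz[1]]
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))

def train_dcf(x, y, lam=1e-4):
    # Closed-form ridge regression in the Fourier domain (single channel).
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)  # conjugate filter H*

def respond(H_conj, z):
    # Correlation response over all circular shifts of the search patch z.
    return np.real(np.fft.ifft2(H_conj * np.fft.fft2(z)))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))   # stand-in for a 32x32 dense feature map
y = gaussian_label(x.shape)
H = train_dcf(x, y)
r = respond(H, x)                   # evaluate on the training patch itself
print(np.unravel_index(r.argmax(), r.shape))  # peak sits at the center (16, 16)
```

The point about resolution is visible here: the response is computed densely at every spatial shift of the feature map, so a stride-1 feature map gives a correspondingly dense (fine-grained) localization grid.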
CFNet is an improved version of SiamFC. The filter CFNet learns is cropped to a small size (17x17) for time-domain correlation, which strongly harms performance. So far, I have not seen the CFNet source code; I guess the main reason for the crop operation is to stay consistent with SiamFC. (Just imagine:) even if the training image and the test image are the same image, the cropped filter may produce a bad response. For a standard CF (unlike SRDCF), there is no guarantee that the central part of the filter is more effective than the rest.
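A toy NumPy experiment can illustrate that intuition: learn a single-channel CF template, zero everything outside a central 17x17 window (mimicking a crop), and compare responses on the training image itself. This is a hypothetical sketch under the same single-channel assumptions as above, not CFNet's actual implementation:

```python
import numpy as np

# Hypothetical demo: cropping a learned CF template degrades the response
# even when the test image equals the training image.
rng = np.random.default_rng(1)
sz, crop = 64, 17
x = rng.standard_normal((sz, sz))              # stand-in feature map
c = sz // 2
yy, xx = np.mgrid[0:sz, 0:sz]
y = np.exp(-((yy - c) ** 2 + (xx - c) ** 2) / (2 * 2.0 ** 2))  # ideal label

X, Y = np.fft.fft2(x), np.fft.fft2(y)
H = Y * np.conj(X) / (X * np.conj(X) + 1e-4)   # closed-form CF (Fourier domain)
h = np.fft.fftshift(np.real(np.fft.ifft2(H)))  # spatial template, centered

h_crop = np.zeros_like(h)
lo, hi = c - crop // 2, c + crop // 2 + 1
h_crop[lo:hi, lo:hi] = h[lo:hi, lo:hi]          # keep only the central window

def response(template):
    # Circular correlation of the template with the training image.
    return np.real(np.fft.ifft2(np.fft.fft2(np.fft.ifftshift(template)) * X))

err_full = np.abs(response(h) - y).max()
err_crop = np.abs(response(h_crop) - y).max()
print(err_full, err_crop)  # cropping pushes the response away from the label
```

Since the full template's energy is spread over the whole patch, discarding everything outside the central window throws away most of it, and the response no longer matches the ideal Gaussian label.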
In general, CFNet is a very good paper with rigorous proofs and well-controlled experiments.
Hi,
It seems that your work is closely related to CFNet, yet your performance is much better: 0.624 vs 0.568 on OTB100. Can you elaborate on what makes such a big difference, since CFNet also uses VID as the training set and an exponential-decay learning-rate schedule?
Thanks