We did not train UVLTrack on modality-specific datasets. UVLTrack aims to unify visual and vision-language tracking in a single model. As a result, UVLTrack can use references of different modalities to train the model, which is an advantage of our method. JointNLT, by contrast, is designed specifically for vision-language tracking and cannot be trained (without modifying some modules) on datasets that lack language descriptions, such as TrackingNet.
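To make the idea concrete, here is a minimal sketch (not UVLTrack's actual code) of how a unified tracker could mix visual-only and vision-language datasets in one training run; all names here (`TrackSample`, `collate_reference`, `sample_training_batch`, `vl_ratio`) are hypothetical and chosen only for illustration — the key point is that visual-only samples simply fall back to an empty language reference:

```python
import random
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sample container: one tracking pair plus an optional language description.
@dataclass
class TrackSample:
    template: Any                      # template image crop
    search: Any                        # search-region image crop
    bbox: tuple                        # ground-truth box in the search region
    description: Optional[str] = None  # None for visual-only datasets (e.g. GOT-10k, TrackingNet)

def collate_reference(sample: TrackSample, empty_token: str = "") -> dict:
    """Build the modal reference for one sample.

    Visual-only samples fall back to an empty language string, so the same
    unified model can consume both kinds of data in one training run.
    """
    return {
        "template": sample.template,
        "search": sample.search,
        "bbox": sample.bbox,
        "language": sample.description if sample.description is not None else empty_token,
        "has_language": sample.description is not None,
    }

def sample_training_batch(visual_datasets, vision_language_datasets,
                          batch_size=8, vl_ratio=0.5):
    """Draw a mixed batch: roughly `vl_ratio` of the samples come from
    vision-language datasets, the rest from visual-only datasets."""
    batch = []
    for _ in range(batch_size):
        pool = vision_language_datasets if random.random() < vl_ratio else visual_datasets
        dataset = random.choice(pool)
        batch.append(collate_reference(random.choice(dataset)))
    return batch
```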
The performance of your work is very impressive! In your paper, you said UVLTrack was trained on GOT-10k, COCO2017, TrackingNet, and other datasets. But other vision-language trackers like JointNLT were trained only on LaSOT, OTB, TNL2K, and RefCOCOg-google. I want to know: did you train on just these datasets, like JointNLT, and compare the performance with JointNLT? In your paper, you say JointNLT does not perform very well on LaSOT without natural language. Did you retrain JointNLT on datasets like TrackingNet? TrackingNet is very large.