In fact, we faced exactly the challenge you describe, and we adopted a different solution: crop the object from the search region, then send the cropped image through the ViT's projector, which is identical to the template projector. The output feature is thus constrained to reconstruct the cropped image's feature after it passes through the template projector. The benefits of this method are: 1) it implements direct alignment of input and output within the autoregression; 2) it constructs a more challenging training task that forces the tracker to capture, at the feature level, the fine-grained content it actually extracts when building templates. It is worth noting that we only perform this reconstruction training during the sequence-training stage, which relies on the template projector from the first stage already being good enough to extract fine-grained template features.
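To make the alignment concrete, here is a minimal numpy sketch of the idea described above. The function and variable names (`crop_target`, `template_projector`, `reconstruction_loss`, the weight matrix `weight`) are hypothetical stand-ins, not the actual ARTrackV2 code; the projector is modeled as a simple flatten-plus-linear map, and the loss is a plain L2 between the tracker's predicted appearance feature and the projected feature of the crop:

```python
import numpy as np

def crop_target(search_region, box):
    """Crop the object from the search region given an (x, y, w, h) box."""
    x, y, w, h = box
    return search_region[y:y + h, x:x + w]

def template_projector(patch, weight):
    """Stand-in for the shared template projector: flatten + linear map."""
    return patch.reshape(-1) @ weight

def reconstruction_loss(predicted_feature, search_region, box, weight):
    """L2 loss between the tracker's predicted appearance feature and the
    projected feature of the crop (the reconstruction target)."""
    target = template_projector(crop_target(search_region, box), weight)
    return float(np.mean((predicted_feature - target) ** 2))
```

Because the same projector produces both the template tokens and the reconstruction target, the loss directly ties the autoregressive output to the input feature space.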
This is a great solution, but I have one more question. Does "send the cropped image to ViT's projector, which is identical to the template projector" mean replacing the template in the input data with the cropped image, sending it through the ViT backbone, and using the resulting template token as the reconstruction target? Or is the cropped image simply fed through the projector on its own to obtain the reconstruction target?
During training, we use the cropped image as the reconstruction target to compute the loss. In the sequential training stage, when predicting frame 2, we update the appearance tokens using the prediction from frame 1 rather than the ground truth: the predicted box crops the search region, and the crop is then sent through the projector.
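The autoregressive update described above can be sketched as a short rollout loop. Again this is an illustrative sketch, not the repository's code: `predict_box` and `projector` are hypothetical callables, and the key point is that each frame's crop uses the tracker's own predicted box from the rollout, never the ground-truth box:

```python
import numpy as np

def crop(frame, box):
    """Crop a frame with an (x, y, w, h) box."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def rollout_appearance_tokens(frames, init_box, predict_box, projector):
    """Autoregressive appearance evolution: the crop for each frame uses the
    box PREDICTED by the tracker (conditioned on the previous box), not the
    ground-truth box, matching test-time behavior."""
    box = init_box
    tokens = []
    for frame in frames:
        box = predict_box(frame, box)               # tracker's prediction for this frame
        tokens.append(projector(crop(frame, box)))  # updated appearance tokens
    return tokens
```

Feeding predictions back in, instead of ground truth, is what exposes the model to its own errors during the sequence stage.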
Okay, I understand. Thank you for your patient and meticulous answer!
Hello, I encountered some issues with the appearance evolution strategy while reproducing ARTrackV2. The paper mentions that the reconstruction decoder reconstructs the target's appearance from a search-region feature map cropped around the object's position. I have two questions:
Thank you for providing the code; looking forward to your response.