MIV-XJTU / ARTrack

Apache License 2.0

About the appearance evolution implementation details of ARTrackV2 #62

Closed: MelanTech closed this issue 5 months ago

MelanTech commented 5 months ago

Hello, I ran into some issues with the appearance evolution strategy while reproducing ARTrackV2. The paper says that the reconstruction decoder reconstructs the target's appearance from a search-region feature map cropped according to the object's position. I have two questions:

  1. Does "the feature map of the search region cropped based on the object's position" mean cropping an 8x8 feature map centered on the ground-truth bbox?
  2. If so, how do you handle the case where the bbox center lies near the boundary of the feature map? There, a window of the full size cannot be cropped. Is zero padding used to fill in the missing border?
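For concreteness, the scheme hypothesized in point 2 (a fixed-size center crop with zero padding at the borders) might look like the sketch below. This is purely an illustration of the question, not ARTrackV2's actual code; the function name and shapes are assumptions.

```python
import numpy as np

def crop_center_window(feat, cy, cx, win=8):
    """Crop a win x win window from a (C, H, W) feature map centered at
    (cy, cx), zero-padding wherever the window extends past the border.
    Hypothetical illustration only, not the repo's implementation."""
    C, H, W = feat.shape
    half = win // 2
    out = np.zeros((C, win, win), dtype=feat.dtype)
    # source region, clipped to the feature map bounds
    y0, y1 = max(cy - half, 0), min(cy + half, H)
    x0, x1 = max(cx - half, 0), min(cx + half, W)
    # destination offsets inside the zero-initialized window
    dy, dx = y0 - (cy - half), x0 - (cx - half)
    out[:, dy:dy + (y1 - y0), dx:dx + (x1 - x0)] = feat[:, y0:y1, x0:x1]
    return out

feat = np.arange(1, 513, dtype=np.float32).reshape(2, 16, 16)
corner = crop_center_window(feat, cy=0, cx=0)  # window hangs off the top-left
# corner[0, 0, 0] is 0.0 (zero-padded); corner[0, 4, 4] equals feat[0, 0, 0]
```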

Thank you for providing the code; I look forward to your response.

AlexDotHam commented 5 months ago

We did indeed face the challenge you describe, and we adopted a different solution: we crop the object from the search region, then feed the cropped image to ViT's projector, which is identical to the template projector. The decoder's output features are then constrained to reconstruct the cropped image's features after the template projector. The benefits of this method are: 1) it directly aligns input and output in the autoregression; 2) it constructs a more challenging training task that probes the fine-grained content the tracker truly captures when extracting templates at the feature level. It is worth noting that we only perform reconstruction training during the sequence-training stage, which ensures that the template projector provided by the first stage is already good enough to extract fine-grained template features.
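The target construction described above can be sketched as follows: pass the object crop through a ViT-style patch-embedding projector and constrain the decoder's appearance tokens to match the resulting features. All module names and sizes here are illustrative assumptions, not ARTrackV2's real modules, and whether the projector is frozen at this stage is my reading of the comment above.

```python
import torch
import torch.nn as nn

# Stand-in for the shared template projector: a ViT-style patch embedding.
# Channel count and patch size are assumptions for illustration.
patch_embed = nn.Conv2d(3, 256, kernel_size=16, stride=16)

def reconstruction_loss(decoder_tokens, cropped_obj):
    """decoder_tokens: (B, N, C) appearance tokens output by the decoder.
    cropped_obj: (B, 3, 128, 128) object crop resized to template size."""
    with torch.no_grad():  # target comes from the (assumed frozen) projector
        target = patch_embed(cropped_obj)            # (B, C, 8, 8)
        target = target.flatten(2).transpose(1, 2)   # (B, 64, C) tokens
    return nn.functional.mse_loss(decoder_tokens, target)

tokens = torch.randn(2, 64, 256)
crop = torch.randn(2, 3, 128, 128)
loss = reconstruction_loss(tokens, crop)  # scalar reconstruction loss
```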

MelanTech commented 5 months ago

This is a great solution, but I have one more question. Does "send the cropped image to ViT's projector, which is equal to the template projector" mean replacing the template in the input data with the cropped image, sending it through the ViT backbone, and using the new template tokens as the reconstruction target? Or do you just feed the cropped image through the projector to obtain the reconstruction target?

AlexDotHam commented 5 months ago

During training, we use the cropped image as the reconstruction target when computing the loss. In the sequential training stage, when predicting frame 2, we update the appearance tokens using the prediction of frame 1, rather than using the ground truth to crop the search region and send it to the projector.
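The rollout described above can be sketched as a loop in which the appearance tokens for frame t always come from the *predicted* box of frame t-1, never from the ground truth. The functions below are toy stand-ins that only show the data flow; they are not the tracker's real modules.

```python
def rollout(frames, init_tokens, predict, crop_and_project):
    """Sequential rollout: appearance tokens for each frame are refreshed
    from the previous frame's prediction, never from the ground-truth box.
    Hypothetical data-flow sketch, not ARTrackV2's actual training loop."""
    tokens, boxes = init_tokens, []
    for frame in frames:
        box = predict(frame, tokens)              # autoregressive step
        boxes.append(box)
        tokens = crop_and_project(frame, box)     # update from the prediction
    return boxes

# Toy scalar stand-ins just to make the flow executable.
predict = lambda frame, tokens: frame + tokens
crop_and_project = lambda frame, box: box * 0.5
boxes = rollout([1, 2, 3], init_tokens=0.0,
                predict=predict, crop_and_project=crop_and_project)
# boxes is [1.0, 2.5, 4.25]: each step depends on the previous prediction
```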

MelanTech commented 5 months ago

Okay, I understand. Thank you for your patient and meticulous answer!