Closed — Kou-99 closed this issue 5 months ago
For question 1: The identity embedding is a `[5, dim]` tensor with one row each for the search, template, appearance, confidence, and trajectory tokens; it helps the encoder distinguish each token type.

For question 2: Yes, the appearance tokens have the same length as the template tokens, which makes alignment for the MSE loss convenient.

For question 3: We only use learnable tokens with random initialization; we tried Xavier and Kaiming initialization, but they had limited influence.

For question 4: The reconstruction decoder has the same structure as MAE's. When using ViT-Base or ViT-Large, we directly reuse the corresponding pre-trained MAE decoder parameters.

Moreover, I am not sure whether I can share the training logs, but I can tell you the first-stage accuracy. For convenience, here is the GOT-10k-train-only performance:

- ARTrack_b256: AO 73.1%
- ARTrack_b384: AO 74.9%
- ARTrack_l384: AO 76.9%

If you want to reproduce it, you must add 3 layers of self-attention after the ViT, and make sure the backbone layers use a smaller learning rate, with a 0.1x linear decay or 0.9x layer decay.
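To make the answer to question 1 concrete, here is a minimal sketch of how a `[5, dim]` identity embedding could tag each token group before they are concatenated for the encoder. All names and shapes here are my assumptions for illustration, not taken from the ARTrackV2 code:

```python
import numpy as np

# Hypothetical sketch: one learnable row per token type
# (search, template, appearance, confidence, trajectory).
NUM_TOKEN_TYPES = 5

def add_identity_embeddings(token_groups, id_embed):
    """Tag each token group with its identity row, then concatenate.

    token_groups: list of 5 arrays, each shaped [batch, n_i, dim]
    id_embed:     [5, dim] table, one randomly initialized row per token type
    """
    tagged = [tokens + id_embed[i] for i, tokens in enumerate(token_groups)]
    return np.concatenate(tagged, axis=1)  # [batch, sum(n_i), dim]
```

In a real model `id_embed` would be an `nn.Parameter` updated jointly with the encoder; the point is only that each group receives a distinct, learned offset.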
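The 0.9x layer decay for the backbone learning rate can be sketched as follows (a hypothetical helper, not the actual ARTrackV2 schedule):

```python
def layer_lr_scales(num_layers: int, decay: float = 0.9) -> list:
    """Per-layer learning-rate multipliers for layer-wise decay.

    The layer closest to the head (index num_layers) keeps the full
    rate (1.0); each earlier backbone layer is scaled down by `decay`,
    so the embedding layer (index 0) gets decay ** num_layers.
    """
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]
```

With `decay=0.9` and a 12-layer ViT the embedding layer trains at roughly 0.28x the base rate; the flat 0.1x backbone multiplier mentioned above is the simpler alternative, applied uniformly to all backbone layers.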
If you have any other questions, feel free to ask me here or by email; I will try my best to help.
Thank you for your quick and detailed responses!
Hi, thanks for your inspiring work! I am trying to reimplement ARTrackV2 for my recent project. I've encountered some uncertainties regarding certain implementation details. I'm hoping you could provide some guidance on the following points:
Additionally, given that the code cannot be made public in the near future, would it be possible for you to share training logs or intermediate results (such as the accuracy of the frame-level pretrained model) to help us validate our implementation?
Your assistance in clarifying these points and providing further insights would be greatly appreciated. Thank you in advance for your time and support!