MIV-XJTU / ARTrack

About the implementation details of ARTrackV2 #54

Closed Kou-99 closed 5 months ago

Kou-99 commented 5 months ago

Hi, thanks for your inspiring work! I am trying to reimplement ARTrackV2 for my recent project. I've encountered some uncertainties regarding certain implementation details. I'm hoping you could provide some guidance on the following points:

  1. Identity embedding. I'm curious about how identity embedding is handled within ARTrackV2. Specifically, are different identity embeddings assigned for appearance tokens, confidence tokens, and trajectory tokens? Additionally, is the identity token used for trajectories the same as the command token?
  2. Appearance tokens. Could you shed some light on the length of appearance tokens? Are they expected to be of the same length as the template?
  3. Positional Embedding. I'm interested in understanding how positional embedding is initialized for appearance tokens, confidence tokens, trajectory tokens, and command tokens in the second stage of training (sequence-level training).
  4. Model structure of the reconstruction decoder. It would be helpful to have insights into the model structure of the reconstruction decoder. Specifically, details such as the number of layers, number of heads, etc.

Additionally, given that the code cannot be made public in the near future, would it be possible for you to share training logs or intermediate results (such as the accuracy of the frame-level pretrained model) to help us validate our implementation?

Your assistance in clarifying these points and providing further insights would be greatly appreciated. Thank you in advance for your time and support!

AlexDotHam commented 5 months ago

For question 1: The identity embedding is a [5, dim] tensor, with one row each for the search region, template, appearance tokens, confidence token, and trajectory tokens, to help the encoder distinguish the different token types.

For question 2: Yes, the appearance tokens have the same length as the template, which makes alignment for the MSE loss convenient.

For question 3: We simply use learnable tokens with random initialization; we tried Xavier and Kaiming initialization, but the influence was limited.

For question 4: The reconstruction decoder's structure is the same as MAE's. When using ViT-Base or ViT-Large, we directly reuse the pre-trained parameters of the corresponding MAE decoder.

Moreover, I am not sure whether I can share the training logs, but I can tell you the accuracy after the first stage. For convenience, we only report the GOT-10k-train-only performance: ARTrack_{b256} AO: 73.1%, ARTrack_{b384} AO: 74.9%, ARTrack_{l384} AO: 76.9%. If you want to reproduce it, I think you must add 3 layers of self-attention after the ViT and make sure the backbone layers use a smaller learning rate, either a 0.1x multiplier or a 0.9x layer-wise decay (see the sketches below).
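To make the answer to question 1 concrete, here is a minimal PyTorch sketch of a [5, dim] identity embedding added to five token groups before the encoder. Everything except the [5, dim] shape and the five group names (module name, token shapes, dictionary interface) is an assumption for illustration, not the ARTrackV2 code.

```python
import torch
import torch.nn as nn


class TokenIdentityEmbedding(nn.Module):
    """Minimal sketch of the [5, dim] identity embedding described above.

    One learnable row per token group (search, template, appearance,
    confidence, trajectory); each row is broadcast-added to every token of
    that group before the tokens enter the encoder.
    """

    GROUPS = ("search", "template", "appearance", "confidence", "trajectory")

    def __init__(self, dim: int = 768):
        super().__init__()
        # [5, dim] identity table, randomly initialized (the reply above notes
        # that Xavier/Kaiming initialization made little difference).
        self.identity = nn.Parameter(torch.randn(len(self.GROUPS), dim) * 0.02)

    def forward(self, tokens: dict) -> dict:
        # tokens: group name -> tensor of shape [B, N_group, dim]
        return {
            name: tokens[name] + self.identity[idx]
            for idx, name in enumerate(self.GROUPS)
            if name in tokens
        }


if __name__ == "__main__":
    B, dim, n_template = 2, 768, 64
    embed = TokenIdentityEmbedding(dim)
    tokens = {
        "search": torch.randn(B, 256, dim),
        "template": torch.randn(B, n_template, dim),
        # Appearance tokens share the template length (question 2), so the
        # MSE reconstruction loss can be computed token-to-token.
        "appearance": torch.randn(B, n_template, dim),
        "confidence": torch.randn(B, 1, dim),
        "trajectory": torch.randn(B, 4, dim),
    }
    tokens = embed(tokens)
    print({k: tuple(v.shape) for k, v in tokens.items()})
```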
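And a small sketch of the fine-tuning detail at the end of the reply: giving the ViT backbone either a flat 0.1x learning rate or a 0.9 per-layer decay while the newly added layers train at the full rate. The helper function and the "backbone." / ".blocks." parameter-name prefixes are hypothetical, chosen only to show the grouping logic.

```python
import torch
import torch.nn as nn


def build_param_groups(model: nn.Module,
                       base_lr: float = 4e-4,
                       backbone_scale: float = 0.1,
                       layer_decay: float = 0.9,
                       num_backbone_layers: int = 12,
                       use_layer_decay: bool = True):
    """Hypothetical helper: give the ViT backbone a smaller learning rate.

    Two options from the reply above: a flat 0.1x multiplier on every
    backbone parameter, or a per-layer decay of 0.9 so deeper blocks keep a
    larger fraction of the base learning rate.
    """
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith("backbone."):
            if use_layer_decay:
                if ".blocks." in name:
                    # e.g. "backbone.blocks.7.attn.qkv.weight" -> block index 7
                    layer_id = int(name.split(".blocks.")[1].split(".")[0]) + 1
                else:
                    layer_id = 0  # patch/positional embeddings sit below block 0
                scale = layer_decay ** (num_backbone_layers - layer_id)
            else:
                scale = backbone_scale  # flat 0.1x on the whole backbone
        else:
            scale = 1.0  # new heads / extra self-attention layers after the ViT
        groups.append({"params": [param], "lr": base_lr * scale})
    return groups


if __name__ == "__main__":
    # Tiny stand-in model, just to exercise the grouping logic.
    model = nn.Module()
    model.backbone = nn.Module()
    model.backbone.blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(12))
    model.head = nn.Linear(8, 4)
    optimizer = torch.optim.AdamW(build_param_groups(model), weight_decay=1e-4)
    print(len(optimizer.param_groups), "param groups")
```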

If you have any other questions, feel free to ask me here or by email; I will try my best to help.

Kou-99 commented 5 months ago

Thank you for your quick and detailed responses!