MIV-XJTU / ARTrack

Apache License 2.0

test question #56

Closed NJiHyeon closed 5 months ago

NJiHyeon commented 5 months ago

At test time, suppose the model predicts [xmin, ymin, xmax, ymax] at frame t. When you predict the coordinates at frame t+1, do you feed in the value predicted at frame t? Or do you use only info['init_bbox'] to predict the coordinates at frame t+1?

AlexDotHam commented 5 months ago

In the first stage, we use only the start token, without any coordinate prompts. In the second stage, the prediction at frame t references the predictions from frame t-7 to frame t-1. When tracking a video sequence, we initialize all the coordinate prompts to init_bbox at the beginning and then update the prompts step by step.
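
For readers who want to see the second-stage prompt handling concretely, here is a minimal sketch of the update loop described above. It is not the ARTrack code: `init_prompts`, `track_sequence`, and the `predict` callback are hypothetical names, and the window of 7 is taken from the "t-7 to t-1" wording in the reply. The real model call is replaced by a toy function so the snippet runs on its own.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence

Box = List[float]  # [xmin, ymin, xmax, ymax]

# Assumption: 7 previous predictions (t-7 .. t-1), as stated in the reply.
NUM_PROMPTS = 7


def init_prompts(init_bbox: Sequence[float], num_prompts: int = NUM_PROMPTS) -> Deque[Box]:
    """Fill every prompt slot with the initial box at the start of a sequence."""
    return deque([list(init_bbox) for _ in range(num_prompts)], maxlen=num_prompts)


def track_sequence(
    frames: Iterable,
    init_bbox: Sequence[float],
    predict: Callable[[object, List[Box]], Box],
) -> List[Box]:
    """Run a (hypothetical) predictor over frames, updating prompts step by step."""
    prompts = init_prompts(init_bbox)
    outputs: List[Box] = []
    for frame in frames:
        box = predict(frame, list(prompts))  # prompts hold the last 7 boxes
        outputs.append(box)
        prompts.append(box)  # maxlen drops the oldest prompt automatically
    return outputs


def shift_last_box(frame: object, prompts: List[Box]) -> Box:
    """Toy stand-in for the model: move the most recent box 1 pixel to the right."""
    xmin, ymin, xmax, ymax = prompts[-1]
    return [xmin + 1.0, ymin, xmax + 1.0, ymax]


if __name__ == "__main__":
    for box in track_sequence(range(3), [0.0, 0.0, 10.0, 10.0], shift_last_box):
        print(box)
```

With this layout, the first few frames see mostly copies of init_bbox in the prompt window, and those copies are gradually replaced by actual predictions as tracking proceeds.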