MIV-XJTU / ARTrack

Apache License 2.0

I have a question. #38

Closed NJiHyeon closed 7 months ago

NJiHyeon commented 8 months ago

Hello, thank you so much for sharing a good model. I left a message like this because I had a question while studying the code.

Why, after taking a search/template image and extracting features with the ViT, do you "encode each feature again with the encoder in the head" instead of passing the features directly to the decoder?

I'm asking because this is not included in the paper. Thank you.

ARTrackV2 commented 8 months ago

It's for a more stable training process. If you want to get the same accuracy without using this extra encoder, you can use an attention mask as in ARTrackV2; I think that is the best solution.
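To make the attention-mask alternative concrete, here is a minimal sketch (my own illustration, not the actual ARTrackV2 code) of a mask in which image-feature tokens are mutually visible while coordinate tokens attend only to the features and to earlier coordinate tokens:

```python
import torch

def build_causal_mask(num_feat_tokens: int, num_coord_tokens: int) -> torch.Tensor:
    """Boolean attention mask of shape (N, N), True = "may attend".

    Feature tokens see each other; each coordinate token sees all feature
    tokens plus only the coordinate tokens that precede it (causal order).
    """
    n = num_feat_tokens + num_coord_tokens
    mask = torch.ones(n, n, dtype=torch.bool)
    # Causal (lower-triangular) restriction among the coordinate tokens.
    coord = torch.tril(torch.ones(num_coord_tokens, num_coord_tokens, dtype=torch.bool))
    mask[num_feat_tokens:, num_feat_tokens:] = coord
    # Feature tokens do not attend to coordinate tokens.
    mask[:num_feat_tokens, num_feat_tokens:] = False
    return mask
```

Such a mask lets one forward pass train all coordinate positions while still respecting the autoregressive order, which is why it can replace the extra encoder.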

NJiHyeon commented 7 months ago

Thank you for your response!! I have another question. When seqs_input in the ARTrack head is not None, it seems to be the training stage, so why is the `'feat': at` part used all at once, instead of predicting the coordinates sequentially as in val/test?

AlexDotHam commented 7 months ago

Training sequentially would make GPU memory usage grow without bound, so we detach each batch along the temporal dimension.
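The memory issue can be seen in a small sketch (names are mine, not from the repo): without `.detach()`, the computation graph of frame t chains back through every earlier frame, so memory grows with sequence length; detaching the fed-back prediction keeps each frame's graph independent.

```python
import torch

def rollout(model, frames, init_box):
    """Roll a tracker over (frame, groundtruth) pairs, detaching the
    predicted box before feeding it to the next frame so the graph
    never spans more than one frame."""
    prev_box = init_box
    losses = []
    for frame, gt in frames:
        pred = model(frame, prev_box)
        losses.append(((pred - gt) ** 2).mean())
        # Cut the graph here: frame t+1 sees prev_box as a constant.
        prev_box = pred.detach()
    return torch.stack(losses).mean()
```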

NJiHyeon commented 7 months ago

Thank you so much for your quick response!!

For example, if I get data of shape [batch_size, 4] as input when I train, is it impossible to train by predicting and updating xmin, then predicting and updating ymin, and so on (as in the val/test path)?

NJiHyeon commented 7 months ago

In the ARTrack_seq model, I think training works like the val/test head part of the ARTrack model.

AlexDotHam commented 7 months ago

I think it is possible, but if you do this you need to detach the image feature and xmin when updating ymin. That keeps the computation graph from accumulating across the coordinate predictions.
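A minimal sketch of that suggestion (hypothetical names; the real head differs): predict the four coordinates one at a time, detaching both the previous coordinate and the image feature before each feedback step.

```python
import torch

def predict_box_autoregressive(step_fn, feat, num_coords=4):
    """Predict xmin, ymin, xmax, ymax one token at a time, detaching the
    fed-back coordinate and the image feature so the graph does not
    chain across the four predictions."""
    coords = []
    prev = torch.zeros(())        # start token (assumption)
    for _ in range(num_coords):
        c = step_fn(feat, prev)   # predict the next coordinate
        coords.append(c)
        prev = c.detach()         # cut the graph before feeding back
        feat = feat.detach()      # detach the image feature as advised
    return torch.stack(coords)
```

Gradients then flow into the shared feature only through the first prediction of each step, which bounds memory at the cost of a biased gradient.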

NJiHyeon commented 7 months ago

Hello, when training the artrack_seq model, why do you call self.actor.explore(data) and pass the result to the loss, instead of calling forward_pass and computing the loss directly as in artrack? And why is the network run once in the explore() part and again in the loss part?

AlexDotHam commented 7 months ago

Explore is a kind of simulator of evaluation: we use it to obtain the tracker's real actions. For example, given a sequence of frames 0, 1, 2, 3, 4, ..., 32, to get the tracker's real action on frame 2 we must first evaluate the tracker on frame 0 to get its bounding box, then prompt frame 1 with it to get that frame's box, and finally use the frame 0 and frame 1 boxes as the prompt for frame 2. The explore part does exactly this. However, if we computed the loss directly inside explore, there would be a problem: the computation graph for the frame-2 loss would include frames 0 and 1, causing graphics memory to expand rapidly. As a compromise, we save the trajectory at each frame so that the coordinate tokens of different frames are detached from each other, and introduce forward_pass to perform the actual training.
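The two-phase scheme described above can be sketched as follows (function names follow the discussion, but this is my simplified illustration, not the repo's code): phase 1 rolls the tracker over the sequence without building a graph and saves the trajectory; phase 2 trains each frame independently, conditioning on the saved, constant prompts.

```python
import torch

def explore(model, frames, init_box):
    """Phase 1: simulate evaluation, collecting the predicted trajectory
    without building any computation graph."""
    traj = [init_box]
    with torch.no_grad():
        for frame in frames:
            traj.append(model(frame, traj[-1]))
    return traj

def forward_pass(model, frames, gts, traj):
    """Phase 2: train each frame against its groundtruth, prompting with
    the saved trajectory (constants, so graphs stay per-frame)."""
    loss = 0.0
    for t, (frame, gt) in enumerate(zip(frames, gts)):
        pred = model(frame, traj[t])  # traj[t] carries no grad history
        loss = loss + ((pred - gt) ** 2).mean()
    return loss / len(frames)
```

The network runs twice (once per phase), which is the redundancy the question points at; the payoff is that memory stays constant in sequence length.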

NJiHyeon commented 7 months ago

I understand what you mean, thank you so much.

NJiHyeon commented 7 months ago

(Frame 0 → Frame 1 → Frame 2 → … → Frame 31 → Frame 32) <- Is this how you predict the coordinates? Then would it be impossible to predict the coordinates for frame 32 after extracting the features for the coordinates of frames 0 through 31 all at once (like time-series prediction)?

ARTrackV2 commented 7 months ago

As you said, I save the predicted coordinates in the way you describe. The real reason is that we want to simulate evaluation-time tracking, to reduce the bias between traditional training and inference. At evaluation time, frame 1's search region is cropped according to frame 0's prediction. So if we cropped the region using the groundtruth during training, the tracker could not learn how to refine its tracking in real evaluation scenarios; it would only learn to overfit to the groundtruth. Groundtruth labels create a training/inference gap, because at inference frame 1's region is cropped based on the previous prediction, while such training implicitly assumes the tracking is perfect.
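The training/inference gap comes down to where the search window is centred. A minimal sketch (my own illustration; the scale convention is an assumption, not taken from the repo) of cropping the frame-t search region around the frame t-1 *prediction* rather than the groundtruth:

```python
def search_region(prev_pred_box, scale=2.0):
    """Square search window centred on the previous *predicted* box
    (xmin, ymin, xmax, ymax), with side `scale` times the box's longer
    side. Using prev_pred_box instead of the groundtruth box during
    training matches what the tracker will actually see at inference."""
    xmin, ymin, xmax, ymax = prev_pred_box
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    side = scale * max(xmax - xmin, ymax - ymin)
    return (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)
```

When the previous prediction drifts, the window drifts with it, so the tracker is trained on exactly the off-centre crops it must recover from at test time.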