MCG-NJU / MeMOTR

[ICCV 2023] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
https://arxiv.org/abs/2307.15700
MIT License

Input format - Training on one frame of the video clip? #16

Open sawhney-medha opened 5 months ago

sawhney-medha commented 5 months ago

Can you please elaborate on "The batch size is set to 1 per GPU, and each batch contains a video clip with multiple frames. Within each clip, video frames are sampled with random intervals from 1 to 10."

Does this mean the actual model is trained on one frame at a time randomly selected from the clip? I am trying to understand the actual input to the transformer encoder and decoder.

And what is the role of no_grad_frames?

Thank you!!

HELLORPG commented 5 months ago

In our experiments, batch_size refers to the number of video clips (samples). So "the batch size is set to 1 per GPU" means that we process one video clip (which contains multiple frames) on a single GPU. And within each clip, the inter-frame interval is a random number from 1 to 10.
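For illustration only, here is a minimal sketch of that clip-sampling idea. The function name and signature are hypothetical (this is not the repo's actual sampler), and it assumes each gap between consecutive sampled frames is drawn independently:

```python
import random

def sample_clip_indices(num_video_frames: int, clip_len: int,
                        min_interval: int = 1, max_interval: int = 10):
    """Pick `clip_len` frame indices from one video so that consecutive
    sampled frames are separated by a random gap of 1..10 frames."""
    gaps = [random.randint(min_interval, max_interval) for _ in range(clip_len - 1)]
    span = sum(gaps)                                    # total distance covered by the clip
    start = random.randint(0, max(num_video_frames - 1 - span, 0))
    indices = [start]
    for g in gaps:
        indices.append(indices[-1] + g)
    return indices

# e.g. for a 4-frame clip from a 200-frame video:
# sample_clip_indices(200, 4)  might return  [53, 58, 61, 71]
```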

The no_grad_frames setting means those frames are forwarded in grad-free mode: https://github.com/MCG-NJU/MeMOTR/blob/f46ae3d0503b0579a08a11672295524315e74f06/train_engine.py#L217-L230 However, in our experiments we deprecated this part; I just have not deleted the code from this repo yet. My suggestion is not to pay attention to this process, since enabling it will not bring performance improvements.
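To clarify what "forwarded in grad-free mode" means, here is a simplified sketch (the function and the model/criterion signatures are placeholders, not the repo's real API): the first no_grad_frames frames only warm up the track states under torch.no_grad(), so they contribute nothing to the loss or the backward pass.

```python
import torch

def forward_clip(model, criterion, clip_frames, clip_targets, no_grad_frames=0):
    """Hypothetical sketch: run a clip where the first `no_grad_frames`
    frames are processed without recording gradients."""
    tracks, loss = None, 0.0
    for t, frame in enumerate(clip_frames):
        if t < no_grad_frames:
            with torch.no_grad():                     # grad-free warm-up frame
                outputs, tracks = model(frame, tracks)
        else:
            outputs, tracks = model(frame, tracks)    # normal forward
            loss = loss + criterion(outputs, clip_targets[t])
    return loss
```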

sawhney-medha commented 5 months ago

Thank you for the prompt reply!! This is helpful.

The input to the model (backbone and encoder/decoder) is a single frame at a time, right? So the way we use the temporal information from the video clip is through the track information/embedding and memory. Am I understanding correctly?

Thank you again :)

HELLORPG commented 5 months ago

Yes. We process only one frame at each time step. The track embedding will propagate the temporal information.

The only difference is that during training, we process multiple time steps before calling optimizer.step(). In this way, the model can learn temporal modeling.

HELLORPG commented 5 months ago

In other words, in each training iteration we process T time steps.
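A minimal sketch of that training iteration, with hypothetical names and simplified signatures (not the actual train_engine.py code): the T frames of a clip are processed one by one, the track state carries the temporal information forward, the per-frame losses are summed, and a single optimizer.step() follows.

```python
def train_one_clip(model, criterion, optimizer, clip_frames, clip_targets):
    """Hypothetical sketch of one training iteration over a T-frame clip."""
    optimizer.zero_grad()
    tracks = None                                        # no trajectories before the first frame
    total_loss = 0.0
    for frame, targets in zip(clip_frames, clip_targets):  # T time steps
        outputs, tracks = model(frame, tracks)              # one frame at a time
        total_loss = total_loss + criterion(outputs, targets)
    total_loss.backward()                                # gradients flow back to the first frame
    optimizer.step()                                     # one update per video clip
    return float(total_loss)
```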

sawhney-medha commented 4 months ago

Thank you so much! Can you also please explain how the "process_single_frame" function works? I want to understand how tracks are generated and how sub-clips are connected to each other during prediction. Thank you!!

HELLORPG commented 4 months ago

Our model, as an online tracker, processes the image sequence frame by frame. So the function criterion.process_single_frame computes the criterion for a single frame at a time. For example, as shown here: https://github.com/MCG-NJU/MeMOTR/blob/f46ae3d0503b0579a08a11672295524315e74f06/train_engine.py#L201 We call this function (criterion.process_single_frame) T times in each training iteration, where T is the sampling length of each video clip (from 2 to 5 in our setting on DanceTrack).

At the same time, criterion.process_single_frame also generates the track information (embed & ref_pts, etc.) for the next time step, as shown here: https://github.com/MCG-NJU/MeMOTR/blob/f46ae3d0503b0579a08a11672295524315e74f06/train_engine.py#L223-L227 It updates the tracked trajectories previous_tracks and the newborn trajectories new_tracks. Then they are combined into the overall tracks here: https://github.com/MCG-NJU/MeMOTR/blob/f46ae3d0503b0579a08a11672295524315e74f06/train_engine.py#L229-L230

The tracks are then fed into the processing of the next frame, as here: https://github.com/MCG-NJU/MeMOTR/blob/f46ae3d0503b0579a08a11672295524315e74f06/train_engine.py#L222 This connects the frames in the video clip by propagating the trajectories frame by frame. Therefore, our model realizes a fully end-to-end training strategy and backpropagates the gradients all the way to the beginning (the first frame).
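A rough sketch of that frame-by-frame propagation, under the caveat that the argument lists and the way the trajectories are merged are simplified placeholders rather than the repo's real API (only the name criterion.process_single_frame comes from the source):

```python
def run_clip(model, criterion, clip_frames, clip_targets):
    """Hypothetical sketch: each step updates the existing and newborn
    trajectories, merges them, and feeds the merged set into the next frame."""
    tracks = None                                         # no trajectories before frame 0
    for frame, targets in zip(clip_frames, clip_targets):
        outputs = model(frame, tracks)                    # detect + track the current frame
        # matching and loss for this frame; also yields the updated trajectories
        previous_tracks, new_tracks = criterion.process_single_frame(
            outputs, targets, tracks)
        tracks = previous_tracks + new_tracks             # combined set for the next time step
    return tracks
```

Because the merged tracks of frame t become the input of frame t+1, the gradient of the final loss can flow backward through every time step of the clip, which is what makes the clip-level training end-to-end.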