MCG-NJU / MOTIP

Multiple Object Tracking as ID Prediction
https://arxiv.org/abs/2403.16848
Apache License 2.0

Question about the paper: Parallelized Training #10

Closed cygbbhx closed 3 months ago

cygbbhx commented 4 months ago

Hello, I read your paper and have several questions.

My question is about Section 4.2 / Appendix D - parallelized training. I don't think I have fully understood why the use of the ID Decoder enables parallelized training.

  1. From my understanding, the ID Decoder computes its predictions using historical trajectory information (and the current detection tokens). Doesn't the historical trajectory still require serial processing?

  2. In the paper, it is noted that only 4 frames are used for gradient recording while the others are passed through in gradient-free mode. In Appendix D, I see that 2 frames out of 5 are processed with gradients. How are these numbers (the number of frames with gradients / the number of forward passes) determined?

  3. Related to Question 2, why are only a few frames selected for gradients? Does this mean the ID Decoder enables more robust learning, which allows sparser training of the detector?

Thank you for your great work and contribution!

HELLORPG commented 4 months ago

Thank you for your interest in our work. For your questions:

  1. The historical trajectory modeling does not need serial processing. As implemented in trajectory_modeling.py, we only use an FFN for trajectory modeling on DETR's output embeddings, so all frames can be processed in parallel (see the first sketch after this list).
  2. In Appendix D, Fig. 7 is just a schematic made for simplicity, so the number of frames there does not represent the real setting in our experiments. In our experiments, no matter how many frames are input, only 4 frames are used for gradient recording. This is due to the CUDA memory limitation while following the original Deformable DETR setting (BS=4 for each GPU).
  3. As I discussed in 2., the CUDA memory limitation is the main reason. In this setting (4 frames for gradient recording), most of our experiments can run on 24GB GPUs without gradient checkpointing. As for the part you mentioned, "Does this mean the ID Decoder enables more robust learning, which allows sparser training of the detector?", I do not quite understand what this means. To be honest, I did this (sparse training of the DETR) just to reduce resource usage and speed up the model training (see the second sketch below).
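
For point 1, here is a minimal sketch of the idea (the module name, dimensions, and tensor shapes are illustrative placeholders, not the exact code in trajectory_modeling.py):

```python
import torch
import torch.nn as nn

class TrajectoryFFN(nn.Module):
    """Illustrative pointwise FFN that maps DETR output embeddings to trajectory embeddings."""
    def __init__(self, dim: int = 256, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, detr_embeddings: torch.Tensor) -> torch.Tensor:
        # detr_embeddings: (T, N, C) = (frames, targets per frame, channels).
        # A pointwise FFN has no cross-frame dependency, so all T frames
        # are transformed in a single parallel pass, no frame-by-frame loop.
        return self.net(detr_embeddings)

ffn = TrajectoryFFN()
detr_out = torch.randn(30, 20, 256)   # e.g. a 30-frame clip with 20 targets per frame
traj_embeddings = ffn(detr_out)       # shape (30, 20, 256), computed in one shot
```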
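
For points 2 and 3, a rough sketch of the gradient-free scheme under the same caveat (the function, the random frame selection, and NUM_GRAD_FRAMES are illustrative assumptions, not the exact implementation):

```python
import random
import torch

NUM_GRAD_FRAMES = 4  # frames per clip whose DETR forward pass records gradients

def forward_detr_over_clip(detr, frames):
    """frames: list of per-frame inputs from one training clip."""
    grad_idx = set(random.sample(range(len(frames)), k=min(NUM_GRAD_FRAMES, len(frames))))
    outputs = []
    for t, frame in enumerate(frames):
        if t in grad_idx:
            outputs.append(detr(frame))    # gradients recorded for this frame
        else:
            with torch.no_grad():          # gradient-free pass, saves CUDA memory
                outputs.append(detr(frame))
    return outputs
```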
cygbbhx commented 3 months ago

Thank you for your kind response! To ensure I understand correctly, I have two more questions I'd like to clarify:

  1. I understand that the target embeddings can be computed in parallel. According to the paper, the historical trajectory includes both target embeddings and ID embeddings. Is it then correct to assume that ID embedding extraction is not parallelized, since the ID Decoder requires ID embeddings from previous trajectories? Even so, since the DETR output embeddings can be processed in parallel, does this mean the ID Decoder's processing time is relatively short compared to DETR's, so there is still an overall speed-up?

  2. Regarding sparser training: would updating DETR on every frame enhance the overall performance? I'm curious whether previous related works typically relied on frequent updates of DETR for optimal performance.

HELLORPG commented 3 months ago

  1. ID embedding prediction is not parallelized during inference (of course, since tracking is online), yet it is parallelized during training. The ID embeddings from previous trajectories can be obtained directly from the GTs, so there is no need to wait for previous frames to be processed. The only thing we need to do is use attention masks during training to make sure we won't see future ID embeddings when processing a specific frame (see the sketch after this list).
  2. To be honest, I don't know what would happen. According to my experience, this may not significantly improve the performance of the model (I tried this a long time ago when T=19; it only brought < 1.0 HOTA improvement). Training more frames at once may be analogous to increasing the batch_size (although not exactly equivalent) in DETR models. Some newer DETR works (such as DAB-DETR) reduce the batch_size (= 16) instead of using a larger one (> 32). I think this is a trade-off: when you use a larger batch_size, you have to consider whether it is worth it.
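
To illustrate point 1, here is a minimal sketch of such an attention mask (the boolean convention follows PyTorch's attn_mask, where True means a position is blocked; it is an illustration, not the exact mask used in the ID Decoder):

```python
import torch

def future_id_mask(num_frames: int) -> torch.Tensor:
    """Boolean mask (True = blocked): queries at frame t may attend only to
    ID embeddings from frames strictly before t, never to future frames."""
    # Entry [i, j] is True whenever key frame j is not earlier than query frame i.
    return torch.triu(torch.ones(num_frames, num_frames, dtype=torch.bool), diagonal=0)

mask = future_id_mask(5)
# tensor([[ True,  True,  True,  True,  True],
#         [False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True]])
# Passed as an attention mask, this lets all frames be trained in one parallel
# pass while each frame only "sees" ground-truth ID embeddings of earlier frames.
# Note: the first row is fully masked (frame 0 has no history), so such cases
# need special handling in practice, e.g. a placeholder/newborn token.
```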
cygbbhx commented 3 months ago

I see! Now everything makes sense. Thank you, I appreciate your detailed response.