MCG-NJU / MeMOTR

[ICCV 2023] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
https://arxiv.org/abs/2307.15700
MIT License

Can you please point to the code that tracks during inference? #18


sawhney-medha commented 4 months ago

I am confused about how tracking is performed during inference on videos longer than the sample length (in frames). What part of the code connects those shorter tracks?

HELLORPG commented 4 months ago

Our MeMOTR is an RNN-like model: it processes the video frame by frame, just as an RNN processes a sentence word by word. So, in theory, the processing length is unlimited. Therefore, videos longer than the training sample length make no difference at inference; processing is still frame by frame.

So, we do not connect several shorter tracks into an overall trajectory. At time step t during inference, we already have the trajectories over the past t-1 frames and only need to associate those tracks with the targets in the current frame. Frame-by-frame is the key, not clip-by-clip (or, you could say, shorter-tracks-by-shorter-tracks).

However, inconsistent lengths between training and inference can indeed cause issues for the model. I discuss this topic further in my recent work.

hxchashao commented 1 month ago

Hello, may I ask about the inconsistent lengths during training and inference that you mentioned? Can you explain in more depth? My training videos are 300 frames and my test videos are 18,000 frames. Once a test video exceeds about 1,000 frames, there is serious tracking confusion. Is this caused by the inconsistent lengths of the training and test data? Have you encountered such problems in your experiments?

HELLORPG commented 1 month ago

I think that's not what I meant by inconsistent length. Let me explain: during training, we sample at most 5 frames, so the longest occlusion seen in training does not exceed 3 frames. During inference, however, we need to handle very long occlusions (e.g., 30-frame occlusions on DanceTrack, which is determined by the parameter MISS_TOLERANCE). This 3-frame vs. 30-frame occlusion gap is the inconsistency I was pointing out.

In your description, although your videos are 300 frames long, training still only ever sees 5-frame clips, so the inconsistency is not between 300 and 18,000. That said, we have never tried inference on 18,000-frame videos, because data of that length is extremely rare in MOT benchmarks, and videos that long may indeed cause unexpected behavior. Could you describe the problem in more detail? I do not quite understand the specific tracking confusion you are seeing.