TRI-ML / PF-Track

Implementation of PF-Track

Question about the pipeline inputs #2

Closed jianingwangind closed 1 year ago

jianingwangind commented 1 year ago

Thanks for sharing the great work!

May I ask whether you have tried using the 3D detection results as network inputs, and perhaps any denoising techniques as well? Thanks.

ziqipang commented 1 year ago

@jianingwangind Thanks for the question! We haven't tried to use detection results directly as the input. The following are my own thoughts.

  1. From a research perspective, end-to-end tracking is more interesting than tracking-by-detection. So I primarily focused on an end-to-end framework (as a Ph.D. student striving for papers, lol).
  2. You can replace any query-based detection head. In the code to be released, we will demonstrate how we make PF-Track compatible with both PETR and DETR3D heads. I hope that people can easily follow this practice and plug in newer and stronger detection heads (e.g., BEVFormer) in the future.
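To make the idea of swapping heads concrete, here is a minimal sketch of what such a pluggable interface could look like. All class and function names here are illustrative assumptions, not the actual PF-Track API; the real release defines its own interface.

```python
# Hypothetical sketch of a common interface for swapping query-based
# detection heads inside a tracker; names are illustrative, NOT the
# actual PF-Track API.
from abc import ABC, abstractmethod


class QueryBasedHead(ABC):
    """The tracker only depends on this contract, not on a concrete head."""

    @abstractmethod
    def forward(self, img_feats, queries, ref_points):
        """Return (updated_queries, boxes, scores) for the current frame."""


class ToyPETRStyleHead(QueryBasedHead):
    def forward(self, img_feats, queries, ref_points):
        # A real PETR-style head would attend to position-encoded image
        # features; this toy version just echoes the queries and emits
        # dummy boxes/scores to illustrate the shape of the contract.
        boxes = [[x, y, 0.0] for (x, y) in ref_points]
        scores = [1.0 for _ in queries]
        return queries, boxes, scores


def track_one_frame(head, img_feats, queries, ref_points):
    # The tracker calls the head only through the shared interface, so a
    # DETR3D-style or BEVFormer-style head could be dropped in instead.
    return head.forward(img_feats, queries, ref_points)
```

The point of the abstraction is that the tracking logic never touches head-specific details such as how reference points are generated, which is exactly where head swaps tend to break.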
jianingwangind commented 1 year ago

@ziqipang Thanks for your quick reply.

I am asking because I saw that on the nuScenes tracking leaderboard, tracking-by-detection methods were outperforming query-based trackers like MUTR3D by a large margin, and also because of the success of MOTRv2 in 2D tracking.

I have also tried to replace the DETR3D head in MUTR3D with the PETR head, but due to the different query/ref_pts generation procedures, it unfortunately just didn't work, so it would be great to have a look at your code :) Thanks again!

ziqipang commented 1 year ago

@jianingwangind This is a good observation: a stronger detector can improve AMOTA regardless of whether the tracker itself is better. I hope my code will help in forming a general interface for adapting query-based detection heads.

jianingwangind commented 1 year ago

Sure it will. So the code release may happen in mid-March? Looking forward to it.

jianingwangind commented 1 year ago

@ziqipang Regarding the pseudo-code for track extension, I have two questions:

In B.2:

  1. "Tracker extension relies more on the frames with higher confidence", but in the pseudo-code, only the last frame's motion predictions are used. What happens if the detections from the last frame are also low-confidence?
  2. Also, in line 13, if you add the center positions for frame t to the movement from frame t to t+1, shouldn't the result be the center positions for frame t+1? Is this a typo, or do I understand it incorrectly?

Thanks.

ziqipang commented 1 year ago

@jianingwangind Thanks for the careful reading! I think the pseudo-code needs improvement in my next version.

  1. From your feedback, I found a typo. Line 4 should be $C_t^i \leftarrow C_{t-1}^i + M_{t:t+1}^{t-1, i}$. This line indicates that we ignore the low-confidence detections and use previous predictions for propagation.
  2. Regarding your first question, please look at line 6. We replace the motion predictions in low-confidence frames with what we have. Thus, the predictions are not always from the latest frame, though the notation makes it appear so.
  3. The block starting at line 13 is the case for high-confidence objects. The result is indeed a propagated position for frame t+1. The algorithm's objective is to propagate the queries, and the output is the guessed positions for frame t+1. (I will update the notation to avoid confusion. You may think of the left-hand side of lines 4 and 13 as $Prop(C_t)$.)
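Under my reading of this thread, the confidence gating could be sketched as follows. The function name, the threshold value, and the 2D centers are all illustrative assumptions, not the paper's actual code or notation:

```python
# Illustrative sketch of confidence-gated track extension as discussed
# above; names, threshold, and 2D centers are assumptions, not the
# paper's implementation.

def propagate_centers(det_centers, prev_prop_centers, motions, scores,
                      conf_thresh=0.4):
    """Guess each track's center at frame t+1.

    det_centers[i]       -- detected center of track i at frame t
    prev_prop_centers[i] -- center previously propagated to frame t
    motions[i]           -- predicted movement of track i from t to t+1
    scores[i]            -- detection confidence of track i at frame t
    """
    propagated = []
    for det, prev, (dx, dy), s in zip(det_centers, prev_prop_centers,
                                      motions, scores):
        # High confidence: trust the fresh detection at frame t (line 13);
        # low confidence: ignore it, fall back to the previously propagated
        # center, and keep extending with the motion prediction (line 4).
        x, y = det if s >= conf_thresh else prev
        # Either way, the output is a position for frame t+1.
        propagated.append((x + dx, y + dy))
    return propagated
```

Both branches produce a frame t+1 position; the only difference is whether the frame-t detection was trusted as the base, which matches the $Prop(C_t)$ reading above.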
jianingwangind commented 1 year ago

Thanks for your detailed explanation; now I understand.

ziqipang commented 1 year ago

@jianingwangind Please check out our latest release, and refer to the documentation at https://github.com/TRI-ML/PF-Track/blob/main/documents/detection_heads.md for integrating with other detection heads.

jianingwangind commented 1 year ago

@ziqipang A thousand thanks!