(1) We report scores after Tracktor pre-processing. One surprising result is that we achieve higher performance than Tracktor++, even though Tracktor++ trains a re-identification network on ground-truth object track labels. Note that they have since released a newer version of Tracktor++ with an improved detector that performs better.
(2) In some sense, yes: Tracktor pre-processing only initializes new tracks at public detections, but it then uses a much stronger detector to regress the position of each object in successive frames. For Tracktor this makes sense, since detector-based regression is core to their proposed method, but it also makes comparison against other methods difficult. I believe other approaches instead apply a detector directly to improve the input detections. We may have conflated Tracktor's pre-processing with the pre-processing used by those other methods in our paper's description.
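To make the distinction concrete, here is a minimal Python sketch of this kind of public-detection gating: new tracks may only be born at a public detection, while detections that continue an existing track pass through. The function names, the IoU threshold, and the matching rule are assumptions for illustration, not Tracktor's actual implementation:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_new_detections(private_dets, public_dets, existing_track_boxes,
                          iou_thresh=0.5):
    """Keep a private detection only if it either continues an existing track
    or overlaps a public detection (i.e., new tracks start at public dets)."""
    kept = []
    for det in private_dets:
        continues_track = any(iou(det, t) >= iou_thresh
                              for t in existing_track_boxes)
        near_public = any(iou(det, p) >= iou_thresh for p in public_dets)
        if continues_track or near_public:
            kept.append(det)
    return kept
```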
(3) The pre-trained YOLOv5 model can be downloaded from their repository here: https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5x.pt
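For reference, a minimal loading sketch via `torch.hub` (the `'frame.jpg'` path is a placeholder; note that Option A may fetch weights from the latest release rather than v6.0, so Option B pins the exact checkpoint linked above):

```python
import torch

# Option A: load directly from the hub (downloads weights on first use).
model = torch.hub.load('ultralytics/yolov5', 'yolov5x', pretrained=True)

# Option B: pin the exact v6.0 checkpoint by pointing at the downloaded file.
# model = torch.hub.load('ultralytics/yolov5', 'custom', path='yolov5x.pt')

model.eval()
results = model('frame.jpg')   # placeholder image path
detections = results.xyxy[0]   # rows of (x1, y1, x2, y2, conf, cls)
```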
Thanks for the great work! Hopefully this paper inspires many more unsupervised approaches. I have a few general questions for @uakfdotb if you have time.
1.) I've noticed that most other public-detection papers trace back to CenterTrack, which states that bounding boxes can be propagated from existing tracks, refined, or deleted (on a score change). SiamMOT and OUTrack_fm appear to follow the same procedure (roughly the loop sketched after these questions). Since your approach is purely associational, are the averaged public-detection scores reported in the paper computed on the true predicted bounding boxes from the three public detectors, or after Tracktor pre-processing?
2.) I'm not sure I understand the DPM and FRCNN pre-processing steps: is Tracktor's FRCNN (described in their supplementary A.1) used to refine the detections? If so, I definitely agree that defeats the purpose of public detections a bit.
3.) It's noted that an off-the-shelf YOLOv5 trained on COCO is used during training. I'm hoping these weights could be provided, so that future unsupervised work can be evaluated fairly (and improvements in performance aren't due to better pseudo-labeling).
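For clarity, this is roughly the propagate/refine/delete loop I understand these methods to use. It is only a sketch: `detector.refine` stands in for whatever regression/re-scoring head a given method uses, and the motion model and threshold are assumptions, not any paper's exact procedure:

```python
from dataclasses import dataclass

@dataclass
class Track:
    box: tuple       # (x1, y1, x2, y2)
    velocity: tuple  # (dx, dy) per-frame motion estimate
    score: float

def propagate(track):
    """Predict the track's box one frame forward from its motion estimate."""
    x1, y1, x2, y2 = track.box
    dx, dy = track.velocity
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def step(tracks, frame, detector, keep_thresh=0.4):
    """One frame of a propagate -> refine -> delete update."""
    survivors = []
    for track in tracks:
        box = propagate(track)                    # propagate from existing track
        box, score = detector.refine(frame, box)  # refine the box and re-score it
        if score >= keep_thresh:                  # keep refined box...
            track.box, track.score = box, score
            survivors.append(track)
        # ...else delete the track when its score drops too low
    return survivors
```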