cheind / py-motmetrics

:bar_chart: Benchmark multiple object trackers (MOT) in Python

Difference between official MOTChallenge code and this (with examples) #126

Open JonathonLuiten opened 3 years ago

JonathonLuiten commented 3 years ago

When I run the following file on this repo and the official MOTChallenge repo (https://github.com/dendorferpatrick/MOTChallengeEvalKit) I receive differing results.

Lif_T.zip

|        | MOTA  | MOTP  | IDF1  | IDP   | IDR   | Rcll | Prcn   | FP   | FN     | MT  | PT  | ML  | FM   | IDSW |
|--------|-------|-------|-------|-------|-------|------|--------|------|--------|-----|-----|-----|------|------|
| MOTCha | 66.98 | 89.09 | 72.35 | 88.77 | 61.06 | 68   | 98.85  | 2655 | 107803 | 679 | 595 | 364 | 1153 | 791  |
| PYMOT  | 67.0  | 0.109 | 72.4  | 88.8  | 61.1  | 68.0 | 98.90% | 2663 | 107797 | 693 | 581 | 364 | 1237 | 781  |

The float numbers don't say much because the PYMOT numbers are rounded too aggressively. However, the integer numbers such as FP / FN / IDSW are clearly different.

Initial experiments suggest that the 'preprocessing' may be one cause of the difference (https://github.com/dendorferpatrick/MOTChallengeEvalKit/blob/master/matlab_devkit/utils/preprocessResult.m).

cheind commented 3 years ago

Thanks for the report. Indeed, we currently have two applications that behave differently in this respect. The first one is the most generic one:

https://github.com/cheind/py-motmetrics/blob/d261d16cca263125b135571231011ccf9efd082b/motmetrics/apps/eval_motchallenge.py#L81

and the other one was prepared via a PR to include preprocessing https://github.com/cheind/py-motmetrics/blob/d261d16cca263125b135571231011ccf9efd082b/motmetrics/apps/evaluateTracking.py#L132

Ideally those two should be merged as probably 80% of the code overlaps.
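
For context, the evaluation core both apps share looks roughly like this (a minimal sketch using the public motmetrics API; the file paths are placeholders):

```python
# Minimal sketch of the shared evaluation core that both apps wrap.
# 'gt/gt.txt' and 'test.txt' are placeholder paths in MOT15-2D format.
import motmetrics as mm

gt = mm.io.loadtxt('gt/gt.txt', fmt='mot15-2D')  # ground-truth tracks
ts = mm.io.loadtxt('test.txt', fmt='mot15-2D')   # tracker output

# Build an accumulator from frame-wise IoU matching (distance = 1 - IoU).
acc = mm.utils.compare_to_groundtruth(gt, ts, 'iou', distth=0.5)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=mm.metrics.motchallenge_metrics, name='full')
print(mm.io.render_summary(summary,
                           formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```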

jvlmdr commented 3 years ago

I had a look and it seems like it is indeed due to the preprocessing. Specifically, the matlab eval kit does an initial pass with independent per-frame matching and removes all predicted boxes that are matched to a ground-truth box belonging to a set of distractor classes. For the "Lif_T" data provided, I found 14 matches to the "distractor" class (ID 8) and 8 matches to the "occluder_on_grnd" class (ID 10). This seems about the right size to explain the difference in FP and FN, but I will need to dig a bit deeper to be certain.
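
To make the removal pass concrete, here is a rough sketch of what the devkit's preprocessing appears to do (the helper names and the 0.5 IoU threshold are my assumptions, and the class set is the one from my follow-up experiment below; this is not the actual MATLAB code):

```python
# Sketch of a devkit-style preprocessing pass: per frame, solve an
# independent IoU assignment between predictions and ALL ground-truth
# boxes, then drop predictions whose match belongs to a distractor class.
import numpy as np
from scipy.optimize import linear_sum_assignment

DISTRACTOR_CLASSES = {2, 7, 8, 12}  # devkit distractor set (see below)

def iou_matrix(gt_boxes, pred_boxes):
    """IoU between [x, y, w, h] boxes, shape (n_gt, n_pred)."""
    ious = np.zeros((len(gt_boxes), len(pred_boxes)))
    for i, (gx, gy, gw, gh) in enumerate(gt_boxes):
        for j, (px, py, pw, ph) in enumerate(pred_boxes):
            ix = max(0.0, min(gx + gw, px + pw) - max(gx, px))
            iy = max(0.0, min(gy + gh, py + ph) - max(gy, py))
            inter = ix * iy
            union = gw * gh + pw * ph - inter
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious

def drop_distractor_matches(gt_boxes, gt_classes, pred_boxes, iou_thresh=0.5):
    """Return indices of predictions to keep in this frame."""
    ious = iou_matrix(gt_boxes, pred_boxes)
    rows, cols = linear_sum_assignment(-ious)  # maximize total IoU
    removed = {j for i, j in zip(rows, cols)
               if ious[i, j] >= iou_thresh
               and gt_classes[i] in DISTRACTOR_CLASSES}
    return [j for j in range(len(pred_boxes)) if j not in removed]
```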

jvlmdr commented 3 years ago

In general, it would be good if we could add support for "ignore" regions in the annotations.

cheind commented 3 years ago

Hey Jack, thanks for the investigation. By ignore you mean: ignore an annotation if it is matched in accumulator.update?

jvlmdr commented 3 years ago

@cheind We would need to discuss the design of ignore regions. It could be a region that is matched to 1 prediction, or it could be a region that excludes multiple predictions.
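
For the second variant, a hypothetical sketch of what such a region could look like (all names and the 0.5 area threshold are made up for illustration, pending the actual design):

```python
# Hypothetical design: an ignore region that can swallow any number of
# predictions. A prediction is dropped when most of its own area lies
# inside some ignore region. Boxes/regions are [x, y, w, h].
def filter_ignored(pred_boxes, ignore_regions, area_thresh=0.5):
    def inside_fraction(box, region):
        bx, by, bw, bh = box
        rx, ry, rw, rh = region
        ix = max(0.0, min(bx + bw, rx + rw) - max(bx, rx))
        iy = max(0.0, min(by + bh, ry + rh) - max(by, ry))
        return (ix * iy) / (bw * bh) if bw * bh > 0 else 0.0
    return [b for b in pred_boxes
            if all(inside_fraction(b, r) < area_thresh
                   for r in ignore_regions)]
```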

jvlmdr commented 3 years ago

Update on the difference in results. I performed an initial, independent, per-frame matching and excluded any predictions that matched ground-truth boxes in classes {2, 7, 8, 12}. This brought the FN and FP numbers closer, but they still do not match exactly:

|        | FP   | FN     | MT  | PT  | ML  | FM   | IDSW |
|--------|------|--------|-----|-----|-----|------|------|
| matlab | 2655 | 107803 | 679 | 595 | 364 | 1153 | 791  |
| py-mot | 2652 | 107800 | 693 | 581 | 364 | 1239 | 783  |

jvlmdr commented 3 years ago

Another update: It seems that the matlab code only preserves identities from the previous frame when calculating the correspondence for MOTA. This results in an inflated number of identity switches and a sub-optimal MOTA score. I modified the py-mot toolkit to do the same (although I believe it is not the intended behaviour) and obtained the following:

|        | FP   | FN     | MT  | PT  | ML  | FM   | IDSW |
|--------|------|--------|-----|-----|-----|------|------|
| matlab | 2655 | 107803 | 679 | 595 | 364 | 1153 | 791  |
| py-mot | 2655 | 107803 | 692 | 582 | 364 | 1153 | 791  |

(To see this in the matlab code, check the usage of the M[t] variable in clearMOTMex.cpp.)

Now only the MT/PT/ML measures remain different.
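
For anyone trying to reproduce this, a highly simplified sketch of one correspondence step (not the actual py-motmetrics internals). Feeding it only frame t-1's matches as `memory` mimics the MATLAB behaviour; feeding it the last known match per ground-truth id gives the intended CLEAR MOT behaviour:

```python
# Simplified sketch of one CLEAR MOT correspondence step. `memory` maps
# gt_id -> hyp_id of previously accepted matches; what you put in it is
# exactly the policy difference discussed above.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frame(gt_ids, hyp_ids, dist, memory, thresh=0.5):
    """dist[(g, h)] is the distance between gt g and hypothesis h."""
    matches = {}
    # 1. Remembered pairs win first, if both ids are present and still close.
    for g, h in memory.items():
        if g in gt_ids and h in hyp_ids and dist.get((g, h), np.inf) <= thresh:
            matches[g] = h
    # 2. Optimal assignment over the remaining ids.
    free_g = [g for g in gt_ids if g not in matches]
    free_h = [h for h in hyp_ids if h not in matches.values()]
    if free_g and free_h:
        cost = np.array([[dist.get((g, h), 1e9) for h in free_h]
                         for g in free_g])
        rows, cols = linear_sum_assignment(np.where(cost <= thresh, cost, 1e9))
        for i, j in zip(rows, cols):
            if cost[i, j] <= thresh:  # discard forbidden pairings
                matches[free_g[i]] = free_h[j]
    return matches
```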

jvlmdr commented 3 years ago

And finally, it seems that the matlab toolkit uses ≤ 0.8 for "partially tracked" whereas py-motmetrics uses < 0.8. Here is the matlab code responsible. If I make this modification to py-motmetrics, it closes the gap completely.

I'm not sure where these metrics are officially defined. I would argue that if a track is tracked for exactly 80% of its frames, we should put it in the "mostly tracked" category, not the "partially tracked" category, but that's just my gut instinct ;)
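
For concreteness, the disagreement amounts to the boundary case below (the 0.2 lower bound is my assumption of the usual CLEAR MOT convention; only the 0.8 boundary is at issue here):

```python
def track_status(tracked_ratio, matlab_convention=True):
    """Classify a GT track by the fraction of frames in which it is tracked."""
    if matlab_convention:
        # MATLAB devkit: a track at exactly 80% still counts as PT (<= 0.8).
        if tracked_ratio > 0.8:
            return 'MT'
        if tracked_ratio >= 0.2:  # 0.2 boundary handling: my assumption
            return 'PT'
        return 'ML'
    # py-motmetrics: a track at exactly 80% counts as MT (>= 0.8).
    if tracked_ratio >= 0.8:
        return 'MT'
    if tracked_ratio >= 0.2:
        return 'PT'
    return 'ML'
```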

JonathonLuiten commented 3 years ago

Hey hey. I also found the 'only preserves from previous frame' error earlier today, and with this change my script also gets the same result on Lif_T as the official script. I'm not 100% sure if this is a feature or a bug, but it is what it is. I definitely didn't imagine it being this way, hence coding it differently, but in some ways it makes sense.

However, my script is still getting different results on this MOT20 tracker (attached).

tracker20.zip

Now I think it is definitely the preprocessing: when I use the MATLAB script's preprocessing together with my eval code, I get exactly the same results, whereas with my own preprocessing code I get 26 TPs more (out of around 1.7 million total).

Can you check whether your current version gets the same results as the MATLAB script?

MATLAB SCRIPT RESULTS:

|          | TP     | FP   | FN     | IDSW | MOTP   |
|----------|--------|------|--------|------|--------|
| MOT20-01 | 12843  | 25   | 7027   | 44   | 90.343 |
| MOT20-02 | 94320  | 289  | 60422  | 304  | 90.584 |
| MOT20-03 | 256590 | 756  | 57068  | 259  | 85.615 |
| MOT20-05 | 487082 | 2059 | 159262 | 980  | 85.293 |
| OVERALL  | 850835 | 3129 | 283779 | 1587 | 86.053 |