JonathonLuiten / TrackEval

HOTA (and other) evaluation metrics for Multi-Object Tracking (MOT).
MIT License

HOTA seems to have non-intuitive association score properties #90

Open DerekGloudemans opened 2 years ago

DerekGloudemans commented 2 years ago

Consider as a thought experiment a ground truth object and two prediction cases.

Case 1: Detections are predicted that perfectly match the ground truth object positions. The first 90% have ID a, and the last 10% have random IDs. HOTA dictates that 90% are TPAs and 10% are FPAs, so the association component of HOTA is 0.9.

Case 2: Detections are predicted that perfectly match the ground truth object positions. The first 50% have ID a, and the last 50% have ID b. HOTA dictates that 50% are TPAs and 50% are FPAs, so the association component of HOTA is 0.5.

To me, it is not intuitive that one of these cases should be vastly superior to the other; if anything, Case 2 intuitively indicates "better" performance. For a very long ground truth track, one would expect the predictions associated with that track to be fragmented (multiple IDs assigned). It seems a bit strange to score only the longest such fragment as TPAs and to score all other associated fragments as FPAs. In most hard tracking problems fragmentation is almost guaranteed. Thus, I find the association component of HOTA not that useful as defined.

Can you provide a justification for this choice, or, if I am misunderstanding something, a clarification for the actual calculation? The definitions in the original paper are at times a bit hard to follow because c and k are never explicitly defined.

JonathonLuiten commented 2 years ago

You are misunderstanding one thing: the 'not best' ID DOES contribute.

Let's take 4 cases (all with perfect detection), all having a length of 1000:
a) 500 with ID A, the remaining 500 each with a unique ID per detection.
b) 500 with ID A, the remaining 500 with ID B.
c) 900 with ID A, the remaining 100 each with a unique ID per detection.
d) 900 with ID A, the remaining 100 with ID B.

The AssA scores for these are:
a) 0.5 × 0.5 + 0.5 × 0.001 = 0.2505
b) 0.5 × 0.5 + 0.5 × 0.5 = 0.5
c) 0.9 × 0.9 + 0.1 × 0.001 = 0.8101
d) 0.9 × 0.9 + 0.1 × 0.1 = 0.82

Each of the above is calculated as: sum_IDs(percent_of_dets × correctness_of_dets_ass)

So the difference is not 0.5 vs 0.9 as you say, but 0.5 vs 0.81
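To make the arithmetic concrete, here is a minimal sketch (my own illustration, not code from TrackEval) for the toy setting above: a single ground-truth track with perfect detection and no other tracks. In that setting, the Ass-IoU of a detection carrying predicted ID k is count(k) / T, so AssA reduces to sum_k (count(k)/T)²:

```python
# Minimal sketch for a single ground-truth track of length T with perfect
# detection and no other tracks (my illustration, not TrackEval code).
# `id_counts` lists how many frames each predicted ID covers.
# Each detection with predicted ID k has Ass-IoU = count(k) / T, so
# AssA = mean over detections = sum_k (count(k) / T) * (count(k) / T).

def ass_a_single_track(id_counts):
    T = sum(id_counts)
    return sum((c / T) * (c / T) for c in id_counts)

# The four cases above (track length 1000):
print(ass_a_single_track([500] + [1] * 500))  # a) 0.2505
print(ass_a_single_track([500, 500]))         # b) 0.5
print(ass_a_single_track([900] + [1] * 100))  # c) 0.8101
print(ass_a_single_track([900, 100]))         # d) 0.82
```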

The reason for the behaviour you are highlighting is that the score is 'normalized' over the number of detections, i.e. HOTA asks 'how correct is each detection'. There is an easy adaptation that normalizes over 'tracks'/'IDs', which would have the property you desire, but it isn't really applicable in every situation.
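For illustration, one possible reading of that 'normalize over tracks/IDs' adaptation (an assumption on my part; this exact variant is not defined in the paper or the repo) is to average the per-ID Ass-IoU instead of the per-detection one, so every predicted ID counts equally regardless of its length:

```python
# Hypothetical per-ID normalization (not an official TrackEval metric):
# average Ass-IoU over predicted IDs instead of over detections.
# In the single-track, perfect-detection toy setting this collapses to
# 1 / (number of predicted IDs), so fragmentation is penalized heavily.

def ass_a_per_id(id_counts):
    T = sum(id_counts)
    return sum(c / T for c in id_counts) / len(id_counts)

# Original Case 1 (90% ID a, 10% unique IDs) vs Case 2 (50/50 split), length 1000:
print(ass_a_per_id([900] + [1] * 100))  # ~0.0099
print(ass_a_per_id([500, 500]))         # 0.5
```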

ShkarupaDC commented 2 years ago

Is there a HOTA version that gives much more similar scores for cases b and c? My task is more sensitive to the number of ID switches than to the relative lengths of the tracks.

JonathonLuiten commented 2 years ago

As you said in an earlier comment (that I can no longer see), the Fragmentation-Aware HOTA might be what you are looking for here.

sieumap43 commented 1 year ago

> As you said in an earlier comment (that I can no longer see), the Fragmentation-Aware HOTA might be what you are looking for here.

Is it implemented in your code? I cannot find it anywhere