JonathonLuiten / TrackEval

HOTA (and other) evaluation metrics for Multi-Object Tracking (MOT).
MIT License

Evaluation of tracks where not a single object was tracked #142

Closed jokober closed 5 months ago

jokober commented 10 months ago

I have two trackers:

  1. The first tracker gives results in which several sequences do not have a single object tracked. All scores for those sequences are zero (0) except for LocA, which is one.
  2. Another tracker tracks, at least to some extent, all objects.

The AssA and HOTA results of both trackers are almost the same, although one tracker clearly gives better results. Apparently the zero-score sequences are not really incorporated into the combined metric over all sequences, as trackers with such bad results still reach really good HOTA/AssA scores.

Is this intended or am I doing something wrong? What is the reason for that?

@JonathonLuiten Sorry for tagging, but I am just a few weeks from handing in my thesis and I realized this problem just now.

jokober commented 10 months ago

After studying your code, in particular how the sequence results are combined, it looks like the behavior I described results from the fact that the combined AssA, AssPr and AssRe scores are formed by a weighted average based on true positives. Since I have sequences with no true positives, these sequences get zero weight and are effectively not included in the calculation of the combined scores.

The corresponding lines of code:

https://github.com/JonathonLuiten/TrackEval/blob/12c8791b303e0a0b50f753af204249e622d0281a/trackeval/metrics/hota.py#L119-L129

https://github.com/JonathonLuiten/TrackEval/blob/12c8791b303e0a0b50f753af204249e622d0281a/trackeval/metrics/_base_metric.py#L61-L64

Of course I could change the implementation in such a way that the weighted average is formed based on gtIDs or TP+FN.
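
To make the difference between the two weightings concrete, here is a minimal sketch. It is not TrackEval's code: `weighted_average` and all per-sequence numbers are made up for illustration, with one sequence that has no true positives at all.

```python
def weighted_average(scores, weights):
    """Weighted mean of per-sequence scores; returns 0.0 if all weights are zero."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical values: sequence 0 was missed completely, sequence 1 was tracked well.
ass_a = [0.0, 0.9]   # per-sequence AssA
tp = [0, 800]        # true positive detections per sequence
fn = [400, 200]      # false negative detections per sequence

# Weighting by TP: the missed sequence has weight 0 and drops out.
ass_a_by_tp = weighted_average(ass_a, tp)

# Weighting by TP + FN (all ground-truth detections): the missed sequence counts.
gt_dets = [t + f for t, f in zip(tp, fn)]
ass_a_by_gt = weighted_average(ass_a, gt_dets)

print(f"TP-weighted AssA:      {ass_a_by_tp:.3f}")
print(f"(TP+FN)-weighted AssA: {ass_a_by_gt:.3f}")
```

With these assumed numbers the TP-weighted result equals the AssA of sequence 1, while the TP+FN weighting scales it by 1000/1400, the share of ground-truth detections that belong to sequence 1.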

Looking at the definition of AssA, AssPr and AssRe and the concept of TPAs, FNAs and FPAs, I am not entirely sure whether sequences without any true positive (a detection matched to both a predicted detection and a ground-truth detection) should have any influence on the corresponding scores. In other words: are completely missed tracks intended to reduce the AssA, AssPr and AssRe scores? Is HOTA even an appropriate metric for evaluating trackers that might miss tracks completely?

The main problem I have is that completely missed tracks get too little penalty. Since they only affect DetRe and DetA in the current implementation, I could alternatively use "Weighted HOTA" and weight the respective compositions according to my requirements.

mesllo-bc commented 5 months ago

I have also noticed this issue recently: trackers that make zero predictions on a sequence still get abnormally high combined HOTA scores.

Case in point:

Video Name                         HOTA      DetA      AssA      DetRe     DetPr     AssRe     AssPr     LocA      OWTA      HOTA(0)   LocA(0)   HOTALocA(0)
example_1                          0         0         0         0         0         0         0         0         0         0         0         0
example_2                          77.955    68.908    88.621    71.991    89.621    91.717    93.194    89.726    79.835    88.256    87.726    77.424    
COMBINED                           66.726    50.426    88.621    52.124    89.621    91.717    93.194    89.726    67.932    75.119    87.726    65.899    

Clearly the 0 scores are not affecting the combined average, which doesn't seem to make sense since the tracker had no detections to work on. In my opinion this should also reflect on the scores.

After changing the weight field to include TP + FNs instead of just TPs I get combined results that make more sense to me:

Video Name                         HOTA      DetA      AssA      DetRe     DetPr     AssRe     AssPr     LocA      OWTA      HOTA(0)   LocA(0)   HOTALocA(0)
example_1                          0         0         0         0         0         0         0         0         0         0         0         0
example_2                          77.955    68.908    88.621    71.991    89.621    91.717    93.194    89.726    79.835    88.256    87.726    77.424
COMBINED                           56.777    50.426    64.165    52.124    89.621    66.407    67.476    64.965    57.803    63.919    63.517    40.599    

I also noticed that LocA is set to output a score of 100% when there are no tracker detections according to the below code, which to me seems wrong since the tracker made zero predictions w.r.t existing ground-truths:

if data["num_tracker_dets"] == 0:
    res["HOTA_FN"] = data["num_gt_dets"] * np.ones(
        (len(self.array_labels)), dtype=float
    )
    res["LocA"] = np.ones((len(self.array_labels)), dtype=float)
    res["LocA(0)"] = 1.0
    return res

Why is this?

@JonathonLuiten could you kindly confirm whether the former configuration was intentional or if this is a potential bug? @jokober could you please confirm what you went for in the end? Thank you!

JonathonLuiten commented 5 months ago

This is not a bug. This is the correct and desired behaviour.

AssA literally measures how well the present detections are associated; it should not be weighted over non-present detections.

The overall HOTA score is adequately downweighted through the contributions in the DetA.

To understand this: HOTA^2 ~= sum_{i in TP}(Ass_IoU(i) / (TP + FN + FP))

What does this mean? You can think of HOTA**2 as the DetA score, where each TP in the numerator, instead of being given a score of exactly 1, is weighted by its 'association IoU'. Thus the association accuracy overall should only be averaged over the TPs, and this will still give correct results overall.
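
A small numeric sketch of this argument, at a single localization threshold and with made-up values (the function names and both example sequences are hypothetical, not TrackEval's API): HOTA^2 equals DetA times AssA when AssA is averaged over TPs only, so a completely missed sequence leaves the pooled AssA untouched but still lowers HOTA through DetA.

```python
import math


def det_a(tp, fn, fp):
    """Detection accuracy: TP / (TP + FN + FP)."""
    denom = tp + fn + fp
    return tp / denom if denom else 0.0


def ass_a(ass_ious):
    """Association accuracy: mean association IoU over the TPs only."""
    return sum(ass_ious) / len(ass_ious) if ass_ious else 0.0


def hota_squared(ass_ious, fn, fp):
    """HOTA^2: each TP contributes its association IoU instead of 1."""
    denom = len(ass_ious) + fn + fp
    return sum(ass_ious) / denom if denom else 0.0


# Hypothetical data: one association IoU per TP, plus FN and FP counts.
seq_missed = {"ass_ious": [], "fn": 4, "fp": 0}            # nothing tracked
seq_tracked = {"ass_ious": [0.9, 0.8, 1.0, 0.9], "fn": 1, "fp": 1}

# Pool both sequences by concatenating TPs and summing the error counts.
pooled_ious = seq_missed["ass_ious"] + seq_tracked["ass_ious"]
pooled_fn = seq_missed["fn"] + seq_tracked["fn"]
pooled_fp = seq_missed["fp"] + seq_tracked["fp"]

tracked_only = hota_squared(**seq_tracked)
pooled = hota_squared(pooled_ious, pooled_fn, pooled_fp)
pooled_det_a = det_a(len(pooled_ious), pooled_fn, pooled_fp)
pooled_ass_a = ass_a(pooled_ious)

# The factorization holds, and pooling in the missed sequence lowers HOTA^2.
assert math.isclose(pooled, pooled_det_a * pooled_ass_a)
assert pooled < tracked_only
print(f"AssA {pooled_ass_a:.3f}, DetA {pooled_det_a:.3f}, HOTA {math.sqrt(pooled):.3f}")
```

This matches the pattern in the tables above: the combined AssA stays equal to the tracked sequence's AssA, while the combined DetA and HOTA drop.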

TLDR: this is the correct behaviour and not a bug. Hope that makes sense :)

mesllo-bc commented 5 months ago

I understand your explanation. But then shouldn't localization accuracy be 0 if the detector makes no detections?

JonathonLuiten commented 5 months ago

I guess here it's kind of 'undefined' really...

It should probably say 'undefined' instead of 1.

But note that the combination over multiple sequences is 100% correct, and gives no weight at all to a sequence with no predictions.
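
A minimal sketch of why the per-sequence placeholder does not leak into the combined LocA, assuming the combination is a TP-weighted average as discussed earlier in the thread. `combine_loc_a` and the numbers are hypothetical; the `t > 0` filter is an addition here so that a NaN ("undefined") placeholder is also skipped safely.

```python
import math


def combine_loc_a(loc_a, tp):
    """TP-weighted LocA over sequences; sequences with zero TPs are skipped."""
    total_tp = sum(tp)
    if total_tp == 0:
        return math.nan  # no TP anywhere: localization accuracy is undefined
    return sum(l * t for l, t in zip(loc_a, tp) if t > 0) / total_tp


tp = [0, 500]  # hypothetical: sequence 0 has no matches, sequence 1 has 500 TPs

# Sequence 0 reports LocA = 1.0 (the current placeholder) or NaN ("undefined").
with_one = combine_loc_a([1.0, 0.85], tp)
with_nan = combine_loc_a([math.nan, 0.85], tp)

# Either way, the combined value is just the LocA of the tracked sequence.
assert math.isclose(with_one, with_nan)
print(f"combined LocA: {with_one:.3f}")
```

In a plain weighted sum the placeholder 1.0 is multiplied by a weight of 0, so it contributes nothing; only the per-sequence printout is potentially misleading.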

mesllo-bc commented 5 months ago

Never mind, I noticed that the DetA for cases with no predictions is still zero anyway, so it still seems to work as intended.

JonathonLuiten commented 5 months ago

:)