dbolya / tide

A General Toolbox for Identifying Object Detection Errors
https://dbolya.github.io/tide
MIT License

TIDE output interpretation #27

Closed jinmingteo closed 3 years ago

jinmingteo commented 3 years ago

Hi @dbolya,

I was testing TIDE with two of my models (with slightly different augmentations between them). The results are:

Model 1

mask AP @ 50: 50.43

                         Main Errors
=============================================================
  Type      Cls      Loc     Both     Dupe      Bkg     Miss  
-------------------------------------------------------------
   dAP     5.05     5.61     0.21     0.00     3.73    14.52  
=============================================================

        Special Error
=============================
  Type   FalsePos   FalseNeg  
-----------------------------
   dAP       8.64      28.71  
=============================

Model 2

mask AP @ 50: 45.71

                         Main Errors
=============================================================
  Type      Cls      Loc     Both     Dupe      Bkg     Miss  
-------------------------------------------------------------
   dAP     5.09     3.76     0.05     0.00     3.54    14.56  
=============================================================

        Special Error
=============================
  Type   FalsePos   FalseNeg  
-----------------------------
   dAP       8.75      25.02  
=============================

I am a little confused that several of the dAP values for Model 2 (with 45.71 AP), notably Loc and FalseNeg, are noticeably lower than those for Model 1 (with 50.43 AP), while Cls and Miss are about the same. Is there a good intuition or interpretation for these results? I would think Model 1 is better (given its higher AP), but TIDE seems to suggest otherwise.
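To make the comparison concrete, here is a small sketch that tabulates the per-category differences between the two summaries above. It is plain Python and not part of TIDE; the dictionary keys and the `deltas` helper are names made up for illustration, and the numbers are copied by hand from the tables in this post.

```python
# Values copied from the two TIDE summaries above (mask AP @ 50 and dAP per error type).
# The dictionary layout and key names are illustrative, not a TIDE data structure.
model_1 = {"AP50": 50.43, "Cls": 5.05, "Loc": 5.61, "Both": 0.21, "Dupe": 0.00,
           "Bkg": 3.73, "Miss": 14.52, "FalsePos": 8.64, "FalseNeg": 28.71}
model_2 = {"AP50": 45.71, "Cls": 5.09, "Loc": 3.76, "Both": 0.05, "Dupe": 0.00,
           "Bkg": 3.54, "Miss": 14.56, "FalsePos": 8.75, "FalseNeg": 25.02}


def deltas(a, b):
    """Return b - a for every key in a, rounded to two decimals."""
    return {key: round(b[key] - a[key], 2) for key in a}


diff = deltas(model_1, model_2)

# Print Model 2 minus Model 1 for each entry, with an explicit sign.
for name, value in diff.items():
    print(f"{name:>8}: {value:+.2f}")
```

Read this way, AP drops by 4.72 points while Cls, Miss and FalsePos barely move (all within about 0.1), and the larger shifts are in Loc and FalseNeg.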

dbolya commented 3 years ago

Yeah, this output seems odd to me. TIDE doesn't really work well for very small changes, because what affects AP is fairly complicated, but your change seems to have caused a large change in AP.

I guess the intuition that you can pull from this is that the change didn't actually affect any one category of error specifically, and just generally made the network better. If your change didn't target one particular subset of the error categories, then the overall AP is more meaningful.
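One way to see why the dAP columns are hard to compare directly across models: as far as I understand the TIDE paper, each dAP is the AP gained when that one error type is fixed in isolation, so the values are not expected to add up to the gap between AP and 100. The sketch below only does that arithmetic on the numbers posted above; it is plain Python, not a TIDE API, and the names `main_errors` and `unexplained_gap` are made up for illustration.

```python
# Main-error dAP values copied from the summaries posted above.
# Assumption: each dAP is measured with only that error type fixed,
# so the six values need not sum to (100 - AP).
main_errors = {
    "Model 1": {"AP50": 50.43, "dAP": [5.05, 5.61, 0.21, 0.00, 3.73, 14.52]},
    "Model 2": {"AP50": 45.71, "dAP": [5.09, 3.76, 0.05, 0.00, 3.54, 14.56]},
}


def unexplained_gap(ap, daps):
    """Part of the gap to 100 AP that the summed main-error dAPs do not cover."""
    return round(100.0 - ap - sum(daps), 2)


gaps = {}
for name, entry in main_errors.items():
    total = round(sum(entry["dAP"]), 2)
    gaps[name] = unexplained_gap(entry["AP50"], entry["dAP"])
    print(f"{name}: sum of main dAP = {total:.2f}, uncovered gap = {gaps[name]:.2f}")
```

On these numbers the summed main errors are slightly lower for Model 2 (27.00 vs 29.12) even though its AP is lower, and the part of the gap not covered by that sum is larger for Model 2. That fits the reading above: the AP difference is not attributable to any single error category.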

jinmingteo commented 3 years ago

Thanks @dbolya! I'll use overall AP as a first cut, then look at the TIDE main errors.