
How to compute mAP of tiny yolo on VOC2007-test #350

Open szm-R opened 6 years ago

szm-R commented 6 years ago

Hi everyone. The title says it all: I want to compute the mAP of tiny YOLO on VOC2007-test. I have written C++ code for this and get 39.78% mAP, whereas pjreddie reports 57.1% mAP on VOC2007-test. I first downloaded the weights using: wget https://pjreddie.com/media/files/tiny-yolo-voc.weights

Then performed detection with: ./darknet -i 0 detector valid cfg/voc.data cfg/tiny-yolo-voc.cfg models/tiny-yolo-voc.weights

I only changed the detector.c code to save the results in a different format that is easier for my code to read.

I then count all the TPs and FPs (over all classes) and compute precision-recall for 11 thresholds (from 0 to 1), and then the AP (with the formula mentioned in the Pascal VOC paper). Here is the PR curve I get: (attached PR curve image, AP = 39.7814)
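For reference, the 11-point interpolated AP from the PASCAL VOC paper that this thread keeps coming back to: precision is first interpolated as the maximum precision at any recall at or above r, then averaged over 11 fixed recall points:

    p_interp(r) = max{ p(r') : r' >= r }
    AP = (1/11) * sum of p_interp(r) for r in {0, 0.1, ..., 1.0}

so recall levels the detector never reaches contribute a precision of 0, exactly as described above.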

My real purpose is to write code that computes the AP for a model trained on my own custom data, but in order to verify it I am testing on pretrained tiny YOLO.

Thanks in advance for your help.

AlexeyAB commented 6 years ago

@szm2015 Hi,

  1. Did you try to use https://github.com/AlexeyAB/darknet/blob/master/scripts/voc_eval.py and compare its results with yours?

  2. What validation dataset did you use? Is it voc/2007_test.txt?

  3. Did you use the same approach in your C code?

szm-R commented 6 years ago

  1. I will look into it and report back, thank you.
  2. Yes.
  3. The number I reported was the AP taken over all classes, meaning that for every threshold I summed the TPs, FPs, and FNs of all classes (all counted as you described), computed the overall precision and recall for that threshold, and then computed the AP (also as you described; for points with no recall, like the recalls above 60% in the image, I set the precision to 0).

Now I tried what you suggested, computing the AP for every class and then averaging over them to get the mAP; here are the results:

AP of class "aeroplane": 44.0196 AP of class "bicycle": 47.6694 AP of class "bird": 28.4797 AP of class "boat": 20.7133 AP of class "bottle": 10.0164 AP of class "bus": 48.7149 AP of class "car": 48.0142 AP of class "cat": 49.9185 AP of class "chair": 14.9642 AP of class "cow": 38.3175 AP of class "diningtable": 34.7925 AP of class "dog": 41.3822 AP of class "horse": 48.7818 AP of class "motorbike": 37.5573 AP of class "person": 42.4456 AP of class "pottedplant": 15.4256 AP of class "sheep": 38.9875 AP of class "sofa": 19.6689 AP of class "train": 50.1913 AP of class "tvmonitor": 43.5746 mean Average Precision: 36.1818

Now it's even less!

AlexeyAB commented 6 years ago

@szm2015 Can you show your C-code for mAP?

szm-R commented 6 years ago

Hello again, I've attached the code. It's a Qt project (I use the UI for plotting). Here's an overall explanation:

In lines 57 to 112 there is a for loop over 11 thresholds (from 0 to 1). Inside it (lines 63 to 108) is a loop over the txt prediction files (which I also attached). In this loop, the detections that have a score above the threshold are stored in cv::Rect objects (along with their scores and labels); then the function "FillEvaluationsMatrix" evaluates the predictions against the ground truth and fills a confusion matrix which is initialized at the beginning of the threshold loop.

Outside the predictions loop (still inside the threshold loop), the TP and FP values are computed from the confusion matrix in the "finalEval" function (I count the total objects of every class in the ground truth labels and use that as the TP+FN value in the recall denominator). This function computes precision and recall and saves them in a matrix (named PRpairs) with 20 rows (number of classes) and 11 columns (number of thresholds), so that each class has a PR pair for every threshold at the end of the loop.

Finally, the "ComputeAPs" function computes the AP of every class from the PRpairs calculated before and averages them to get the mAP.

Detections.zip

YoloPRcurve.zip

MiZhangWhuer commented 6 years ago

@szm2015 Any progress on this issue? I also have a problem with the PR curve. I wonder why the recall (x axis) only reaches 60 rather than 100?

szm-R commented 6 years ago

Hello everyone, I haven't had time to work on this for a while. Just now I was checking voc_eval.py and came across these lines:

    if ovmax > ovthresh:
        if not R['difficult'][jmax]:
            if not R['det'][jmax]:
                tp[d] = 1.
                R['det'][jmax] = 1
            else:
                fp[d] = 1.
    else:
        fp[d] = 1.

It seems that detections are only counted as TP if their ground truth is not "difficult" and also if not R['det'][jmax]. I haven't considered either of these in my code, though I have no idea what the second one means! I would appreciate any clarification!

AlexeyAB commented 6 years ago

This code is taken from the repository of the author of Faster-RCNN detector: https://github.com/rbgirshick/py-faster-rcnn/blob/781a917b378dbfdedb45b6a56189a31982da1b43/lib/datasets/voc_eval.py#L177-L189

            overlaps = inters / uni          # IoU of this detection with every GT box of the class
            ovmax = np.max(overlaps)         # best IoU over all GT boxes
            jmax = np.argmax(overlaps)       # index of the best-matching GT box

        if ovmax > ovthresh:                 # IoU above the threshold (0.5 by default)
            if not R['difficult'][jmax]:     # GT boxes marked "difficult" are skipped entirely
                if not R['det'][jmax]:       # the GT box has not been matched yet
                    tp[d] = 1.
                    R['det'][jmax] = 1       # mark the GT box as claimed
                else:
                    fp[d] = 1.               # duplicate detection of an already-claimed GT box
        else:
            fp[d] = 1.                       # no GT box overlaps enough

Where ovthresh is the IoU threshold (0.5 by default) and R['det'][jmax] marks whether the matched ground-truth box has already been claimed by an earlier detection.

So if a ground truth is difficult, it is not taken into account in precision at all (neither as a true positive nor as a false positive). And if a ground-truth box with ovmax > 0.5 is detected again after it has already been matched, that detection counts as a false positive (fp).

10.1.1.157.5766.pdf

szm-R commented 6 years ago

Thank you @AlexeyAB for your complete explanation. I do something exactly like the R['det'][jmax] check in my code. I added the "difficult" check, but for some reason the AP got even worse; I should look into it more. Meanwhile, can you point me to the exact procedure for evaluating with voc_eval.py? Most importantly, the command line used to get the detection results, as there are several validation functions in detector.c as far as I understand.

MiZhangWhuer commented 6 years ago

Hi @szm2015 @AlexeyAB, I also plotted the PR curve following the link https://github.com/D-X-Y/caffe-faster-rcnn/blob/dev/examples/FRCNN/calculate_voc_ap.py and printed the P-R values before the mAP is computed.

  1. But I wonder why the precision values do not approach zero? Attached are my predicted bounding boxes file (PredBBoxes.txt) and the corresponding ground-truth bounding boxes file (GTBBoxes.txt). Note that none of the bounding boxes are of the "difficult" type.
  2. The P-R curve seems correct when I add the following code (see attachment: line 251 in calculate_ap.py.txt) to normalize the recall values: rec = (rec - rec.min())/(rec.max() - rec.min())

I would much appreciate it if any of you could help me solve the problems above, and I hope for further discussion of the P-R curve issues.

GTBBoxes.txt PredBBoxes.txt calculate_ap.py.txt

szm-R commented 6 years ago

Hi @MiZhangWhuer, as I am still struggling with this issue myself, I can't be of much help to you! But if I manage to solve it, I will share my results.

Now @AlexeyAB, I still don't know how to run voc_eval.py. My Python knowledge is really rusty! I first created the detection results using the following command: ./darknet -i 0 detector valid cfg/voc.data cfg/tiny-yolo-voc.cfg models/tiny-yolo-voc.weights

(Note that I'm using the pjreddie version of darknet.)

Then I had my detection results in a folder named voc inside the results directory, one file per class: className.txt

Now I added these lines at the end of the voc_eval.py code to be able to run it (told you my Python is rusty!!!):

print "Starting here"
detpath = "/path/to/results/voc/"
annopath = "/path/to/data/voc/VOC2007/Annotations/"
imagesetfile = "/path/to/data/voc/2007_test_FileNames.txt"
classname = "/path/to/data/voc/voc.names"
cachedir = "/path/to/data/voc/VOC2007/cache/"
ovthresh = 0.7
use_07_metric = True 
voc_eval(detpath, annopath, imagesetfile, classname, cachedir, ovthresh, use_07_metric)

But detpath and the others seem to be something other than simple paths, because running the code gives me the following error:

Traceback (most recent call last):
  File "voc_eval.py", line 211, in <module>
    voc_eval(detpath, annopath, imagesetfile, classname, cachedir, ovthresh, use_07_metric)
  File "voc_eval.py", line 137, in voc_eval
    with open(detfile, 'r') as f:
IOError: [Errno 21] Is a directory: '/home/szm/Work/Research/Models_and_Codes/darknet/darknet_GPU/results/voc/'

Can you please tell me how I should pass these arguments to voc_eval.py?

szm-R commented 6 years ago

Hi everyone, I figured out my last question. Now I run voc_eval.py by adding the following lines at the end:

detpath = '/path/to/darknet/darknet_GPU/results/voc/{}.txt'
annopath = '/path/to/darknet/darknet_GPU/data/voc/VOC2007/Annotations/{}.xml'
imagesetfile = '/path/to/darknet/darknet_GPU/data/voc/2007_test_FileNames.txt'
cachedir = '/path/to/darknet/darknet_GPU/data/voc/VOC2007/cache/'
ovthresh = 0.7
use_07_metric = True 
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
for classname in classes:
    rec, prec, ap = voc_eval(detpath, annopath, imagesetfile, classname, cachedir, ovthresh, use_07_metric)
    print "ClassName: %s AveragePrecision: %f" % (classname, ap)

Now I have a more fundamental question. In this code we just hand the previously generated detection files to the evaluation function, which (as far as I understand) calculates one pair of recall and precision values for every class and then calculates the AP. Shouldn't there be some kind of loop over different score thresholds (applied to the confidence) to give us the precision-recall curve to use for the AP calculation?

AlexeyAB commented 6 years ago

@szm2015 In your code use_07_metric = True. If use_07_metric is true, the voc_ap function uses the VOC 11-point method; this is for mAP. You get 3 values from the voc_eval function:

  1. rec
  2. prec
  3. ap: https://github.com/AlexeyAB/darknet/blob/9c847647a1f3c257aaa8c0d4ec718e68888984ba/scripts/voc_eval.py#L198-L200

The mAP is calculated in the file reval_voc.py: https://github.com/AlexeyAB/darknet/blob/9c847647a1f3c257aaa8c0d4ec718e68888984ba/scripts/reval_voc.py#L71-L75

Also, the mAP calculation, i.e. the 11-point method for PascalVOC: https://github.com/AlexeyAB/darknet/blob/9c847647a1f3c257aaa8c0d4ec718e68888984ba/scripts/voc_eval.py#L31-L45
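The referenced 11-point computation is essentially the following (a sketch of the voc_ap function from py-faster-rcnn that voc_eval.py is based on; rec and prec are the arrays returned by voc_eval, ordered by descending confidence):

    import numpy as np

    def voc_ap_11pt(rec, prec):
        # VOC-2007 11-point metric: sample the interpolated precision at
        # recall = 0.0, 0.1, ..., 1.0 and average the 11 values.
        ap = 0.0
        for t in np.arange(0.0, 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0.0                      # detector never reaches this recall level
            else:
                p = np.max(prec[rec >= t])   # interpolated precision at recall t
            ap += p / 11.0
        return ap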

AlexeyAB commented 6 years ago

@szm2015 @MiZhangWhuer I just added a cmd file for Windows to calculate mAP. I got 56.6% for Tiny-Yolo 416x416 on the PascalVOC 2007 test set, a little less than the 57.1% stated on the site: https://pjreddie.com/darknet/yolo/

If you use Windows and Python >= 3.5, the output looks like this:
Mean AP = 0.5666
~~~~~~~~
Results:
0.629
0.725
0.487
0.427
0.212
0.678
0.678
0.709
0.353
0.546
0.581
0.628
0.710
0.697
0.604
0.283
0.559
0.524
0.712
0.590
0.567
~~~~~~~~

--------------------------------------------------------------
Results computed with the **unofficial** Python eval code.
Results should be very close to the official MATLAB eval code.
-- Thanks, The Management
--------------------------------------------------------------

szm-R commented 6 years ago

Thank you @AlexeyAB. After digging a little more into the code I found what I had been missing: the predicted bounding boxes are ranked by their confidence scores, and recall/precision are computed at every one of these ranks. What I had been doing instead was to pick a number of thresholds (say 20), compute a PR pair for each of them (by omitting the predictions with confidence scores below the threshold in each pass), and then calculate the AP from those PR pairs. I still don't know why this way of computing AP gives such a drastically wrong result, but for now I will stick to your code. Thanks again.
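For anyone hitting the same wall, a minimal sketch of the ranked evaluation (the scheme voc_eval.py actually implements; the names here are illustrative). Every detection, sorted by descending confidence, contributes one point to the PR curve, so no threshold grid is needed:

    import numpy as np

    def pr_from_ranked(scores, is_tp, n_gt):
        # scores: confidence of each detection; is_tp: boolean array saying
        # whether the detection matched a previously unclaimed GT box;
        # n_gt: number of ground-truth boxes of this class (TP + FN).
        order = np.argsort(-scores)            # rank detections by confidence
        tp = np.cumsum(is_tp[order])
        fp = np.cumsum(~is_tp[order])
        rec = tp / float(n_gt)
        prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
        return rec, prec                       # one PR point per detection

This also suggests why a fixed-threshold sweep can come out lower: with only 11 or 20 thresholds the PR curve is sampled very coarsely, while the ranked version evaluates it at every detection.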

AlexeyAB commented 6 years ago

@szm2015 @MiZhangWhuer

I added C code for calculating mAP (mean average precision) with Darknet, for the VOC dataset and for any custom dataset. Just use the command: darknet.exe detector map data/voc.data tiny-yolo-voc.cfg tiny-yolo-voc.weights where the voc.data file should point to the validation dataset: valid = 2007_test.txt

But my implementation shows a lower value than reval_voc.py + voc_eval.py. If you find an error in my code and can fix it, let me know: https://github.com/AlexeyAB/darknet/blob/a1af57d8d60b50e8188f36b7f74752c8cc124177/src/detector.c#L498

I don't check the difficult flag of the ground truth as voc_eval.py does, but as far as I can see, voc_label.py already removes difficult objects at the labeling stage: https://github.com/AlexeyAB/darknet/blob/a1af57d8d60b50e8188f36b7f74752c8cc124177/scripts/voc_label.py#L37-L38
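The referenced filter in voc_label.py is essentially the following (quoting the standard VOC conversion script; the two linked lines are the cls/difficult check):

    # inside convert_annotation(), for every <object> in the VOC XML:
    difficult = obj.find('difficult').text
    cls = obj.find('name').text
    if cls not in classes or int(difficult) == 1:
        continue    # difficult objects never make it into the Yolo txt labels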


class = 0, name = aeroplane,     ap = 61.01 %
class = 1, name = bicycle,       ap = 71.18 %
class = 2, name = bird,          ap = 47.84 %
class = 3, name = boat,          ap = 40.23 %
class = 4, name = bottle,        ap = 20.88 %
class = 5, name = bus,   ap = 67.68 %
class = 6, name = car,   ap = 66.21 %
class = 7, name = cat,   ap = 70.46 %
class = 8, name = chair,         ap = 33.77 %
class = 9, name = cow,   ap = 54.15 %
class = 10, name = diningtable,          ap = 55.45 %
class = 11, name = dog,          ap = 62.47 %
class = 12, name = horse,        ap = 71.24 %
class = 13, name = motorbike,    ap = 68.72 %
class = 14, name = person,       ap = 59.28 %
class = 15, name = pottedplant,          ap = 27.54 %
class = 16, name = sheep,        ap = 54.45 %
class = 17, name = sofa,         ap = 50.07 %
class = 18, name = train,        ap = 70.83 %
class = 19, name = tvmonitor,    ap = 58.63 %

 mean average precision (mAP) = 0.556050, or 55.61 %
Total Detection Time: 56.000000 Seconds

For darknet.exe detector map data/voc.data yolo-voc.cfg yolo-voc.weights with width=544 height=544, I got mAP = 75.77%, while the site and the article state 78.6% (page 4, table 3): https://arxiv.org/pdf/1612.08242v1.pdf The value is lower because:

class = 0, name = aeroplane,     ap = 80.84 %
class = 1, name = bicycle,       ap = 84.10 %
class = 2, name = bird,          ap = 75.03 %
class = 3, name = boat,          ap = 65.30 %
class = 4, name = bottle,        ap = 55.22 %
class = 5, name = bus,   ap = 83.66 %
class = 6, name = car,   ap = 84.53 %
class = 7, name = cat,   ap = 88.20 %
class = 8, name = chair,         ap = 58.35 %
class = 9, name = cow,   ap = 80.53 %
class = 10, name = diningtable,          ap = 69.81 %
class = 11, name = dog,          ap = 84.07 %
class = 12, name = horse,        ap = 86.17 %
class = 13, name = motorbike,    ap = 83.33 %
class = 14, name = person,       ap = 78.44 %
class = 15, name = pottedplant,          ap = 50.86 %
class = 16, name = sheep,        ap = 77.36 %
class = 17, name = sofa,         ap = 71.74 %
class = 18, name = train,        ap = 82.96 %
class = 19, name = tvmonitor,    ap = 74.95 %

 mean average precision (mAP) = 0.757728, or 75.77 %
Total Detection Time: 214.000000 Seconds

szm-R commented 6 years ago

I think you should still consider difficult objects, because there may be cases where the model detects a difficult object, and since that object is not listed in the ground truth by voc_label.py, the code will count the detection as a false positive when it should not, which decreases precision. I think this explains the small difference between the mAP of the Python code and the C code (the Python code reads the ground truth directly from the xml files).

AlexeyAB commented 6 years ago

@szm2015 Yes, I think that can have an influence.

Maybe I'll add a separate Python script, voc_eval_difficult.py, that creates txt files for Yolo with the labels (coordinates) of the difficult objects from the XML files of the PascalVOC dataset, and use these txt files to exclude difficult objects from the TP and FP calculation.

AlexeyAB commented 6 years ago

  1. I added a Python script to get a list of images and labels with difficult objects; it generates the difficult_2007_test.txt file: https://github.com/AlexeyAB/darknet/blob/master/scripts/voc_label_difficult.py This file should be set here (without the #): https://github.com/AlexeyAB/darknet/blob/65bff2683bdffe7ec82eacd8144c70c09d19c88d/build/darknet/x64/data/voc.data#L4

  2. Then darknet.exe detector map data/voc.data tiny-yolo-voc.cfg tiny-yolo-voc.weights gives 56.21% (while reval_voc.py and voc_eval.py give 56.6%, diff = 0.39):
class = 0, name = aeroplane,     ap = 61.05 %
class = 1, name = bicycle,       ap = 71.58 %
class = 2, name = bird,          ap = 48.26 %
class = 3, name = boat,          ap = 40.61 %
class = 4, name = bottle,        ap = 20.92 %
class = 5, name = bus,   ap = 68.13 %
class = 6, name = car,   ap = 66.48 %
class = 7, name = cat,   ap = 70.46 %
class = 8, name = chair,         ap = 35.08 %
class = 9, name = cow,   ap = 55.10 %
class = 10, name = diningtable,          ap = 58.06 %
class = 11, name = dog,          ap = 62.59 %
class = 12, name = horse,        ap = 71.42 %
class = 13, name = motorbike,    ap = 69.23 %
class = 14, name = person,       ap = 59.74 %
class = 15, name = pottedplant,          ap = 27.80 %
class = 16, name = sheep,        ap = 55.32 %
class = 17, name = sofa,         ap = 52.50 %
class = 18, name = train,        ap = 70.84 %
class = 19, name = tvmonitor,    ap = 59.13 %

 mean average precision (mAP) = 0.562140, or 56.21 %

  3. darknet.exe detector map data/voc.data yolo-voc.cfg yolo-voc.weights gives 76.94% (while reval_voc.py and voc_eval.py give 77.1%, diff = 0.16):

    class = 0, name = aeroplane,     ap = 80.89 %
    class = 1, name = bicycle,       ap = 84.36 %
    class = 2, name = bird,          ap = 76.10 %
    class = 3, name = boat,          ap = 66.57 %
    class = 4, name = bottle,        ap = 55.50 %
    class = 5, name = bus,   ap = 84.11 %
    class = 6, name = car,   ap = 85.80 %
    class = 7, name = cat,   ap = 88.31 %
    class = 8, name = chair,         ap = 61.29 %
    class = 9, name = cow,   ap = 82.67 %
    class = 10, name = diningtable,          ap = 72.38 %
    class = 11, name = dog,          ap = 84.46 %
    class = 12, name = horse,        ap = 86.54 %
    class = 13, name = motorbike,    ap = 83.92 %
    class = 14, name = person,       ap = 79.27 %
    class = 15, name = pottedplant,          ap = 51.84 %
    class = 16, name = sheep,        ap = 78.71 %
    class = 17, name = sofa,         ap = 75.63 %
    class = 18, name = train,        ap = 83.19 %
    class = 19, name = tvmonitor,    ap = 77.15 %
    
    mean average precision (mAP) = 0.769353, or 76.94 %

So we now lose somewhere between 0.16 and 0.39% of mAP :)