david-vazquez / mcv-m5

Master in Computer Vision - M5 Visual recognition

Eval means inputparams #26

Open bluque opened 7 years ago

bluque commented 7 years ago

This script outputs the metrics of each batch of 128 images, but I have added the computation of the mean of these metrics in order to have a global evaluation of the model.

I also propose an improvement over #25 where the model and dataset names are passed as arguments when calling the script. This way, we don't need to modify the script for every new evaluation:

```
python eval_detection_fscore.py model_name dataset_name weights_file path_to_images
```
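A minimal sketch of how the script's entry point could read these positional arguments with `sys.argv` (the variable names and comments here are illustrative assumptions, not necessarily the exact ones in the PR):

```python
import sys

# Hypothetical argument handling; the actual names in the PR may differ.
if len(sys.argv) != 5:
    print('Usage: python eval_detection_fscore.py model_name dataset_name weights_file path_to_images')
    sys.exit(1)

model_name   = sys.argv[1]  # e.g. 'yolo' or 'tiny_yolo'
dataset_name = sys.argv[2]  # which dataset configuration to load
weights_file = sys.argv[3]  # path to the trained weights
test_dir     = sys.argv[4]  # directory with the images to evaluate
```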

The only reason I changed the way the model is built when distinguishing between yolo and tiny yolo is so it's easier to add more models in the future (just add another elif model_name == ...). It's not a significant change; the result is the same.
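The model-selection change could look roughly like the sketch below. build_yolo and build_tiny_yolo are placeholders for the repo's actual model constructors, not the real function names:

```python
def build_model(model_name, num_classes):
    """Select the architecture from the command-line model name.

    Adding a new architecture only needs one more elif branch.
    """
    if model_name == 'yolo':
        return build_yolo(num_classes)        # placeholder constructor
    elif model_name == 'tiny_yolo':
        return build_tiny_yolo(num_classes)   # placeholder constructor
    else:
        raise ValueError('Unknown model name: %s' % model_name)
```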

lluisgomez commented 7 years ago

@bluque Thanks for the pull request!

Overall I like the changes you propose for the input arguments of model and dataset names.

However, I do not see the point of the "averaged metrics". In the original code, what is printed on lines 125 to 128 is the "running" precision, recall, and f-score, not the metrics for each batch.

For example, the variable "ok" is initialized to zero on line 66, outside the loop, and is never reset to zero again; we only increase its value every time we find a correct detection.

The same goes for the variables "total_pred" and "total_true".

So the metrics that are shown are the metrics for all the images evaluated so far, and when the script finishes they are the metrics for the whole dataset, right?
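In other words, the counters are cumulative across the whole loop, something like this simplified sketch (load_ground_truth, run_detector, and count_matches are placeholder helpers standing in for the script's actual loading, inference, and matching code):

```python
ok = 0          # correct detections, accumulated over ALL images so far
total_true = 0  # ground-truth objects seen so far
total_pred = 0  # detections produced so far

for batch in batches:                             # each chunk of 128 images
    for image in batch:
        gt_boxes = load_ground_truth(image)       # placeholder helper
        det_boxes = run_detector(model, image)    # placeholder helper
        total_true += len(gt_boxes)
        total_pred += len(det_boxes)
        ok += count_matches(det_boxes, gt_boxes)  # placeholder matching

    # Cumulative ("running") metrics: after the last batch they already
    # cover the whole dataset, so no extra averaging step is needed.
    precision = ok / float(total_pred) if total_pred else 0.
    recall = ok / float(total_true) if total_true else 0.
    fscore = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.)
    print('Precision = %.4f, Recall = %.4f, F-score = %.4f'
          % (precision, recall, fscore))
```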

On the other hand, be careful with one thing: an averaged metric (per batch) as you propose is not always meaningful. Imagine we evaluate only two "batches" of 128 images each. In the first batch there is only one object in one of the images (all the other images contain no objects) and the model we are evaluating misses it, so the recall for this "batch" is zero. Then imagine that in the second batch there are 200 objects and the model correctly detects all of them, so the recall for the second batch is 100%. If you take the mean of these two recall values you get a final recall of 50%, while the model correctly detected 200 objects out of 201 :) so the final recall should be about 99.5%. Do you see the point? We must average over the total number of objects in the ground truth, not over the number of images or batches.
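A quick numerical check of that example, comparing the per-batch mean with the recall computed over all ground-truth objects:

```python
# Batch 1: 1 ground-truth object, 0 correct detections   -> batch recall = 0.0
# Batch 2: 200 ground-truth objects, 200 correct          -> batch recall = 1.0
mean_of_batch_recalls = (0.0 + 1.0) / 2                # 0.5, the misleading value
recall_over_all_objects = (0 + 200) / float(1 + 200)   # ~0.995, the correct value
print(mean_of_batch_recalls, recall_over_all_objects)
```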

Please let me know if this is clear... I've double-checked the code and I think it's correct as it is. In any case, it's always good to question things that are not clear and make sure they are correct.

I'm also open to changing the code, for example to print "Running precision" instead of "Precision", etc., and then print the final metrics once the main loop is finished. Maybe this helps avoid confusion.
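That change could be as small as the following sketch, reusing the cumulative counters from the loop above (the exact format strings are just illustrative):

```python
# Inside the main loop, after updating ok / total_pred / total_true:
print('Running precision = %.4f, recall = %.4f, f-score = %.4f'
      % (precision, recall, fscore))

# After the loop, the same cumulative counters give the final numbers:
print('FINAL precision = %.4f, recall = %.4f, f-score = %.4f'
      % (precision, recall, fscore))
```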

bluque commented 7 years ago

Yes, you are right! I didn't check in detail how the metrics were computed because I thought they referred to each batch. In any case, it is true that averaging those metrics wouldn't be accurate anyway. I will make the modifications you propose :)