Totally agree. We currently have a few transforms in the grammar for ad hoc visualization. For example:
$ br -algorithm "Open+Cascade(FrontalFace)+ASEFEyes+Draw+Show" -enroll data/MEDS/img/S001-01-t10_01.jpg
There is also the EditTransform for manually editing landmarks, but I think it currently only supports deleting them :) You can find all the relevant code in sdk/plugins/draw.cpp and sdk/plugins/misc.cpp. Currently there is no real GUI support otherwise. If you're ready to start implementing more quantitative feedback, let me know and I can suggest a couple of ways to implement it.
I'm finally ready to take on this ticket. The approach I'm proposing is to introduce br -evalDetection and br -evalLandmarking that act similarly to br -evalClassification and br -evalRegression in that they take a file of ground truth detections/landmarkings and a file of computed detections/landmarkings and print the error statistics to the terminal (a rough example invocation is sketched below). Sound good?
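For concreteness, a hypothetical invocation might look like the following (file names and the exact argument order are illustrative; the implemented interface may differ):
$ br -evalDetection predicted_faces.csv ground_truth_faces.csv
$ br -evalLandmarking predicted_landmarks.csv ground_truth_landmarks.csv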
@jklontz Great to hear. I agree with the proposed interface.
I suppose the next decision will be the format with which to represent annotated faces and landmark sets, where annotations could be manual/ground truth, or automated.
As far as formats go, we have free support for all the existing Gallery formats (including .xml and .csv). If we need any other formats we can simply add new plugins. I'll take a look at what you do right now in your Java code and see if it works out of the box; otherwise I'll see if I can find a way to support it, or in the worst case change it.
Want to weigh in on what error metrics to support?
Face detector metrics:
- Percent of faces correctly detected (requires an overlap parameter)
- Number of false positives

Notes: Classic Type I and Type II error analysis is difficult here b/c every single sliding window location is essentially a classification stat. Obviously this dilutes the results. Thus, we can still measure Type II error (i.e., rate of false negatives), but the Type I error (i.e., rate of false positives) is instead reported as the "number" (not rate) of false positives. The overlap parameter is the percentage by which two bounding boxes need to overlap to be considered a match (e.g., 75%).
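As a rough illustration of the overlap check, a sketch only (not OpenBR's implementation), assuming "overlap" means intersection-over-union, which may differ from the final definition:

```cpp
#include <opencv2/core/core.hpp>

// Returns true if a detected box matches a ground truth box, treating the
// overlap parameter as an intersection-over-union threshold (e.g., 0.75).
bool isMatch(const cv::Rect &detected, const cv::Rect &truth, float overlapThreshold = 0.75f)
{
    const float intersection = (detected & truth).area(); // area of the overlapping region
    const float unionArea = detected.area() + truth.area() - intersection;
    return (unionArea > 0) && (intersection / unionArea >= overlapThreshold);
}
```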
Landmark detector metrics:
- Total IPD-normalized mean landmark distance
- Per-landmark, IPD-normalized mean landmark distance

Notes: This is also referred to as the 'bi-ocular' distance metric. See Section 5.1 of the paper below for a further description.
Sun, Yi, Xiaogang Wang, and Xiaoou Tang. "Deep Convolutional Network Cascade for Facial Point Detection." IEEE CVPR 2013 http://www.ee.cuhk.edu.hk/~xgwang/papers/sunWTcvpr13.pdf
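For reference, a minimal sketch of the IPD-normalized mean landmark distance described above; the eye indices and helper function are assumptions for illustration, not OpenBR code:

```cpp
#include <opencv2/core/core.hpp>
#include <cmath>
#include <vector>

static float dist(const cv::Point2f &a, const cv::Point2f &b)
{
    const cv::Point2f d = a - b;
    return std::sqrt(d.x * d.x + d.y * d.y);
}

// Mean landmark distance, normalized by the ground truth inter-pupillary distance (IPD).
float normalizedLandmarkError(const std::vector<cv::Point2f> &predicted,
                              const std::vector<cv::Point2f> &truth,
                              int leftEyeIndex, int rightEyeIndex)
{
    const float ipd = dist(truth[leftEyeIndex], truth[rightEyeIndex]);
    float sum = 0;
    for (size_t i = 0; i < truth.size(); i++)
        sum += dist(predicted[i], truth[i]) / ipd;
    return sum / truth.size();
}
```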
What do you think about evaluating detection performance using a precision/recall curve?
This is a good suggestion. Precision/recall is not (generally) sensitive to the fact that each sliding window location is a classification instance b/c it only uses TP, FP, and FN (i.e., it does not use TN). The only downside for me is that I often find precision/recall hard to read from a semantic interpretation standpoint, but this may just be me.
You inadvertently bring up a good point that our evaluation should support a continuous/thresholdable output. The popular FDDB benchmark uses a modified ROC curve where, instead of the false positive rate, it uses the total number of false positives (again, b/c each location is a classification point).
Perhaps we could use precision/recall and the modified ROC plot as our metrics for (face) detection.
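To make the point about not needing TN concrete, a toy sketch of the two quantities (struct and function names are illustrative):

```cpp
// Precision and recall from raw detection counts; true negatives never appear.
struct DetectionCounts { int truePositives, falsePositives, falseNegatives; };

float precision(const DetectionCounts &c)
{
    return float(c.truePositives) / (c.truePositives + c.falsePositives);
}

float recall(const DetectionCounts &c)
{
    return float(c.truePositives) / (c.truePositives + c.falseNegatives);
}
```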
It sounds like what we want, at least for evaluating detection performance, is a setup similar to how we evaluate recognition:
- Generate an intermediate results .csv file for a detector, and print out a few numeric statistics to the terminal.
- Plot one or more results files together to compare detector performance.

The three plots that come to mind immediately are:
1/2. FDDB's false positive count vs. true positive rate (both continuous and discrete versions).
3. Histogram of detection bounding box overlap percentage.

Sound good?
Yes, sounds good. Down the road, we can use the CSV to generate precision/recall curves should we want to present these stats later.
Oh sorry, forgot about that curve -- I'll add it in too.
The detection half of this ticket has been implemented and described on the mailing list.
@jklontz Thanks for getting this working. Everything looks really good so far.
Per our discussion, we will also need to ingest FDDB files in the 'ellipse' format. In this format, a face entry is specified as: major_axis_radius minor_axis_radius angle center_x center_y confidence
To maintain consistency with our other (Java) code, we will want to convert this to a rect in the following manner:

rect.x = center_x
rect.y = center_y
rect.width = minor_axis * 2
rect.height = rect.width
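A minimal sketch of that conversion, assuming OpenCV's cv::Rect_ and a hypothetical parsing function (not the final plugin code):

```cpp
#include <opencv2/core/core.hpp>
#include <sstream>
#include <string>

// FDDB ellipse entry: major_axis_radius minor_axis_radius angle center_x center_y confidence
cv::Rect_<float> fddbEllipseToRect(const std::string &line)
{
    float majorAxis, minorAxis, angle, centerX, centerY, confidence;
    std::istringstream ss(line);
    ss >> majorAxis >> minorAxis >> angle >> centerX >> centerY >> confidence;

    cv::Rect_<float> rect;
    rect.x = centerX;           // per the convention above
    rect.y = centerY;
    rect.width = minorAxis * 2;
    rect.height = rect.width;   // square bounding box
    return rect;
}
```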
If anyone has an issue with this conversion, just let me know. It will not be difficult to change our other code if this is the case.
@bklare Can you weigh in a bit more on how you'd like to see the landmarking accuracy plots? I know we talked about plotting % of data samples with error under a certain fraction of IPD. My concern is that we have too many landmarks to allow for meaningful comparison between different landmarks and different algorithms on the same figure. I think we can get all the important information on a single plot using boxplots, assigning different colors to different algorithms, and faceting by landmark location. Also, do you envision each landmark to be individually named or simply numbered from 1 to n?
@jklontz Regarding the plots for percent of samples with accuracy under a certain (normalized) distance, I think this should be reflective of the mean distance over all landmarks. This way there is only one plot. While one critique of this approach is that it may be dirtied by having too many landmarks, I would think such cases mean that the underlying protocol needs to be changed (e.g., conduct experiments on smaller subsets of more meaningful landmarks). Does what I am saying make sense, and, if so, do you agree?
Regarding names, I think simple numbering is fine. The only case where I could see this changing is if we wanted to use anthropometric landmark naming. That can be dealt with at a later date should the need arise.
@bklare I hadn't considered this mean approach, though I agree with your justification. I'll implement both this and the boxplot scheme I proposed.
Great. And good call on the box plot.
Finished implementing in 4c3ee28f33d30b8d6ab596065c492f4f11fb49fd, and described in https://groups.google.com/forum/#!topic/openbr-dev/dOdkcD4YL0E . Finally done with this :)
It would be useful to be able to analyze face and landmark detection results in OpenBR. At the least, painting bounding boxes around detected regions could help for qualitatively comparing different detection methods and parameters. More ambitiously, it would be good to provide quantitative results regarding the accuracy of detection (based on ground truth information provided).
Perhaps some of this functionality already exists. If not, then I would be interested in developing such functionality (unless someone else wants to).
@jklontz - Do you have any issues with such tools being built into OpenBR? Detection and alignment are currently the biggest bottlenecks in terms of accuracy for this system. Without such tools it will be difficult to improve these stages.