ProjectSidewalk / sidewalk-cv-assets19

Repo for our ASSETS'19 paper applying ResNet to Project Sidewalk data

How do we determine "correctness" with sliding window? #23

Open jonfroehlich opened 5 years ago

jonfroehlich commented 5 years ago

How do we determine "correctness" with our sliding window algorithm? That is, how much of the target artifact must exist in the sliding window to say that a classification is correct or incorrect? Perhaps another way of stating this: how much pixel overlap must there be between the sliding-window crop and our target to consider it part of the target? We have discussed this before in person, but I can't recall what our decision was or its rationale.
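
For concreteness, here is a minimal sketch of the kind of overlap measure I mean; the helper and the bounding-box representation are purely illustrative, not something in our repo. The open question is what threshold on this fraction should count as "correct".

```python
# Hypothetical helper: given a sliding-window crop and a target's bounding box
# (both as (x_min, y_min, x_max, y_max) in pixels), what fraction of the target
# falls inside the crop?

def target_overlap_fraction(crop, target):
    """Return the fraction of the target's area covered by the crop (0.0 to 1.0)."""
    ix_min = max(crop[0], target[0])
    iy_min = max(crop[1], target[1])
    ix_max = min(crop[2], target[2])
    iy_max = min(crop[3], target[3])
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    return intersection / target_area if target_area else 0.0

# e.g. a 200x200 crop covering one corner of a 100x100 target:
# target_overlap_fraction((0, 0, 200, 200), (150, 150, 250, 250)) == 0.25
```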

galenweld commented 5 years ago

Had a good discussion on this topic in person today, but I wanted to post here for continuity, and, importantly, to provide updates.

Today, I completely rewrote the 'scoring' system for computing correct and incorrect labels given a set of ground truths and a set of predictions. This change makes two significant improvements:

  1. Once a predicted label has been tagged 'correct' based on a nearby ground truth label, the corresponding ground truth label is removed from the set of ground truth labels, so that no other predictions can be marked correct using it. This prevents "double-counting" of corrects, where a single ground truth label near two predictions could otherwise mark both of them correct, or vice versa. In the case of multiple ground truth labels being near a prediction, the closest one is used. Technically speaking, the greedy algorithm I use to assign predictions to ground truth labels is not guaranteed to find a globally optimal assignment; however, that seems unlikely to matter much, and the failure mode is that we slightly undersell our performance.
  2. The "correctness distance" threshold from a prediction to a ground truth label is now set "dynamically," so that it is proportional to the depth of that ground truth label, using the same algorithm as the depth-proportional cropping. This distance is adjustable via a tuning parameter I've been calling the correctness-radius, which is expressed as a fraction of the crop size. I've been (totally arbitrarily) using 0.9 – i.e., if a curb ramp is located at a depth such that the crop size prediction algorithm assigns it a "width" of 100 pixels, I will mark a prediction as correct if it is within 0.9 * 100 = 90 pixels of the curb ramp's label. This dynamic thresholding can be overridden with a static distance threshold instead, if desired. (A rough sketch of the matching and thresholding logic follows this list.)
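
To make the above concrete, here is a rough sketch of the greedy matching with a depth-proportional threshold. The function and field names are hypothetical, and the real scoring code may differ in its details:

```python
# Illustrative sketch only; not the repo's actual scoring implementation.

def score_predictions(predictions, ground_truths, correctness_radius=0.9,
                      static_threshold=None):
    """Greedily match each prediction to its closest unmatched ground truth label.

    predictions   : list of (x, y, label) tuples in pixel coordinates.
    ground_truths : list of (x, y, label, crop_size) tuples, where crop_size is the
                    depth-proportional crop width assigned to that label.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(ground_truths)
    true_positives = 0
    false_positives = 0

    for px, py, p_label in predictions:
        best, best_dist = None, float("inf")
        for gt in unmatched:
            gx, gy, g_label, crop_size = gt
            if g_label != p_label:
                continue
            # Dynamic threshold: a fraction of the depth-proportional crop size,
            # unless a static pixel threshold is supplied instead.
            threshold = (static_threshold if static_threshold is not None
                         else correctness_radius * crop_size)
            dist = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            if dist <= threshold and dist < best_dist:
                best, best_dist = gt, dist
        if best is not None:
            true_positives += 1
            unmatched.remove(best)  # this ground truth can't be matched again
        else:
            false_positives += 1

    false_negatives = len(unmatched)  # ground truths no prediction claimed
    return true_positives, false_positives, false_negatives
```

The per-prediction greedy assignment is what makes the result not guaranteed to be globally optimal; an optimal assignment could only match as many or more pairs, so the greedy version can only undersell performance.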

In addition to the above correctness-radius threshold, there is one other parameter that can be tuned: the clipping-value. This value (which defaults to None) tells the system to ignore predictions whose 'strength' is less than this value. For example: the model outputs, for each crop, a vector of length 5, where the 0th position corresponds to a missing curb ramp, the 1st position to a null crop, and so on for all 5 feature types. The predicted label is assigned by computing the argmax of this vector, e.g. [4.8, 0.07, 2.3, -1.23, 1.1] would be assigned label 0 – a missing curb ramp. The clipping value simply discards predictions whose maximum value falls below it, so, in this example, a clipping value of 4 would keep this prediction, and a clipping value of 5 would ignore it. (A small sketch of this follows below.)
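
Here is a minimal sketch of that argmax-plus-clipping behaviour. Only the first two label indices are stated above, so the remaining label names are placeholders:

```python
import numpy as np

# Label indices 0 and 1 are as described above; the other names are placeholders.
LABELS = ["missing_curb_ramp", "null_crop", "label_2", "label_3", "label_4"]

def predict_label(scores, clipping_value=None):
    """Return the argmax label index for one crop, or None if it is too 'weak'."""
    scores = np.asarray(scores)
    best = int(np.argmax(scores))
    if clipping_value is not None and scores[best] < clipping_value:
        return None  # prediction ignored: its strength is below the clipping value
    return best

scores = [4.8, 0.07, 2.3, -1.23, 1.1]
print(predict_label(scores))                    # 0 (missing curb ramp)
print(predict_label(scores, clipping_value=4))  # 0 (4.8 >= 4, kept)
print(predict_label(scores, clipping_value=5))  # None (4.8 < 5, ignored)
```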

To summarize, the tunable parameters are: the correctness radius, expressed as a fraction of the size of the ground truth feature; and the clipping value, used to ignore predictions with 'weak' strength.

Using the above system, I have computed results with our latest model (resnet_extended_18, using both geo- and positional features, and trained on the sliding_window dataset) on the ground truth dataset of ~1000 labels created by @infrared0 and me.

Using the clipping value parameter, it is possible to vary the trade-off between precision and recall by making the model more or less 'sensitive'. Doing so, I created a quick-and-dirty precision-recall curve for each feature type, as well as overall, and I'm quite pleased – while far from perfect, this is better than I expected us to do!
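
For reference, the curve was traced by re-scoring the same set of predictions at a range of clipping values. Below is a hedged sketch of that sweep, reusing the hypothetical score_predictions helper from the earlier sketch; it is not the repo's actual evaluation code, and the (x, y, label, max_score) layout is assumed:

```python
# Sketch of sweeping the clipping value to trace a precision-recall curve.
import numpy as np

def precision_recall_sweep(scored_predictions, ground_truths, clip_values):
    """scored_predictions: list of (x, y, label, max_score) tuples."""
    points = []
    for clip in clip_values:
        # Keep only predictions whose maximum score clears the clipping value.
        kept = [(x, y, label) for x, y, label, score in scored_predictions
                if score >= clip]
        tp, fp, fn = score_predictions(kept, ground_truths)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((clip, precision, recall))
    return points

# e.g. sweep 21 clipping values between 0 and 10:
# for clip, p, r in precision_recall_sweep(preds, gts, np.linspace(0, 10, 21)):
#     print(f"clip={clip:.1f}  precision={p:.3f}  recall={r:.3f}")
```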

[Screenshot, 2019-04-27: quick-and-dirty precision-recall curves, per feature type and overall]

In particular, notice that we do pretty well on curb ramps: 62.83% precision at 63.62% recall compares favorably to Project Sidewalk's human labelers' ~71% precision and ~63% recall. This is, of course, an unfair comparison because I'm only including our best feature type here, but it's not unfair to say that on the labeling task we approach human performance on curb ramps, with further to go for the other feature types. I'd say that's pretty good!

The above graph, of course, is low-resolution and was cobbled together in a hurry – I wanted to go home tonight, but I'll sample more densely tomorrow and put together a nicer graph.

jonfroehlich commented 5 years ago

I want to make sure that this is properly captured in the ASSETS paper, so I'm leaving this open for now.