dssg / matching-tool

Integrating HMIS and criminal-justice data

Develop evaluation methods for matching models #23

Open · ecsalomon opened this issue 7 years ago

ecsalomon commented 7 years ago

We will want to compare, select, and evaluate matching models. This requires generating and storing metrics (see https://github.com/dssg/pgdedupe/issues/20 for some possibilities) and, perhaps, comparing Type I and Type II error rates on labeled pairs not used in the training data (see #20).
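A minimal sketch of what the held-out evaluation could look like. The `labeled_pairs` DataFrame and its `score` and `is_match` columns are hypothetical names for illustration, not the project's actual schema:

```python
import pandas as pd

def error_rates(labeled_pairs: pd.DataFrame, threshold: float) -> dict:
    """Compute Type I / Type II error rates on held-out labeled pairs.

    Assumes one row per candidate pair with a ground-truth `is_match`
    boolean and a model `score`; both column names are placeholders.
    """
    predicted_match = labeled_pairs["score"] >= threshold
    actual_match = labeled_pairs["is_match"]

    false_positives = (predicted_match & ~actual_match).sum()
    false_negatives = (~predicted_match & actual_match).sum()

    return {
        # Type I: true non-matches the model incorrectly linked
        "type_i_rate": false_positives / max((~actual_match).sum(), 1),
        # Type II: true matches the model missed
        "type_ii_rate": false_negatives / max(actual_match.sum(), 1),
    }
```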

This will likely entail storing metrics in a metrics table and a notebook/methods/workflow for conducting comparisons and evaluations.

ecsalomon commented 7 years ago

Many of these metrics will have cluster score thresholds (see #26), so the metrics table should be similar in shape to the triage results evaluations table.
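For concreteness, here is one hypothetical shape for that table, keyed the way the triage evaluations table is keyed: one row per model, metric, and parameter (here, the cluster score threshold). All table and column names below are placeholders, not an actual schema:

```python
# Hypothetical DDL for the metrics table; every identifier is a
# placeholder chosen for illustration.
CREATE_MATCH_METRICS = """
CREATE TABLE IF NOT EXISTS match_metrics (
    matcher_id   INTEGER   NOT NULL,  -- which matching model/run
    metric       TEXT      NOT NULL,  -- e.g. 'type_ii_rate'
    threshold    REAL      NOT NULL,  -- cluster score cutoff (see #26)
    value        REAL      NOT NULL,
    evaluated_at TIMESTAMP NOT NULL DEFAULT now(),
    PRIMARY KEY (matcher_id, metric, threshold)
);
"""
```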

ecsalomon commented 7 years ago

Metrics:

thcrock commented 6 years ago

@nanounanue here are ideas for metrics

thcrock commented 6 years ago

From Joe:

- [ ] recall: number of unique persons identified
  This is one way to check whether the model is not matching enough people. E.g., if we don't match anyone (i.e., we assume every event is for a separate person), we'll probably get a ridiculous number of people in the data. We might even get more people than live in the jurisdiction.
- [ ] Measure of variation on the number of persons identified
- [ ] Maximum number of person events
  To understand what I mean, think of the extreme case where we say all records belong to a single person. That person would have more events than is reasonable, e.g., 1 person with 10,000 jail bookings. This can help provide a check on the quality of the matches.
- [ ] Number of times the user says the model made a mistake
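These counts are cheap to compute from matched results. A minimal sketch, assuming an events DataFrame with a hypothetical `matched_id` column assigning each event to a person cluster; the user-reported mistake count is passed in separately:

```python
import pandas as pd

def cluster_sanity_metrics(events: pd.DataFrame,
                           mistakes_reported: int = 0) -> dict:
    """Cluster-level sanity checks on a matched events table.

    Assumes one row per event and a `matched_id` column (a placeholder
    name) identifying the person cluster each event was assigned to.
    """
    events_per_person = events.groupby("matched_id").size()
    return {
        # Implausibly many unique persons (e.g., more than live in the
        # jurisdiction) suggests under-matching; too few suggests
        # over-matching
        "n_unique_persons": int(events_per_person.count()),
        "events_per_person_std": float(events_per_person.std()),
        # One person with 10,000 jail bookings signals bad merges
        "max_events_per_person": int(events_per_person.max()),
        # Count of times the user flagged a match as a mistake
        "user_reported_mistakes": mistakes_reported,
    }
```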