dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.16k stars, 551 forks

Blocking as a feature for scoring #1103

Open fgregg opened 2 years ago

fgregg commented 2 years ago

Right now, blocking and scoring are two distinct phases.

All the information about how two records came to be blocked together is unused by the scorer. This is a bit silly, as the fact that two records are blocked together by multiple predicates could be a pretty good indicator of co-reference.

I'm not sure what the best way to take advantage of blocking information in scoring is, though.

A few ideas:

  1. Ensemble model: treat each blocking predicate as a classifier, and put them in an ensemble with the scorer.
  2. Blocking as a feature: add dummy features indicating which predicate rules cover a pair. These features get fed into the scorer.
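
A rough sketch of what both ideas might look like, not dedupe's actual API. It assumes a predicate maps a record to a set of block keys and that a pair is "covered" when the two key sets overlap. The predicate factories `first_three_chars` and `whole_field`, the helpers `coverage_features`, `pair_features` and `ensemble_score`, and the fixed `scorer_weight` are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

Record = Dict[str, str]
# Assumption: a predicate maps one record to a set of block keys.
Predicate = Callable[[Record], FrozenSet[str]]


def first_three_chars(field: str) -> Predicate:
    """Hypothetical predicate: block on the first three characters of a field."""
    def predicate(record: Record) -> FrozenSet[str]:
        value = record.get(field, "").strip().lower()
        return frozenset({value[:3]}) if value else frozenset()
    return predicate


def whole_field(field: str) -> Predicate:
    """Hypothetical predicate: block on the whole normalized field value."""
    def predicate(record: Record) -> FrozenSet[str]:
        value = record.get(field, "").strip().lower()
        return frozenset({value}) if value else frozenset()
    return predicate


def coverage_features(a: Record, b: Record,
                      predicates: Sequence[Predicate]) -> List[float]:
    """One dummy per predicate: 1.0 if both records share a block key."""
    return [1.0 if p(a) & p(b) else 0.0 for p in predicates]


def pair_features(a: Record, b: Record, distances: Sequence[float],
                  predicates: Sequence[Predicate]) -> List[float]:
    """Idea 2: append the coverage dummies to the scorer's usual features."""
    return list(distances) + coverage_features(a, b, predicates)


def ensemble_score(scorer_prob: float, votes: Sequence[float],
                   scorer_weight: float = 0.5) -> float:
    """Idea 1: blend the scorer's probability with the mean predicate vote.

    The fixed weight is a placeholder; how to learn it is the open question.
    """
    if not votes:
        return scorer_prob
    mean_vote = sum(votes) / len(votes)
    return scorer_weight * scorer_prob + (1.0 - scorer_weight) * mean_vote


a = {"name": "Robert Smith", "city": "Chicago"}
b = {"name": "Rob Smith", "city": "chicago"}
predicates = [first_three_chars("name"), whole_field("city"), whole_field("name")]

votes = coverage_features(a, b, predicates)
features = pair_features(a, b, [0.25], predicates)  # 0.25 is a made-up distance
blended = ensemble_score(0.6, votes)                # 0.6 is a made-up scorer output
```

Here the pair shares a block under the first two predicates but not the third, so a scorer trained on `features` could learn how much weight each predicate's agreement deserves.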

In both cases, I'm not quite sure how to set up the training.

NickCrews commented 2 years ago

Splink uses something very similar to method 2. See https://youtu.be/msz3T741KQI?t=2035 for a nice explanation of how they think about the different "types" of comparisons that can happen. The rest of the video has some other great thoughts and visualizations too.