Blocking as a feature for scoring

dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

MIT License


Right now, blocking and scoring are two distinct phases.

All the information about how two records came to be blocked together goes unused by the scorer. This is a bit silly, since the fact that two records are blocked together by multiple predicates could be a pretty good indicator of co-reference.

I'm not really sure what the best way to take advantage of blocking information in scoring is, though.

A few ideas:

Ensemble model: treat each blocking predicate as a classifier, and put them in an ensemble with the scorer.
Blocking as a feature: add dummy features indicating which predicate rules cover a pair. These features get fed into the scorer.
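To make the "blocking as a feature" idea concrete, here is a minimal sketch of building one 0/1 dummy per predicate for a candidate pair. Everything in it is an assumption for illustration: the predicate functions, the `blocking_features` and `augment` helpers, and the record layout are hypothetical and are not dedupe's actual API. It only assumes that a predicate maps a record to a set of block keys, and that a pair is "covered" by a predicate when the two records share at least one key.

```python
from typing import Callable, Dict, List, Sequence, Set

Record = Dict[str, str]
# Assumption: a predicate maps a record to a set of block keys.
Predicate = Callable[[Record], Set[str]]


def name_prefix(record: Record) -> Set[str]:
    """Hypothetical predicate: first three characters of the lowercased name."""
    name = record["name"].strip().lower()
    return {name[:3]} if name else set()


def whole_zip(record: Record) -> Set[str]:
    """Hypothetical predicate: the full zip code."""
    zip_code = record["zip"].strip()
    return {zip_code} if zip_code else set()


def name_tokens(record: Record) -> Set[str]:
    """Hypothetical predicate: each lowercased word in the name."""
    return set(record["name"].lower().split())


PREDICATES: List[Predicate] = [name_prefix, whole_zip, name_tokens]


def blocking_features(
    a: Record, b: Record, predicates: Sequence[Predicate] = PREDICATES
) -> List[int]:
    """One dummy per predicate: 1 if both records share a block key under it."""
    return [int(bool(p(a) & p(b))) for p in predicates]


def augment(distances: Sequence[float], a: Record, b: Record) -> List[float]:
    """Append the blocking dummies to the field-distance vector for the scorer."""
    return list(distances) + blocking_features(a, b)


a = {"name": "Acme Hardware", "zip": "60614"}
b = {"name": "ACME Hardware Inc", "zip": "60614"}
c = {"name": "Zenith Foods", "zip": "60614"}

print(blocking_features(a, b))  # covered by all three predicates
print(blocking_features(a, c))  # covered only by the zip predicate
```

Summing the dummies gives the number of predicates that cover a pair, which is the "blocked together by multiple predicates" signal described above. This sketch does not address how to train the scorer on the augmented vectors.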

In both cases, I'm not quite sure how to set up the training.

Blocking as a feature for scoring #1103