dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
https://docs.dedupe.io
MIT License

Consider a more sklearn-like pipeline approach #1207

Open fgregg opened 2 months ago

fgregg commented 2 months ago
  1. Break out all the active learning bits into one or more separate classes.

  2. Train a blocking model using the familiar fit_transform syntax. This is a separate class that emits a stream of pairs. (Is this something that could really fit into the sklearn pattern?)

  3. Train a classification model using fit_transform. This takes in a stream of pairs and emits a stream of classification decisions.

Actually, this would all work quite well.

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
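
A rough sketch of steps 2 and 3, to make the shape of the idea concrete. None of this is dedupe's API: `PrefixBlocker` and `SimilarityClassifier` are hypothetical names, the prefix rule and the `difflib` similarity score are stand-ins for learned models, and `fit` is a no-op where real training would happen. The classes only follow the sklearn naming convention (`fit`, `transform`, `fit_transform`) and are chained by hand rather than through `sklearn.pipeline.Pipeline`.

```python
from difflib import SequenceMatcher
from itertools import combinations


class PrefixBlocker:
    """Hypothetical blocker: pairs records whose field shares its first n characters."""

    def __init__(self, field, n=3):
        self.field = field
        self.n = n

    def fit(self, records, y=None):
        # A real blocker would learn its blocking rules here; this stand-in learns nothing.
        return self

    def transform(self, records):
        # Group record ids by blocking key, then emit every pair within each block.
        blocks = {}
        for record_id, record in records.items():
            key = record[self.field][: self.n].lower()
            blocks.setdefault(key, []).append(record_id)
        for ids in blocks.values():
            for id_a, id_b in combinations(sorted(ids), 2):
                yield (id_a, records[id_a]), (id_b, records[id_b])

    def fit_transform(self, records, y=None):
        return self.fit(records, y).transform(records)


class SimilarityClassifier:
    """Hypothetical classifier: calls a pair a match above a fixed similarity threshold."""

    def __init__(self, field, threshold=0.85):
        self.field = field
        self.threshold = threshold

    def fit(self, pairs, y=None):
        # A real classifier would be trained on labeled pairs; the threshold is fixed here.
        return self

    def transform(self, pairs):
        # Consume a stream of pairs and emit a stream of (id_a, id_b, is_match) decisions.
        for (id_a, rec_a), (id_b, rec_b) in pairs:
            a = rec_a[self.field].lower()
            b = rec_b[self.field].lower()
            score = SequenceMatcher(None, a, b).ratio()
            yield id_a, id_b, score >= self.threshold

    def fit_transform(self, pairs, y=None):
        return self.fit(pairs, y).transform(pairs)


records = {
    1: {"name": "Robert Smith"},
    2: {"name": "Robert Smyth"},
    3: {"name": "Roberta Jones"},
    4: {"name": "Alice Wu"},
}

# Chain the two steps by hand: records -> candidate pairs -> decisions.
pairs = PrefixBlocker("name").fit_transform(records)
decisions = list(SimilarityClassifier("name").fit_transform(pairs))
```

Both steps are generators, so pairs are scored as they are produced and never materialized as a full list. Whether a stream of pairs can pass through sklearn's own `Pipeline`, which is built around array-like `X`, is the open question raised in step 2.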

fgregg commented 2 months ago

We can think of blocking as related to clustering and use that as inspiration.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
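
One way to read the clustering analogy: a blocker could expose the same surface as a clusterer such as `KMeans`, with `fit` storing a `labels_` attribute and `fit_predict` returning one block label per record. The sketch below is an assumption about what that might look like, not existing dedupe code; `KeyBlocker`, `key_func`, and `pairs_from_labels` are made-up names, and the blocking key is supplied by the caller rather than learned.

```python
from collections import defaultdict
from itertools import combinations


class KeyBlocker:
    """Hypothetical blocker with a clusterer-style interface (fit, labels_, fit_predict)."""

    def __init__(self, key_func):
        self.key_func = key_func

    def fit(self, records, y=None):
        keys = [self.key_func(record) for record in records]
        # Give each distinct key an integer label in first-seen order.
        self.key_to_label_ = {}
        for key in keys:
            self.key_to_label_.setdefault(key, len(self.key_to_label_))
        self.labels_ = [self.key_to_label_[key] for key in keys]
        return self

    def fit_predict(self, records, y=None):
        return self.fit(records, y).labels_


def pairs_from_labels(labels):
    """Yield index pairs for records that share a block label."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)
    for indices in members.values():
        yield from combinations(indices, 2)


names = ["Robert Smith", "Robert Smyth", "Alice Wu", "robert jones"]
blocker = KeyBlocker(key_func=lambda name: name[:3].lower())
labels = blocker.fit_predict(names)
candidate_pairs = list(pairs_from_labels(labels))
```

The analogy is imperfect: hard cluster labels put each record in exactly one block, while a blocker may want to place a record in several blocks at once, so a label-per-record interface would need extending to cover that case.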