NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Incremental clustering #36

Closed by lmores 7 months ago

lmores commented 7 months ago

Hi!

First of all, a quick introduction: I am a dedupe user (I also experimented with splink before choosing dedupe over it), and I fully agree with the description and analysis of those libraries made here. The idea of starting a new project that builds on their strengths and learns from their mistakes is truly awesome!

That said, for a few months now I have been trying to bend dedupe to my needs, which mostly means achieving incremental clustering on a collection of street addresses that grows daily. My question is: are there plans to support incremental clustering, starting from a large base data set and continuously adding new records without reanalysing everything from scratch?

The actions I would like to perform on a new record are:

  1. retrieve the addresses similar to the new one in O(1) time,
  2. add the incoming record to the cluster those similar addresses belong to.
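
As a rough illustration of those two steps, here is a minimal Python sketch. It is not mismo's or dedupe's API: `blocking_key` and `IncrementalClusterer` are hypothetical names invented for this example, and the sketch assumes that two addresses belong together only when their normalized keys match exactly. Real record linkage would follow the lookup with fuzzy pairwise comparison.

```python
from collections import defaultdict


def blocking_key(address: str) -> str:
    """Hypothetical blocking key: lowercase, drop punctuation, sort tokens."""
    cleaned = "".join(c if c.isalnum() else " " for c in address.lower())
    return " ".join(sorted(cleaned.split()))


class IncrementalClusterer:
    """Toy incremental clusterer backed by a hash index (an assumption,
    not how any particular library does it)."""

    def __init__(self) -> None:
        self._key_to_cluster: dict[str, int] = {}
        self._clusters: defaultdict[int, list[str]] = defaultdict(list)
        self._next_id = 0

    def candidates(self, address: str) -> list[str]:
        # Step 1: average-case O(1) dict lookup on the blocking key.
        cluster_id = self._key_to_cluster.get(blocking_key(address))
        if cluster_id is None:
            return []
        return list(self._clusters[cluster_id])

    def add(self, address: str) -> int:
        # Step 2: attach the record to the matching cluster,
        # or open a new cluster if no key matches.
        key = blocking_key(address)
        cluster_id = self._key_to_cluster.get(key)
        if cluster_id is None:
            cluster_id = self._next_id
            self._next_id += 1
            self._key_to_cluster[key] = cluster_id
        self._clusters[cluster_id].append(address)
        return cluster_id


clusterer = IncrementalClusterer()
first = clusterer.add("12 Main St.")
second = clusterer.add("main st 12")   # same key, joins the first cluster
third = clusterer.add("99 Oak Ave")    # new key, opens a new cluster
similar = clusterer.candidates("St Main, 12")
```

The lookup cost does not grow with the number of stored records, which is the property asked for above, but exact key matching will miss typos and abbreviations that a trained model would catch.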

I may be able to help with the development, although I do not have a deep understanding of the subject.

(This request would better fit the "Discussions" section of GitHub; however, as it does not seem to be used in this repository, I am posting it here.)

NickCrews commented 7 months ago

Moving to https://github.com/NickCrews/mismo/discussions/38 😄