First of all, a quick introduction: I am a dedupe user (I also experimented with splink before choosing the former over it) and I must say that I totally agree with the description/analysis of those libraries made here.
The idea of starting a new project building on their strengths and learning from their mistakes is truly awesome!
That said, for a few months now I have been trying to bend dedupe to my needs, which mostly means achieving incremental clustering on a collection of street addresses that grows daily.
My question is: are there plans to support incremental clustering, that is, starting from a large base data set and continuously adding new records without reanalysing everything from scratch?
The actions I would like to perform on a new record are:
- retrieve the addresses similar to the new one (its cluster) in O(1) time,
- actually add the incoming record to that cluster.
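
To make those two actions concrete, here is a minimal sketch of what I have in mind. It is not dedupe's or splink's API: the `IncrementalClusters` class, the `blocking_key` function, and the 0.8 threshold are all hypothetical names and values chosen for illustration, and the similarity measure is just `difflib.SequenceMatcher` from the standard library. The idea is that a hash index keyed on a blocking key gives an average O(1) lookup of candidate clusters; the comparison inside a block still costs time proportional to the block size.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(address: str) -> str:
    """Hypothetical blocking key: house number plus the first three letters
    of the first alphabetic token."""
    tokens = address.lower().replace(",", " ").split()
    number = next((t for t in tokens if t.isdigit()), "")
    word = next((t for t in tokens if t.isalpha()), "")
    return f"{number}:{word[:3]}"


class IncrementalClusters:
    """Illustrative incremental clustering over a hash-based blocking index."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold      # assumed similarity cut-off
        self.clusters = {}              # cluster id -> list of addresses
        self.index = defaultdict(set)   # blocking key -> cluster ids
        self._next_id = 0

    def candidates(self, address: str) -> set:
        """Average O(1) dictionary lookup of the clusters sharing the key."""
        return self.index.get(blocking_key(address), set())

    def add(self, address: str) -> int:
        """Attach the record to its best matching cluster, or open a new one."""
        best_id, best_score = None, 0.0
        for cluster_id in self.candidates(address):
            for member in self.clusters[cluster_id]:
                score = SequenceMatcher(
                    None, address.lower(), member.lower()
                ).ratio()
                if score > best_score:
                    best_id, best_score = cluster_id, score
        if best_id is None or best_score < self.threshold:
            # No sufficiently similar cluster: start a new one.
            best_id = self._next_id
            self._next_id += 1
            self.clusters[best_id] = []
        self.clusters[best_id].append(address)
        self.index[blocking_key(address)].add(best_id)
        return best_id
```

With this sketch, calling `add` on each incoming address returns the id of the cluster it joined, so the base data set can be loaded once and new records appended one at a time without reprocessing the existing ones.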
I may be able to help with the development, although I do not have a deep understanding of the subject.
(This request would better fit the "Discussions" section of GitHub. However, as that section does not seem to be used in this repository, I am posting it here.)