dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Explanation on the entity matches for Identity resolution #1178

Open KarthikeyanTWL opened 9 months ago

KarthikeyanTWL commented 9 months ago

Hi folks, I have a few questions about dedupe:

  1. Dedupe provides a confidence score for each match, but can it also explain why two records were matched? E.g.: "These two records are matched because the first name and the phone number are the same", or something like that?

  2. When new data comes in, can it be matched against existing identities? E.g.: the data arrives from a Kafka stream and identity resolution has to be done in real time for each new record.

  3. Is there an enterprise option available? If yes, what additional features does it provide?

Thanks in advance!

ArVar commented 3 months ago

To point 1: In theory it should be possible, since the pairing is based on hierarchical clustering and logistic regression. The problem is the potentially vast number of predicates that are actually learned. As far as I understand, the hierarchical tree is built from the weights learned for the predicates, which would make real explainability hard to derive. But I might be wrong. Still, such a feature, although probably very costly, would be very nice. 👍
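To make the idea concrete: with a logistic-regression scorer, the log-odds of a match is a bias plus a weighted sum of per-field features, so each field's term can be read as its contribution to the score. The sketch below is plain Python and does not use dedupe's API. The function name `explain_match`, the field names, the weights and the bias are all hypothetical, and it assumes similarity features where 1.0 means identical (dedupe itself works with field distances, so the signs would flip).

```python
import math


def explain_match(similarities, weights, bias):
    """Return (match probability, per-field contributions ranked by magnitude).

    All inputs are hypothetical stand-ins for a trained model:
    `similarities` maps field name -> similarity in [0, 1],
    `weights` maps field name -> learned coefficient, `bias` is the intercept.
    """
    # Each field contributes weight * feature value to the log-odds.
    contributions = {
        field: weights[field] * value for field, value in similarities.items()
    }
    log_odds = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    # Largest absolute contribution first: these are the "reasons" for the score.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


# Illustrative values only, not taken from a real dedupe model.
weights = {"first_name": 2.0, "phone": 3.0, "address": 1.0}
bias = -3.0
similarities = {"first_name": 1.0, "phone": 1.0, "address": 0.2}

probability, ranked = explain_match(similarities, weights, bias)
for field, contribution in ranked:
    print(f"{field}: {contribution:+.2f}")
print(f"match probability: {probability:.3f}")
```

This explains only the pairwise score. It says nothing about why blocking proposed the pair in the first place or how clustering grouped it, which is where the difficulty described above comes in.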

To point 2: That is what the (Static)RecordLink and (Static)Gazetteer classes are for. (The hard part will be maintaining the ground truth somewhere ;-) )
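A minimal sketch of how matching one incoming record at a time might look with a gazetteer. It assumes an object that offers `index(data)` and `search(data, threshold=..., n_matches=...)` in the way dedupe's Gazetteer classes are documented to, with `search` returning `(incoming_id, ((canonical_id, score), ...))` tuples; check that shape against the dedupe version you use. The helper `resolve_incoming` is hypothetical, and the stream consumer and persistent storage of canonical records are left out.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_incoming(
    gazetteer: Any,
    record_id: str,
    record: Dict[str, Any],
    threshold: float = 0.5,
) -> Tuple[str, Optional[float]]:
    """Match one incoming record against the known identities.

    Returns (canonical_id, score) when a match is found, otherwise registers
    the record as a new identity and returns (record_id, None).
    """
    # Assumed result shape: [(incoming_id, ((canonical_id, score), ...)), ...]
    results = gazetteer.search({record_id: record}, threshold=threshold, n_matches=1)
    for _incoming_id, matches in results:
        if matches:
            canonical_id, score = matches[0]
            return canonical_id, float(score)

    # Nothing scored above the threshold: index the record so that
    # later records can match against it.
    gazetteer.index({record_id: record})
    return record_id, None
```

In a streaming setup this function would be called once per consumed message. The index built this way lives in the gazetteer object, so the canonical records still have to be stored somewhere durable and re-indexed after a restart.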

See also: