J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

[Feature Req/Question] EM-algorithm for frequency based estimates #19

Open leduke2000 opened 8 years ago

leduke2000 commented 8 years ago

Hi,

Just wondering whether the EM algorithm for frequency-based estimates, or any other algorithm that takes value frequencies into account, is or will be included in the package?

Thanks!!

sbagroy986 commented 7 years ago

@leduke2000 did you manage to find an answer to your question? I'm also looking for implementations of frequency-based estimation (and then linkage) algorithms. If not, did you manage to find any other implementation which does so?

@J535D165 Any info on this would really be appreciated. Thanks.

J535D165 commented 7 years ago

Hello,

Frequency-based estimates are complicated. They often depend on the classifier and the comparison method. Are you guys looking for a specific frequency-based algorithm, like the one proposed by Fellegi and Sunter or the modified version by Winkler?

The package contains a function named value_occurence. It can be used to add a column with the number of occurrences of each attribute value. I'm not sure if it works, but maybe something like the following improves linkage performance.

import recordlinkage as rl

from recordlinkage.standardise import value_occurence
from recordlinkage.datasets import load_febrl4

A, B = load_febrl4()

# Add a column with the number of occurrences of each given name.
A['freq_given_name'] = value_occurence(A['given_name'])

# Make candidate pairs, blocking on surname.
cl_index = rl.Pairs(A, B)
pairs = cl_index.block('surname')

# Compare the given names of the candidate pairs.
cl_comp = rl.Compare(pairs, A, B)
x = cl_comp.string('given_name', 'given_name', ...)

# Look up the frequency of the left-hand given name for each pair. Use
# .values so the multiplication below aligns positionally with the
# MultiIndexed comparison result rather than on the index labels.
y = A.loc[pairs.get_level_values(0), 'freq_given_name'].values

z = x * y  # maybe take the inverse of y?

# use x*y in your logistic regression classifier
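If you go that route, a rough (untested) sketch of feeding the weighted feature into the logistic regression classifier could look like this; train_pairs and train_matches are placeholder names for training data you would need to supply yourself:

import pandas as pd

# Stack the plain and the frequency-weighted similarity as features
# (x and z come from the snippet above).
features = pd.DataFrame({'name_sim': x, 'name_sim_weighted': z})

# train_pairs (a MultiIndex of labelled candidate pairs) and
# train_matches (the true links among them) are hypothetical.
clf = rl.LogisticRegressionClassifier()
clf.fit(features.loc[train_pairs], train_matches)

links = clf.predict(features)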

I'm not sure if there are any packages with frequency-based estimates.

sbagroy986 commented 7 years ago

Thanks @J535D165. What I'm basically looking for is a way to compute the probability of a link between two records while taking all the other links into account. For example, if A is matched to B1 but is also matched to B2, B3, ..., B10, the probability of matching A to Bi should account for that fact. It looks like frequency-based matching would automatically account for this.

I'm not looking for any specific algorithm, I just need one that does what I described above.

sbagroy986 commented 7 years ago

Hi @J535D165, I came across a PDF of your thesis where you provide some Python code in the appendix. It looks like there's a section there (D-2) with some code for an example of frequency-based estimation; is there some way to use that with your library here?

J535D165 commented 7 years ago

Hey @sbagroy986, I think you are struggling with a one-to-one matching problem in a bipartite graph. This is known as the Assignment problem.

Okay, record a1 is matched with b1, b2, b3, ..., b10. Most of the time, the comparison vectors (a1,b1), (a1,b2), ..., (a1,b10) aren't identical. (If they are, consider using more comparison variables or the method described above.) Because the comparison vectors are not identical, they result in different matching probabilities. These probabilities can be used for one-to-one matching.

import pandas as pd

# Ten candidate pairs: record 1 in A against records 1..10 in B.
mindex = pd.MultiIndex.from_arrays(
    [[1,1,1,1,1,1,1,1,1,1],[1,2,3,4,5,6,7,8,9,10]]
)

probs = pd.Series(
    [0.5, 0.9, 0.1, 0.2, 0.5, 0.7, 0.8, 0.85, 0.6, 0.7],
    index=mindex
)

print(probs)

# milestone v0.9
def one_to_one_matching(weights):
    # Greedy one-to-one matching: sort the pairs by weight and keep a
    # pair only if neither of its records has already been used by a
    # higher-weight pair.
    w_sort = weights.sort_values(ascending=False)

    dup_level_0 = w_sort.index.get_level_values(0).duplicated(keep='first')
    dup_level_1 = w_sort.index.get_level_values(1).duplicated(keep='first')

    return w_sort[(~dup_level_0 & ~dup_level_1)]

print(one_to_one_matching(probs))
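Note that the greedy approach above is not guaranteed to maximize the total matching weight. If you want an optimal solution to the assignment problem, something like the following sketch (not part of this package; it reuses the probs Series from above) with SciPy's solver should work:

from scipy.optimize import linear_sum_assignment

def one_to_one_optimal(weights):
    # Pivot the candidate pairs into a dense weight matrix. Unobserved
    # pairs get weight 0, so they are never preferred over observed ones.
    matrix = weights.unstack(fill_value=0)

    # linear_sum_assignment minimizes total cost, so negate the weights
    # to maximize the total matching weight instead.
    row_ind, col_ind = linear_sum_assignment(-matrix.values)

    assigned = [
        (matrix.index[r], matrix.columns[c])
        for r, c in zip(row_ind, col_ind)
    ]

    # Keep only assignments that correspond to observed candidate pairs.
    return weights.loc[[p for p in assigned if p in weights.index]]

print(one_to_one_optimal(probs))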

If this doesn't help you, can you give me a bit more information about your problem? A few comparison vectors, for example, and the classifier you are using.

sbagroy986 commented 7 years ago

Hi @J535D165, sorry, I think I wasn't clear enough before. Let me try to explain my problem again. I have two different sets of records, say A and B. I would like to probabilistically link records from A to records in B, but I would also like a measure of "certainty" for each matched pair. I'm inclined to just use the positive-link probabilities given by the Fellegi-Sunter algorithm, but they don't work right out of the box for the reasons below.

Consider (fabricated) examples a1 from A and b1, b2, b3 from B:

a1 = {'first_name': 'John', 'middle_name': 'Allan', 'last_name': 'Smith', 'date_of_birth': '01/02/1980', 'country': 'USA'}
b1 = {'first_name': 'John', 'middle_name_start': 'E', 'last_name': 'Smith', 'country': ''}
b2 = {'first_name': 'John', 'middle_name': 'A', 'last_name': 'Smith', 'country': 'Canada'}
b3 = {'first_name': 'John', 'middle_name': 'A', 'last_name': 'Smith', 'country': 'USA'}

Let's assume Fellegi-Sunter assigns a1-b3 a positive link probability p1.

Now, let's consider the following set of examples:

a1 = {'first_name': 'John', 'middle_name': 'Allan', 'last_name': 'Smith', 'date_of_birth': '01/02/1980', 'country': 'USA'}
b1 = {'first_name': 'John', 'middle_name_start': 'E', 'last_name': 'Smith', 'country': ''}
b2 = {'first_name': 'John', 'middle_name': 'A', 'last_name': 'Smith', 'country': 'Canada'}
b3 = {'first_name': 'John', 'middle_name': 'A', 'last_name': 'Smith', 'country': 'USA'}
b4 = {'first_name': 'John', 'middle_name': 'A', 'last_name': 'Smith', 'country': 'USA'}

Given that b3 and b4 are identical, Fellegi-Sunter would assign a1-b3 and a1-b4 the same positive link probability, say p2.

According to my (limited) understanding of the algorithm, p1 = p2, since Fellegi-Sunter doesn't account for the distribution of values in the fields themselves. However, clearly, there is more uncertainty in the second case, where a1 could be linked to either b3 or b4, so p2 should be lower than p1 (p2 < p1).

In general, if there is more than one possible link, I would like to incorporate that when assessing my "certainty" measure, which Fellegi-Sunter doesn't do by default. Based on my understanding, the frequency-based approach should account for this scenario, right?
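To make this concrete, here is a toy sketch of the behaviour I'd want. The normalization is made up purely for illustration and is not a real method:

import pandas as pd

# Scenario 1: a1 has a single strong candidate, b3.
p_one = pd.Series(
    [0.9], index=pd.MultiIndex.from_tuples([('a1', 'b3')])
)

# Scenario 2: a1 has two identical candidates, b3 and b4, both with
# the same Fellegi-Sunter probability.
p_two = pd.Series(
    [0.9, 0.9],
    index=pd.MultiIndex.from_tuples([('a1', 'b3'), ('a1', 'b4')])
)

def certainty(probs):
    # Made-up normalization: competing candidates for the same
    # left-hand record split the probability mass, so identical
    # candidates lower the certainty of each individual link.
    return probs / probs.groupby(level=0).transform('sum')

print(certainty(p_one))   # a1-b3 -> 1.0
print(certainty(p_two))   # a1-b3 and a1-b4 -> 0.5 each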

I hope I was clear enough this time. Feel free to tell me if I wasn't, though; I'll try to explain better. Thanks :)

J535D165 commented 7 years ago

Thanks for your detailed explanation! Your problem seems quite complicated. I'm not sure if I can help you. I have never read a paper addressing this.

Some remarks:

- How many matches like this do you have?
- Are there other ways to improve your linkage?

If there is anyone with suggestions on this topic, let me know!

mkdeak commented 7 years ago

Hi @J535D165,

I was very impressed by this repository. Based on your last comment above, however, I'm a bit confused about whether this is something I can deploy for my use case: I'm looking for duplicated contacts in the contact list of my user(s). This means that the datasets A and B are the same, with potentially many contacts in A that represent the same human. Is this something I can't use Fellegi-Sunter for? I understand that your library has a number of other linking algorithms, although I won't be able to use supervised learning, because I don't currently have a training dataset...

I thought the Fellegi-Sunter algorithm was usable for my use case. Is my understanding wrong on this?
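For reference, here is roughly what I had in mind; a rough, untested sketch that assumes the newer Index/Compare API and the unsupervised ECMClassifier rather than the rl.Pairs style used above:

import recordlinkage as rl
from recordlinkage.datasets import load_febrl1

# febrl1 is a single dataset that contains duplicates.
df = load_febrl1()

# Passing one DataFrame to the indexer yields deduplication pairs:
# each record is compared against the others in the same dataset.
indexer = rl.Index()
indexer.block('given_name')
pairs = indexer.index(df)

comp = rl.Compare()
comp.string('surname', 'surname', method='jarowinkler', threshold=0.85)
comp.exact('date_of_birth', 'date_of_birth')
comp.exact('suburb', 'suburb')
features = comp.compute(pairs, df)

# Unsupervised EM-based (Fellegi-Sunter style) classifier: no training
# data is required.
ecm = rl.ECMClassifier()
duplicates = ecm.fit_predict(features)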

Thanks a lot for this!