NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

High odds in Fellegi-Sunter model after expectation maximization #57

Open lmores opened 2 weeks ago

lmores commented 2 weeks ago

I am playing with mismo to deduplicate postal addresses in a set of about 10k entries. After the expectation-maximization step, the odds of half of the record pairs equal 10_000_000_000, which makes it quite hard to choose a threshold that separates true matches from false ones.

As I cannot share the underlying dataset, I will avoid sharing my (messy) code and instead describe what I am doing. Hopefully that will be enough for you to offer some hints.

Deduplication Steps

0) Each record has the following fields:

1) Block the records; the corresponding upset chart is as follows:

[Image: blockers-upset]

2) Use the following comparers:

The weights after running mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000) are as follows:

[Image: comparers]

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for 'Postal Code' and the ELSE level for 'Recipients Metaphone').

3) Scoring the pairs leads to:

[Images: match-levels, match-weights]

As you can see, the pairs in the left half of the chart all have odds equal to 10_000_000_000. Also, the smallest value is 1819, which is way above the "expected" (?) range between 0.01 and 100.

Am I doing something obviously wrong?

P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?

NickCrews commented 2 weeks ago

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for 'Postal Code' and the ELSE level for 'Recipients Metaphone').

This can happen. Consider the level of _.name_l != "william shakespeare". This is going to be 100% for all record pairs, both for matches and non-matches. It's not a very useful level though, since it doesn't give you any evidence towards or against a match.
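To make that concrete with toy numbers (hypothetical, not from any real run): the evidence a level contributes is the ratio of its match rate m to its non-match rate u, so a level that fires for ~100% of both populations contributes odds of ~1.

# Hypothetical rates for a level like _.name_l != "william shakespeare"
m = 0.9999  # P(pair falls in this level | pair is a match)
u = 0.9999  # P(pair falls in this level | pair is a non-match)
print(m / u)  # ~1.0: multiplying a pair's odds by 1 changes nothing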

Also, the smallest value is 1819, which is way above the "expected" (?) range between 0.01 and 100.

Are you confusing probability with odds? We are talking about odds here. Probability is bounded between 0% and 100%; odds are totally unbounded. Have you read the glossary?
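For reference, the conversion is simple (plain Python, nothing mismo-specific): probability p maps to odds p / (1 - p), which grows without bound as p approaches 1.

def prob_to_odds(p: float) -> float:
    # 0.5 -> 1.0 (even odds); values near 1 blow up without bound
    return p / (1 - p)

def odds_to_prob(odds: float) -> float:
    return odds / (1 + odds)

print(prob_to_odds(0.5))   # 1.0
print(prob_to_odds(0.99))  # 99.0 (approximately)
print(odds_to_prob(1819))  # ~0.99945, i.e. odds of 1819 ~= 99.9% probability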

Am I doing something obviously wrong?

Consider your level rates for postal code:

[Image: postal code level rates]

It looks like 100% of pairs fall into the EXACT level, which leads to an odds of (100%/100%) = 1. In other words, if you see a pair matching this level, it gives you no useful info as to whether the pair is a match.

It looks like 0 pairs are falling into the MAJOR and ELSE levels. Again, this should lead to an odds of (0%/0%) = 1. But it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?

IDEALLY, what I am looking for in these match charts is, first, at least 100 pairs in each bin, to avoid small-sample-size issues. Second, I want an "inverse relationship" as you go between levels. Your address lines jaccard comparison has this property: ELSE is common among non-matches and rare among true matches; the other levels are common among true matches and rare among non-matches. This leads to odds that actually give you evidence towards a match/non-match.
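For context, here is a sketch of the standard Fellegi-Sunter arithmetic (toy odds, not mismo's actual internals): each comparer contributes the m/u odds of whichever level the pair lands in, and those per-comparer odds multiply, which is how a few strongly one-sided levels produce the huge totals above.

# Hypothetical per-comparer odds for one record pair:
level_odds = [50.0, 200.0, 0.1, 1.0]  # for, for, against, neutral
total = 1.0
for o in level_odds:
    total *= o  # log-odds add, so raw odds multiply
print(total)  # 1000.0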

So, I think you need to adjust your level criteria to better "spread out" the record pairs between the different levels.

P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?

Yes it is; that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.

lmores commented 1 week ago

This can happen. Consider the level of _.name_l != "william shakespeare". This is going to be 100% for all record pairs, both for matches and non-matches. It's not a very useful level though, since it doesn't give you any evidence towards or against a match.

That makes sense. Just to double-check my understanding: a comparer such that, for each level, the proportion of matches is about the same as the proportion of non-matches is quite useless, right?

Are you confusing probability with odds? We are talking about odds here. Probability is bounded between 0% and 100%; odds are totally unbounded. Have you read the glossary?

I am aware of the difference, but, since the chart is centered between 0.01 and 100, I was just wondering whether the resulting odds should mostly be contained in that interval.

It looks like 0 pairs are falling into the MAJOR and ELSE levels. Again, this should lead to an odds of (0%/0%) = 1. But it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?

It seems that both levels contain no pairs. Here is my code:

from ibis import _

# `compared` contains all address pairs that I am taking into account

# Pairs in the MAJOR level: postal codes differ, but their first two characters match
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0

# Pairs in the ELSE level: postal codes differ, and so do their first two characters
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) != _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0
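For completeness, the same bucketing can probably be done in a single query. This is just a sketch assuming the same `compared` table (the `postal_code_level` name is mine, and `ibis.case` may be spelled differently depending on your ibis version):

import ibis
from ibis import _

level = (
    ibis.case()
    .when(_.postal_code_l == _.postal_code_r, "EXACT")
    .when(_.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2), "MAJOR")
    .else_("ELSE")
    .end()
)
compared.group_by(level.name("postal_code_level")).agg(n=_.count())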

And here is some more information available in the charts:

[Screenshot from 2024-09-05 20-41-24]
[Screenshot from 2024-09-05 20-40-49]

IDEALLY, what I am looking for in these match charts is for there to be at least 100 pairs in each bin, to avoid small sample size issues.

Noted!

Yes it is; that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.

I'll have a look at altair.

Thank you!

NickCrews commented 1 week ago

a comparer such that for each level the proportion of matches is about the same as the proportion of non-matches is quite useless, right?

Exactly! However, it could SOMETIMES be useful, in an indirect way. Consider if in ~90% of records the name field is NULL. Names are equally likely to be NULL amongst matches as amongst non-matches, so if you see a NULL in a pair, it doesn't give you any evidence of match or non-match. Consider level configuration A:

vs configuration B:
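To illustrate with made-up rates, suppose ~90% of pairs have a NULL name on at least one side, and suppose configuration A lumps those NULL pairs into ELSE while configuration B gives them their own NULL level:

# Configuration A (hypothetical): NULL pairs land in ELSE, so ELSE's
# match and non-match rates converge and its evidence washes out.
m_else_a, u_else_a = 0.91, 0.93
print(m_else_a / u_else_a)  # ~0.98: ELSE is nearly useless

# Configuration B (hypothetical): a dedicated NULL level absorbs those pairs.
m_null, u_null = 0.90, 0.90
print(m_null / u_null)      # 1.0: neutral by design, as expected for NULLs

# ...leaving ELSE to capture genuinely mismatched names.
m_else_b, u_else_b = 0.01, 0.05
print(m_else_b / u_else_b)  # 0.2: real evidence against a match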

It seems that both levels contain no pairs

Yes, it looks like a very tiny number of pairs falls into each category, and so the calculated odds of 599.99... is probably not very accurate. As you understand now, aim for 100+ per bin.
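To see why, with made-up counts: when only a handful of pairs lands in a level, m and u are ratios of tiny numbers, and their quotient swings wildly with each noisy pair.

# Hypothetical counts: 2 of 1_000 match pairs, 3 of 900_000 non-match pairs
m = 2 / 1_000     # estimated P(level | match)
u = 3 / 900_000   # estimated P(level | non-match)
print(m / u)      # 600.0 -- a huge odds driven entirely by a few noisy pairs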

Your screenshot reveals that we don't currently show the number of pairs. That would be good to do. I will try to fix this, or if you want to submit a PR for that, that would also be appreciated!

EDIT: this actually looks like a large change; it would require changing the LevelWeights class to actually keep track of the number of pairs, instead of merely u and m as it does currently.

NickCrews commented 1 week ago

This thread is super useful for figuring out pain points, thank you for your effort here.

Ideally I would like this info to be exposed.