NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

High odds in Fellegi-Sunter model after expectation maximization #57

Open lmores opened 2 weeks ago

lmores commented 2 weeks ago

I am playing with mismo to deduplicate postal addresses in a set of about 10k entries. After the expectation-maximization step, the odds of half of the record pairs equal 10_000_000_000, which makes it quite hard to choose a threshold that separates true matches from false ones.

As I cannot share the underlying dataset, I will avoid sharing my (messy) code and instead describe what I am doing. Hopefully that will be enough for you to offer some hints.

Deduplication Steps

0) Each record has the following fields:

1) Block the records; the corresponding upset chart is as follows:

[Image: blockers-upset]

2) Use the following comparers:

The weights after running mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000) are as follows:

[Image: comparers]

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for 'Postal Code' and the ELSE level for 'Recipients Metaphone').

3) Scoring the pairs leads to:

[Images: match-levels, match-weights]

As you can see, the pairs in the left half of the chart all have odds equal to 10_000_000_000. Also, the smallest value is 1819, which is way above the "expected" (?) range between 0.01 and 100.

Am I doing something obviously wrong?

P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?

NickCrews commented 2 weeks ago

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for 'Postal Code' and the ELSE level for 'Recipients Metaphone').

This can happen. Consider the level of _.name_l != "william shakespeare". This is going to be 100% for all record pairs, both for matches and non-matches. It's not a very useful level though, since it doesn't give you any evidence towards or against a match.
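To make that concrete with toy numbers (hypothetical, not from any real run): the evidence a level contributes is the ratio of its match rate m to its non-match rate u, so a level that fires for ~100% of both populations contributes odds of ~1.

# Hypothetical rates for a level like _.name_l != "william shakespeare"
m = 0.9999  # P(pair falls in this level | pair is a match)
u = 0.9999  # P(pair falls in this level | pair is a non-match)
print(m / u)  # ~1.0: multiplying a pair's odds by 1 changes nothing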

Also, the smallest value is 1819, which is way above the "expected" (?) range between 0.01 and 100.

Are you confusing probability with odds? We are talking about odds here. Probability is bounded between 0% and 100%; odds are totally unbounded. Have you read the glossary?
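For reference, the conversion is simple (plain Python, nothing mismo-specific): probability p maps to odds p / (1 - p), which grows without bound as p approaches 1.

def prob_to_odds(p: float) -> float:
    # 0.5 -> 1.0 (even odds); values near 1 blow up without bound
    return p / (1 - p)

def odds_to_prob(odds: float) -> float:
    return odds / (1 + odds)

print(prob_to_odds(0.5))   # 1.0
print(prob_to_odds(0.99))  # 99.0 (approximately)
print(odds_to_prob(1819))  # ~0.99945, i.e. odds of 1819 ~= 99.9% probability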

Am I doing something obviously wrong?

Consider your level rates for postal code:

[Image: postal code level rates]

It looks like 100% of pairs fall into the EXACT level, which leads to an odds of (100%/100%) = 1. In other words, if you see a pair matching this level, it gives you no useful info as to whether the pair is a match.

It looks like 0 pairs are falling into the MAJOR and ELSE levels. Again, this should lead to an odds of (0%/0%) = 1. But it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?

IDEALLY, what I am looking for in these match charts is, first, at least 100 pairs in each bin, to avoid small-sample-size issues. Second, I want an "inverse relationship" as you go between levels. Your address lines jaccard comparison has this property: ELSE is common among non-matches and rare among true matches; the other levels are common among true matches and rare among non-matches. This leads to odds that actually give you evidence towards a match/non-match.
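For context, here is a sketch of the standard Fellegi-Sunter arithmetic (toy odds, not mismo's actual internals): each comparer contributes the m/u odds of whichever level the pair lands in, and those per-comparer odds multiply, which is how a few strongly one-sided levels produce the huge totals above.

# Hypothetical per-comparer odds for one record pair:
level_odds = [50.0, 200.0, 0.1, 1.0]  # for, for, against, neutral
total = 1.0
for o in level_odds:
    total *= o  # log-odds add, so raw odds multiply
print(total)  # 1000.0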

So, I think you need to adjust your level criteria to better "spread out" the record pairs between the different levels.

P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?

Yes it is; that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.

lmores commented 1 week ago

This can happen. Consider the level of _.name_l != "william shakespeare". This is going to be 100% for all record pairs, both for matches and non-matches. It's not a very useful level though, since it doesn't give you any evidence towards or against a match.

That makes sense. Just to double-check my understanding: a comparer such that, for each level, the proportion of matches is about the same as the proportion of non-matches is quite useless, right?

Are you confusing probability with odds? We are talking about odds here. Probability is bounded between 0% and 100%; odds are totally unbounded. Have you read the glossary?

I am aware of the difference, but, since the chart is centered between 0.01 and 100, I was just wondering whether the resulting odds should mostly be contained in that interval.

It looks like 0 pairs are falling into the MAJOR and ELSE levels. Again, this should lead to an odds of (0%/0%) = 1. But it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?

It seems that both levels contain no pairs. Here is my code:

from ibis import _

# `compared` contains all address pairs that I am taking into account

# Pairs in the MAJOR level: postal codes differ, but their first two characters match
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0

# Pairs in the ELSE level: postal codes differ, and so do their first two characters
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) != _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0
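For completeness, the same bucketing can probably be done in a single query. This is just a sketch assuming the same `compared` table (the `postal_code_level` name is mine, and `ibis.case` may be spelled differently depending on your ibis version):

import ibis
from ibis import _

level = (
    ibis.case()
    .when(_.postal_code_l == _.postal_code_r, "EXACT")
    .when(_.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2), "MAJOR")
    .else_("ELSE")
    .end()
)
compared.group_by(level.name("postal_code_level")).agg(n=_.count())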

And here is some more information available in the charts:

[Screenshot from 2024-09-05 20-41-24]
[Screenshot from 2024-09-05 20-40-49]

IDEALLY, what I am looking for in these match charts is for there to be at least 100 pairs in each bin, to avoid small sample size issues.

Noted!

Yes it is; that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.

I'll have a look at altair.

Thank you!

NickCrews commented 1 week ago

a comparer such that for each level the proportion of matches is about the same as the proportion of non-matches is quite useless, right?

Exactly! However, it could SOMETIMES be useful, in an indirect way. Consider if in ~90% of records the name field is NULL. Names are equally likely to be NULL amongst matches as amongst non-matches, so if you see a NULL in a pair, it doesn't give you any evidence of match or non-match. Consider level configuration A:

vs configuration B:
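To illustrate with made-up rates, suppose ~90% of pairs have a NULL name on at least one side, and suppose configuration A lumps those NULL pairs into ELSE while configuration B gives them their own NULL level:

# Configuration A (hypothetical): NULL pairs land in ELSE, so ELSE's
# match and non-match rates converge and its evidence washes out.
m_else_a, u_else_a = 0.91, 0.93
print(m_else_a / u_else_a)  # ~0.98: ELSE is nearly useless

# Configuration B (hypothetical): a dedicated NULL level absorbs those pairs.
m_null, u_null = 0.90, 0.90
print(m_null / u_null)      # 1.0: neutral by design, as expected for NULLs

# ...leaving ELSE to capture genuinely mismatched names.
m_else_b, u_else_b = 0.01, 0.05
print(m_else_b / u_else_b)  # 0.2: real evidence against a match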

It seems that both levels contain no pairs

Yes, it looks like a very tiny number of pairs falls into each category, and so the calculated odds of 599.99... is probably not very accurate. As you understand now, aim for 100+ per bin.
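To see why, with made-up counts: when only a handful of pairs lands in a level, m and u are ratios of tiny numbers, and their quotient swings wildly with each noisy pair.

# Hypothetical counts: 2 of 1_000 match pairs, 3 of 900_000 non-match pairs
m = 2 / 1_000     # estimated P(level | match)
u = 3 / 900_000   # estimated P(level | non-match)
print(m / u)      # 600.0 -- a huge odds driven entirely by a few noisy pairs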

Your screenshot reveals that we don't currently show the number of pairs. That would be good to do. I will try to fix this, or if you want to submit a PR for that, that would also be appreciated!

EDIT: this actually looks like a large change; it would require changing the LevelWeights class to actually keep track of the number of pairs, instead of merely u and m as it does currently.

NickCrews commented 1 week ago

This thread is super useful for figuring out pain points, thank you for your effort here.

Ideally I would like this info to be exposed.