lmores opened 2 weeks ago
> I am not an expert of the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for Postal Code and the ELSE level in 'Recipients Metaphone').
This can happen. Consider the level `_.name_l != "william shakespeare"`. This is going to be 100% for all record pairs, both for matches and non-matches. It's not a very useful level though, since it doesn't give you any evidence towards or against a match.
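To make the arithmetic concrete, here is a tiny sketch (the proportions are made up, not taken from any chart) of why such a level carries no evidence:

```python
# Hypothetical level: _.name_l != "william shakespeare"
m = 1.0  # proportion of true-match pairs falling into this level
u = 1.0  # proportion of non-match pairs falling into this level

odds = m / u  # the evidence this level contributes toward a match
print(odds)   # 1.0 -> multiplying a pair's total odds by 1 changes nothing
```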
> Also, the smallest value is 1819 which is way above the "expected" (?) range between 0.01 and 100.
Are you confusing probability with odds? We are talking about odds here. Probability could be between 0 and 100%. Odds are totally unbounded. Have you read the glossary?
> Am I doing something obviously wrong?
Consider your level rates for postal code:
It looks like 100% of pairs fall into the EXACT level, which leads to an odds of (100%/100%) = 1. In other words, if you see a pair matching this level, it gives you no useful info as to whether the pair is a match.
It looks like 0 pairs are falling into the MAJOR and ELSE levels. Again, this should lead to an odds of (0%/0%) = 1. But it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?
IDEALLY, what I am looking for in these match charts is, first, for there to be at least 100 pairs in each bin, to avoid small-sample-size issues. Second, I want an "inverse relationship" as you go between levels. Your address lines jaccard comparison has this property: ELSE is common among non-matches and rare among true matches; the other levels are common among true matches and rare among non-matches. This leads to odds that actually give you evidence towards a match or non-match.
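As a sketch of that "inverse relationship" (all rates below are made up for illustration, not read off the charts):

```python
# (level, share among true matches, share among non-matches)
levels = [
    ("JACCARD_50", 0.60, 0.01),  # common among matches, rare among non-matches
    ("JACCARD_25", 0.25, 0.04),
    ("ELSE",       0.15, 0.95),  # rare among matches, common among non-matches
]
for name, m, u in levels:
    print(f"{name}: odds = {m / u:.2f}")
# The high levels give odds well above 1 (evidence for a match);
# ELSE gives odds well below 1 (evidence against a match).
```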
So, I think you need to adjust your level criteria to better "spread out" the record pairs between the different levels.
> P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't it a bit counterintuitive?
Yes it is, that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.
> This can happen. Consider the level `_.name_l != "william shakespeare"`. This is going to be 100% for all record pairs, both for matches and non-matches. It's not a very useful level though, since it doesn't give you any evidence towards or against a match.
That makes sense. Just to double check my understanding: a comparer such that for each level the proportion of matches is about the same as the proportion of non-matches is quite useless, right?
> Are you confusing probability with odds? We are talking about odds here. Probability could be between 0 and 100%. Odds are totally unbounded. Have you read the glossary?
I am aware of the difference, but, since the chart is centered between 0.01 and 100, I was just wondering whether the resulting odds should mostly be contained in that interval.
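For intuition, the mapping between the two (a quick sketch of the standard conversion):

```python
def odds_to_prob(odds: float) -> float:
    # odds = p / (1 - p)  =>  p = odds / (1 + odds)
    return odds / (1 + odds)

print(odds_to_prob(0.01))  # ~0.0099: almost certainly a non-match
print(odds_to_prob(100))   # ~0.9901: almost certainly a match
print(odds_to_prob(1819))  # ~0.99945: even more certain
```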
> It looks like 0 pairs are falling into the MAJOR and ELSE levels. Again, this should lead to an odds of (0%/0%) = 1. But it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?
It seems that both levels contain no pairs. Here is my code:

```python
# `compared` contains all address pairs that I am taking into account

# Pairs in the MAJOR level
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0

# Pairs in the ELSE level
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) != _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0
```
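As a quick sanity check of the same EXACT/MAJOR/ELSE criteria outside of ibis, here is a plain-Python sketch (the sample postal codes are made up):

```python
from collections import Counter

def postal_level(left: str, right: str) -> str:
    # Same criteria as the ibis filters above
    if left == right:
        return "EXACT"
    if left[:2] == right[:2]:
        return "MAJOR"
    return "ELSE"

pairs = [("10115", "10115"), ("10115", "10243"), ("10115", "80331")]
counts = Counter(postal_level(l, r) for l, r in pairs)
print(counts)  # one pair per level
```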
And here is some more information available in the chart:
> IDEALLY, what I am looking for in these match charts is for there to be at least 100 pairs in each bin, to avoid small sample size issues.
Noted!
> Yes it is, that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.
I'll have a look at altair.
Thank you!
> a comparer such that for each level the proportion of matches is about the same as the proportion of non-matches is quite useless, right?
Exactly! However, it SOMETIMES could be useful, in an indirect way. Consider if in ~90% of records the name field is NULL. They are equally likely to be NULL amongst matches as non-matches, so if you see a NULL in a pair, it doesn't give you any evidence of match or non-match. Consider level configuration A:

```python
(_.name_l == _.name_r, "EXACT"),
# -> gives good evidence of a match
(True, "ELSE"),
# -> doesn't give good evidence: the vast majority of pairs fall into this
#    level, but they are equally likely to be matches as non-matches
```

vs configuration B:

```python
(_.name_l.isnull() | _.name_r.isnull(), "NULL"),
# -> the vast majority of pairs fall in here; again, this level doesn't give
#    much evidence either way
(_.name_l == _.name_r, "EXACT"),
# -> gives good evidence of a match, same as before
(True, "ELSE"),
# -> now THIS gives good evidence of a non-match! All the "low-information"
#    pairs were "filtered out" by the NULL level!
```

> It seems that both levels contain no pairs.
Yes, it looks like a very tiny number of pairs falls into each category, and so the calculated odds of 599.99... is probably not very accurate. As you understand now, aim for 100+ per bin.
EDIT: this actually looks like a large change; it would require changing the `LevelWeights` class to actually keep track of the number of pairs, instead of merely `u` and `m` as it does currently.
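A sketch of what that change might look like (hypothetical: `n_pairs` is not a real field of mismo's `LevelWeights`, and the class shape here is simplified for illustration):

```python
from dataclasses import dataclass

@dataclass
class LevelWeights:
    """Hypothetical extension: also track the raw number of observed pairs."""
    name: str
    m: float          # P(level | true match)
    u: float          # P(level | non-match)
    n_pairs: int = 0  # NEW: pairs that fell into this level, for the charts

    @property
    def odds(self) -> float:
        return self.m / self.u

w = LevelWeights("EXACT", m=0.75, u=0.25, n_pairs=12)
print(w.odds, w.n_pairs)  # 3.0 12
```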
Your screenshot reveals that we don't currently show the number of pairs. That would be good to do. I will try to fix this, or if you want to submit a PR for that, it would also be appreciated!
This thread is super useful for figuring out pain points, thank you for your effort here.
> Ideally I would like this info to be exposed
I am playing with mismo to deduplicate postal addresses in a set of about 10k entries. After the expectation-maximization step, the odds of half of the record pairs are equal to `10_000_000_000`, hence choosing the threshold to distinguish true matches from false ones is quite hard.

As I cannot share the underlying dataset, I will try to completely avoid sharing my (messy) code, and rather describe what I am doing. Hopefully it will be enough to get some hints from you.
Deduplication Steps
0) Each record has the following fields:

- `record_id`
- `recipients`: a string with the name of the recipient
- `recipients_metaphone`: the result of `double_metaphone()` on the field `recipients`
- `recipients_tokens`: a sequence of tokens obtained by splitting the `recipients` field using white spaces
- `address_lines`: a string
- `address_lines_tokens`: a sequence of tokens obtained by splitting the `address_lines` field using white spaces and discarding terms that appear in more than 5% of the dataset (using `mismo.arrays.array_filter_isin_other()` and `mismo.sets.rare_terms()`)
- `full_address`: the whole address (not including the recipient's name)
- `libpostal_address`: the address parsed by pypostal using `mismo.lib.geo.postal_parse_address()`
- `libpostal_fingerprint`: the list of fingerprints returned by pypostal using `mismo.lib.geo.postal_fingerprint_address()`
1) Blocking using the following rules:
The corresponding upset chart is as follows
2) Use the following comparators:

- `postal_code` field with levels: EXACT, MAJOR (same first two digits), ELSE
- `recipients_tokens` with the `jaccard` function and levels: JACCARD_50 (>= 0.5), JACCARD_25 (>= 0.25), JACCARD_10 (>= 0.1), JACCARD_02 (>= 0.02), ELSE
- `recipients_metaphone` with the `jaccard` function and levels: EXACT (jaccard >= 0.50), MAYBE (jaccard >= 0.2), ELSE
- `address_lines_tokens` with the `jaccard` function and levels: JACCARD_50 (>= 0.5), JACCARD_25 (>= 0.25), JACCARD_10 (>= 0.1), JACCARD_02 (>= 0.02), ELSE
- `libpostal_fingerprint` with levels: AT_LEAST_ONE, ELSE

The weights after running `mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000)` are as follows:

I am not an expert of the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for Postal Code and the ELSE level in 'Recipients Metaphone').
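One note on scale (my understanding of Fellegi-Sunter generally, not of mismo's internals): the final odds for a pair are typically the product of the per-comparer odds, under a conditional-independence assumption. With five comparers, individually modest odds multiply into huge totals, so totals far outside the per-level 0.01-100 range are expected. A toy sketch with made-up per-comparer odds:

```python
from math import prod

# Made-up odds contributed by each of the five comparers for one record pair
comparer_odds = [100, 80, 120, 50, 90]
total = prod(comparer_odds)
print(total)  # 4320000000 -> already in the billions
```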
3) Scoring the pairs leads to:

As you can see, the pairs in the left half of the chart all have odds equal to `10_000_000_000`. Also, the smallest value is `1819`, which is way above the "expected" (?) range between `0.01` and `100`.

Am I doing something obviously wrong?
P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't it a bit counterintuitive?