NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Add ability to sample from blocked pairs when training an FS model #44

Open jstammers opened 2 weeks ago

jstammers commented 2 weeks ago

The Fellegi-Sunter model calculates match weights by comparing the probability of a variable taking a given value among known matching pairs against the probability among randomly sampled pairs.

When using this model to evaluate the likelihood of candidate pairs being a match after blocking, this can result in biased estimates, particularly if the blocked pairs are more similar to each other than two pairs chosen at random would be.

For example, if blocking on a postcode, it is quite likely that two addresses will be fairly similar even if they are distinct (e.g. same street name, different street number). Without properly accounting for this, an FS model could over-estimate the importance of the street name being the same and produce inaccurate matching odds.

It would be useful to have a mechanism to sample only from blocked pairs when training an FS model, so that the sampled pairs have distributions closer to what would be expected of negative matches when using this model to infer matches after blocking.

When using labelled known matches, if we sample from blocked pairs and assume that pairs that don't share a record_id correspond to negative matches, it would also be possible to use this dataset to train supervised classification models, such as SVMs, boosted decision trees, etc.
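To make the estimation concrete, here is a minimal pure-Python sketch (all names invented for illustration, not mismo's API) of computing Fellegi-Sunter m/u probabilities from pairs sampled within blocks, labelling a pair as a match iff both rows share a record_id:

```python
import math

def estimate_weights(pairs):
    """Hypothetical sketch: estimate FS agreement weights from labelled pairs.

    pairs: list of (record_id_l, record_id_r, agrees) tuples, where
    `agrees` is True if the compared field has the same value on both sides.
    """
    matches = [p for p in pairs if p[0] == p[1]]
    non_matches = [p for p in pairs if p[0] != p[1]]
    # m = P(agreement | match), u = P(agreement | non-match)
    m = sum(p[2] for p in matches) / len(matches)
    u = sum(p[2] for p in non_matches) / len(non_matches)
    # log2 Bayes factors for agreement and disagreement
    return math.log2(m / u), math.log2((1 - m) / (1 - u))

# Toy data: within a postcode block, non-matches agree on street name more
# often than random pairs would, which inflates u and shrinks the weight.
pairs = [
    (1, 1, True), (2, 2, True), (3, 3, False), (4, 4, True),   # matches
    (1, 2, True), (1, 3, False), (2, 4, False), (3, 4, False), # non-matches
]
agree_w, disagree_w = estimate_weights(pairs)
```

The point of sampling within blocks is exactly that u here reflects the agreement rate among blocked non-matches, not among arbitrary random pairs.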

NickCrews commented 2 weeks ago

Thanks for the issue! To be honest, I haven't been using the FS model at all in my production uses, I have just been doing deterministic rules. So that is the reason this bit of mismo isn't very polished yet. I probably won't put too much effort in here at this time, but if you put in the work I am happy to review.

But as background: splink is also concerned with this. We sample pairs differently from splink: they use blocking rules, as you suggest, to create pairs, while we just use _train.sample_all_pairs(left, right, max_pairs=max_pairs). So I think one option is to just emulate splink.

Other things that splink does that are interesting: splink doesn't require your training blocking rules to be the exact same as what you use for final comparison. They just suggest "The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the comparisons." They also take care of your concern: if you use postcode to block, then the Linker won't update the u and m params for any variables that use the postcode column.
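That safeguard is simple to sketch in plain Python (this is not splink's actual API, just an illustration of the idea): when training with a blocking rule on a column, skip the m/u updates for any comparison that reads that column.

```python
# Hypothetical illustration of the safeguard, with invented names:
# comparisons declare which columns they read, and anything overlapping
# the blocking columns is excluded from parameter training.
blocking_columns = {"postcode"}
comparisons = {
    "name_exact": {"columns": {"given_name", "surname"}},
    "postcode_exact": {"columns": {"postcode"}},
}
trainable = {
    name: cfg
    for name, cfg in comparisons.items()
    if not (cfg["columns"] & blocking_columns)
}
```

Here only "name_exact" would have its m and u parameters updated from pairs generated by the postcode blocking rule.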

IDK, I don't think I quite grok this problem all the way, but it feels like there isn't some golden way to do it. All possible methods add bias one way or another.

Have you used splink before? Looked at their code at all?

jstammers commented 5 days ago

I've used splink a little, but I wasn't overly familiar with some of the key concepts in Entity Resolution and therefore didn't spend much time with it.

The code there seems pretty straightforward to follow and I should have some time over the next few weeks to try to implement a method that will allow blocking rules to be used to (optionally) generate samples of pairs.

Excluding variables that make use of blocked columns might be a little trickier, as this would require a way of inferring the columns that are used by each comparer. I guess these could be parsed from the SQL that's generated for a LevelComparer, for example, but hopefully there's a cleaner way to do this.

NickCrews commented 4 days ago

I haven't totally thought about this, but in the extreme case, which I think I want to support, a user could use a Python UDF to implement a LevelComparer. This might get in the way of inferring which columns are used. IDK, splink has it nice because they only ever deal with SQL; we might have to deal with nastier things. At the very least, I think there might need to be an API in there to explicitly choose which levels/columns not to update. Keep me in the loop if you go for it, I don't want you to sink a bunch of time into an implementation I'm never gonna merge. Thank you!
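One shape such an explicit API could take (every name here is hypothetical, none of it exists in mismo): let each comparer optionally declare the columns it reads, so training can skip comparers that overlap the blocking key even when the comparison function is an opaque Python UDF that SQL parsing can't see into.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an explicit column-declaration API.
@dataclass(frozen=True)
class LevelComparer:
    name: str
    func: Callable          # possibly an opaque Python UDF
    columns: frozenset = frozenset()  # inputs, declared by the user

def trainable_comparers(comparers, blocking_columns):
    """Keep only comparers whose declared columns avoid the blocking key."""
    blocked = frozenset(blocking_columns)
    return [c for c in comparers if not (c.columns & blocked)]

comparers = [
    LevelComparer("street_name", lambda l, r: l == r, frozenset({"street"})),
    LevelComparer("postcode", lambda l, r: l == r, frozenset({"postcode"})),
]
kept = trainable_comparers(comparers, ["postcode"])
```

An explicit declaration like this sidesteps inference entirely, at the cost of trusting the user to list the columns correctly.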