NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/

Add active learning API for predicate-based blocking and matching #54

Open · jstammers opened this issue 3 months ago

jstammers commented 3 months ago

The active learning method that dedupe implements to learn a minimal set of predicates to block records can be very useful, particularly when the user has little prior knowledge of an appropriate set of blocking rules. In my own usage, I've found that it can be tricky to balance the recall and precision when blocking on a single feature, e.g. zip code. More complex predicates would most likely help reduce the number of candidate pairs, so having a semi-supervised method to learn them could reduce the manual work required.

In practice, I have found that dedupe's implementation scales quite poorly - I frequently hit memory bottlenecks when blocking datasets of more than 10k rows. I'm hoping that by using duckdb via ibis we may be able to do something more performant.

At a high level, the active learner works as follows:
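
Roughly speaking (this is an illustrative, self-contained sketch of a dedupe-style loop; none of the helper names below are dedupe's or mismo's real API), the learner repeatedly asks the user to label the candidate pair the current model is least sure about, refits on the labels gathered so far, and finally picks blocking predicates that cover the labelled matches:

```python
# Toy sketch of a dedupe-style active learning loop.  Everything here is a
# stand-in: real dedupe learns blocking predicates via a set-cover-style search
# over the labelled matches, and mismo would generate pairs in the backend.
def featurize(a: dict, b: dict) -> list[float]:
    # crude per-field exact-agreement features
    return [float(a[k] == b[k]) for k in a]

def score(features: list[float], weights: list[float]) -> float:
    # stand-in for p(match | features) from whatever classifier is being fit
    return sum(w * f for w, f in zip(weights, features)) / max(len(features), 1)

def fit(labelled: list[tuple[list[float], bool]]) -> list[float]:
    # stand-in for refitting the classifier on the labels gathered so far
    n_features = len(labelled[0][0])
    return [
        sum(f[i] for f, y in labelled if y) - sum(f[i] for f, y in labelled if not y)
        for i in range(n_features)
    ]

def active_learning_loop(candidate_pairs, ask_user, n_queries=20):
    weights = [0.0] * len(featurize(*candidate_pairs[0]))
    labelled: list[tuple[list[float], bool]] = []
    for _ in range(n_queries):
        # query the pair the current model is most uncertain about (score near 0.5)
        a, b = min(candidate_pairs, key=lambda p: abs(score(featurize(*p), weights) - 0.5))
        labelled.append((featurize(a, b), ask_user(a, b)))
        weights = fit(labelled)
    # blocking predicates are then chosen to cover (almost) all labelled matches
    return weights, labelled
```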

If we were to implement something like this, I think it's worth ensuring this can be done using the current mismo API.

For example, I can see how the candidate pairs could be blocked using a UnionBlocker and appropriately defined ConditionBlockers. Similarly, the features for the MatchLearner could be generated using a set of LevelComparers, and the learner would then learn to predict p(match | comparisons).
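
To make that concrete without guessing at mismo's exact signatures, here is roughly what a union of condition-based blockers plus level-style comparison features boils down to in plain DuckDB SQL (the table, columns, and values are made up):

```python
# Illustration only: what a UnionBlocker over two ConditionBlockers, followed by
# LevelComparer-style features, roughly compiles down to.  Not mismo's API.
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE recs AS
    SELECT * FROM (VALUES
        (1, 'smith', '99501'),
        (2, 'smith', '99502'),
        (3, 'jones', '99501')
    ) t(record_id, last_name, zip)
""")

con.sql("""
    CREATE VIEW pairs AS
    -- union of the pair sets produced by each blocking condition
    SELECT l.record_id AS id_l, r.record_id AS id_r
    FROM recs l JOIN recs r ON l.zip = r.zip AND l.record_id < r.record_id
    UNION
    SELECT l.record_id, r.record_id
    FROM recs l JOIN recs r ON l.last_name = r.last_name AND l.record_id < r.record_id
""")

features = con.sql("""
    -- LevelComparer-style discrete agreement levels, one column per field,
    -- which would then feed whatever learns p(match | comparisons)
    SELECT p.id_l, p.id_r,
           CASE WHEN l.last_name = r.last_name THEN 2
                WHEN substr(l.last_name, 1, 1) = substr(r.last_name, 1, 1) THEN 1
                ELSE 0 END AS last_name_level,
           CAST(l.zip = r.zip AS INT) AS zip_agrees
    FROM pairs p
    JOIN recs l ON l.record_id = p.id_l
    JOIN recs r ON r.record_id = p.id_r
""")
print(features)
```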

It's not immediately clear to me how best to decide the predicates for blocking based on labelled examples, but I expect that could be done by keeping track of the blocking rules. How efficient this is when we have many blockers remains to be seen.

I like the idea of leaving the predicates and features open to the user, as dedupe's approach of statically defining the predicates based on data type makes it hard for me to understand where the performance bottlenecks are for a given matching/linking task.

NickCrews commented 3 months ago

(Note: I am trying to use the term "blocking key" instead of "predicate". I never liked the term "predicate" from dedupe; to me, "predicate" has a very specific meaning: a function that returns a boolean. I want to phrase this process more in terms of table joins, so "join key", "blocking key", etc. make more sense to me.)

In general, I am very much in favor of improving the blocking user experience. Manually defining blocking keys and combos of keys feels way too arbitrary and not data-informed.

> I've found that it can be tricky to balance the recall and precision when blocking on a single feature, e.g. zip code

Would this proposal actually solve this problem? Or is it just inherent in your dataset, e.g. there just isn't a whole lot of useful info we could use? Can you manually "cherry-pick" a gold-standard selection of predicates that does get you the performance you are looking for? Then perhaps we can reverse engineer from there, approaching the problem as "what algorithm could we come up with that would be able to find this set of predicates out of the total search space?" I want to take a step back and really think about the problem holistically before we re-implement what dedupe did (which is fricking neat algorithmically, but I'm intimidated by its complexity; I would love to avoid it if possible).

I would also love to avoid any active or supervised learning; any unsupervised method will get a lot more traction from me. It seems like we could do something similar with an expectation maximization algorithm here???

For example, I'm curious if there is a much simpler algorithm (rough code sketch after the list):

  1. Consider all the possible atomic blocking keys you could use. Generate all possible compound blocking keys from these.
  2. Remove any candidates that generate "too many" (configurable, but something like 1M by default?) pairs. This works in O(N) time for each candidate using KeyBlocker.pair_counts(). We can start with very complex compound keys, like (first name, last name, zipcode), and if this yields too many pairs, then we know that all "more simple" keys such as (first name, last name) will also yield too many pairs, so we can cut down our work here.
  3. With the remaining candidates, generate sample_size (100k by default?) pairs for each key. Given a user-supplied PScorer (a specific PComparer that adds a p_match column; the user is gonna need to define this thing anyways for downstream steps, so it shouldn't be additional work?), we calculate some metrics of how "good" this compound blocking key is (precision, n_pairs, execution time, other ones?) and possibly discard it.
  4. The candidate compound keys that are left are the ones we use.
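
Here is a self-contained sketch of that flow over plain Python records (in reality the counting and sampling would stay in the backend, e.g. via KeyBlocker.pair_counts(); `score_pair` stands in for the user-supplied PScorer and is hypothetical):

```python
# Sketch only: prune compound blocking keys by pair count, then grade the
# survivors with a user-supplied pair scorer.  All in-memory Python; mismo
# would do the counting and sampling in the database instead.
from collections import Counter
from itertools import combinations, islice

def n_pairs(records: list[dict], key: tuple[str, ...]) -> int:
    """Within-block pair count for blocking on `key`; O(N) per candidate."""
    counts = Counter(tuple(r[c] for c in key) for r in records)
    return sum(c * (c - 1) // 2 for c in counts.values())

def nominate_keys(records, atomic_cols, max_pairs=1_000_000, max_len=3):
    too_big: set[frozenset] = set()
    survivors = []
    # Work from the most specific compound keys downward: if (a, b, c) already
    # yields too many pairs, any coarser subset like (a, b) can only yield more.
    for k in range(max_len, 0, -1):
        for key in combinations(atomic_cols, k):
            cols = frozenset(key)
            if any(cols <= bad for bad in too_big) or n_pairs(records, key) > max_pairs:
                too_big.add(cols)
            else:
                survivors.append(key)
    return survivors

def key_quality(records, key, score_pair, sample_size=100_000):
    """Crude precision proxy: mean p_match over a sample of the key's pairs."""
    blocks: dict[tuple, list[dict]] = {}
    for r in records:
        blocks.setdefault(tuple(r[c] for c in key), []).append(r)
    pairs = (p for block in blocks.values() for p in combinations(block, 2))
    sample = list(islice(pairs, sample_size))
    return sum(score_pair(a, b) for a, b in sample) / max(len(sample), 1)
```

Step 4 is then just ranking the surviving keys by those metrics and keeping the best few.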

> I like the idea of leaving the predicates and features open to the user

Yes, there should be a BlockingKeyNominator protocol that is something like ibis.Table -> Iterable[ir.Column], and we can have some built-in nominators as dedupe does, but users must be able to make their own for their domain-specific task.
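
Something along these lines (one possible shape of the protocol, not an existing mismo class; the example nominator is purely illustrative):

```python
# Possible shape of the proposed protocol plus one toy built-in nominator.
from typing import Iterable, Protocol

import ibis
import ibis.expr.types as ir

class BlockingKeyNominator(Protocol):
    """Proposes candidate blocking key expressions for a table."""
    def __call__(self, table: ibis.Table) -> Iterable[ir.Column]: ...

def string_prefix_nominator(table: ibis.Table) -> Iterable[ir.Column]:
    """Toy nominator: the first 3 characters of every string column."""
    for name, dtype in table.schema().items():
        if dtype.is_string():
            yield table[name].substr(0, 3).name(f"{name}_prefix3")
```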

jstammers commented 3 months ago

Thanks for the feedback @NickCrews. I've been progressing further with the dataset I'm working on at the moment and am reaching similar conclusions to you about this approach.

> I want to take a step back and really think about the problem holistically before we re-implement what dedupe did (which is fricking neat algorithmically, but I'm intimidated by its complexity; I would love to avoid it if possible).

After looking into Dedupe's API in more detail, I agree that while its method is very sophisticated, it is somewhat intimidating in its complexity. I also wonder how well it would scale to large datasets, which I see as one of the main attractions of mismo.

> For example, I'm curious if there is a much simpler algorithm

The first two steps you've described are more-or-less what I've done manually. In my particular case, there were some records with the value "Deprecated" that massively increased the number of potential pairs, which would otherwise have been more manageable. I can fix this with some preprocessing to nullif that column, but if doing this algorithmically, a UserWarning might be helpful in cases where the pair_counts is heavily skewed towards one value that ought to be excluded.
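
For example, surfacing that kind of skew and nulling out the sentinel might look like this (file and column names are made up):

```python
# Illustration: find key values that would dominate the pair count, then null
# out the offending sentinel so it never forms a block.  Names are hypothetical.
import ibis
from ibis import _

t = ibis.read_csv("records.csv")

# Per-value contribution to the number of within-block pairs: n * (n - 1) / 2.
skew = (
    t.group_by("zip")
    .agg(n=_.count())
    .mutate(pairs=_.n * (_.n - 1) / 2)
    .order_by(ibis.desc("pairs"))
)
print(skew.limit(5).execute())  # a 'Deprecated'-style sentinel shows up at the top

# Preprocess so that value can no longer join on equality.
t = t.mutate(zip=t.zip.nullif("Deprecated"))
```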

I like the idea of using a PScorer to quantify the match likelihood. If this is something like an FS model trained using EM or labelled data (or some other probabilistic model), then this makes a lot of sense to me. I'm not sure what metrics would be appropriate in an unsupervised case since, e.g., precision requires labelled positive/negative matches. Some summary statistics on the distribution of p_match seem like a good starting point. Perhaps something like the Gini impurity could be used to measure how well a given blocker separates the pairs into groups of high/low match probabilities, but I need to think about this a little further.
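
As a toy version of that idea, thresholding p_match and computing the two-class Gini impurity (the threshold and the interpretation here are just to make the suggestion concrete):

```python
# Toy separation metric: Gini impurity of thresholded p_match values.
# 0.0 means the blocker's pairs are all on one side; 0.5 means an even mix.
# Caveat: on its own it can't distinguish "all matches" from "all non-matches".
def gini_impurity(p_matches: list[float], threshold: float = 0.5) -> float:
    if not p_matches:
        return 0.0
    p_hi = sum(p >= threshold for p in p_matches) / len(p_matches)
    return 1.0 - (p_hi**2 + (1.0 - p_hi) ** 2)

gini_impurity([0.9, 0.95, 0.05, 0.1])   # 0.5: even split, worst separation
gini_impurity([0.9, 0.95, 0.85, 0.99])  # 0.0: all pairs look like matches
```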

> I would also love to avoid any active or supervised learning; any unsupervised method will get a lot more traction from me. It seems like we could do something similar with an expectation maximization algorithm here???

I'd be interested to get your thoughts on this in general. I've been working a little bit on using an active learning method to generate pairs of records that a logistic regression model can be trained on. The model itself is fairly easy to implement using mismo's API - I can define a DistanceComparer that returns a continuous variable measuring the distance between two fields, and a LogisticWeights class that can serialise the coefficients of the logistic regression model into a JSON object. My main goal here is to have a method of classifying matches that can use continuous variables, because I'd rather avoid the manual work of defining the discrete levels. Since logistic regression is very closely related to naive Bayes (which is essentially what the FS model is, from my understanding), this seemed like a good approach to take. For unsupervised methods, k-means seems like a good starting point.
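
For what it's worth, the scoring/serialisation side of that can be tiny; this is only a sketch of the shapes I have in mind, with made-up coefficients, not an existing mismo API:

```python
# Sketch: logistic model over continuous field distances, with coefficients
# round-tripped through JSON.  Names and numbers are illustrative only.
import json
import math

def p_match(distances: dict[str, float], weights: dict[str, float], intercept: float) -> float:
    """p(match) = sigmoid(intercept + sum_i w_i * distance_i)."""
    z = intercept + sum(weights[k] * d for k, d in distances.items())
    return 1.0 / (1.0 + math.exp(-z))

# What a LogisticWeights-style object might serialise to:
blob = json.dumps({"intercept": 4.0, "weights": {"name_distance": -6.0, "zip_distance": -2.0}})
model = json.loads(blob)

# Small distances => high match probability (distance coefficients are negative).
p_match({"name_distance": 0.1, "zip_distance": 0.0}, model["weights"], model["intercept"])
```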

I've tried out some active learning using modal-python, but this, along with a labelling UI, introduces some complexity that might be better handled elsewhere.

This feels like a separate but related topic, so I'm happy to split it out into a separate issue if you'd prefer to keep this one focused on blocking.