datamade / how-to


Deep embeddings for schema matching and record linkage #55

Closed: jeancochrane closed this issue 3 years ago

jeancochrane commented 4 years ago

Background

In their 2019 paper "Recognizing Variables from their Data via Deep Embeddings of Distributions", Jonas Mueller and Alex Smola use learned embeddings -- lower-dimensional representations of columns produced by neural networks -- to match columns across datasets and predict which columns refer to the same attributes (a task often called "schema matching").
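To make the idea concrete, here's a minimal sketch (not the authors' exact architecture) of what a column embedding could look like: encode each value in a column with a small network, pool across rows to get a fixed-size vector, and compare columns by cosine similarity. The layer sizes and dimensions are illustrative assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnEmbedder(nn.Module):
    def __init__(self, embedding_dim=32):
        super().__init__()
        # Encode each value independently; mean-pooling over rows makes the
        # column embedding invariant to row order (a DeepSets-style trick).
        self.value_encoder = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, embedding_dim)
        )

    def forward(self, values):
        # values: (n_rows, 1) numeric column -> (embedding_dim,) vector
        return self.value_encoder(values).mean(dim=0)

embedder = ColumnEmbedder()
col_a = torch.randn(500, 1)        # stand-ins for two numeric columns
col_b = torch.randn(500, 1) + 3.0
sim = F.cosine_similarity(embedder(col_a), embedder(col_b), dim=0)
print(f"cosine similarity between column embeddings: {sim.item():.3f}")
```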

Since schema matching is a task we've been interested in researching for a while, we could get real value out of investigating Mueller and Smola's approach. What's more, I suspect the approach may also transfer to the deduplication domain, allowing us to deduplicate and link much larger datasets. (For more detailed thoughts on this possibility, see my blog post on the paper.)

Proposal

I propose to investigate deep embeddings for schema matching and deduplication. I'd like to start by attempting to reproduce Mueller and Smola's results from their paper. Then, I'd like to see if learned embeddings can transfer to deduplication and record linkage in order to address the clustering problem.

Deliverables

Timeline

There's enough uncertainty in this R&D that it's hard to give a good estimate of how long it will take. My best guess is somewhere between three and six months.

I feel much more confident about reproducing Mueller and Smola's paper than about trying to extend Dedupe; the paper is at least a putatively solved problem, and the challenge will be getting it to work according to spec. Accordingly, I'll need more support for the work on Dedupe, and I'll try to validate that approach as quickly as possible so we can decide whether to abandon it.

jeancochrane commented 4 years ago

I'm going to put this on the back burner until I get to the part about researching new learning routines in #60.

jeancochrane commented 4 years ago

Picked this back up today! I re-read my blog post and the Experiments section of the paper, which describes the experimental methodology. I was able to get a simple Makefile up and running for downloading all of the datasets from OpenML: https://github.com/jeancochrane/schema-matching/commit/1e0fd3e6a2ea141541b401a1e9fceb98a4e9a3cc
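For anyone following along, the snippet below is a rough sketch of the equivalent download step using the `openml` Python package, not the exact commands from the commit; the dataset IDs are arbitrary placeholders.

```python
import openml

DATASET_IDS = [61, 31, 1461]  # placeholder sample of OpenML dataset IDs

for did in DATASET_IDS:
    dataset = openml.datasets.get_dataset(did)
    # get_data() returns the features, an optional target, a per-column
    # categorical indicator, and the attribute names.
    X, y, categorical, names = dataset.get_data()
    print(dataset.name, X.shape, names[:5])
```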

I only ran the Makefile for a small subset of the data, so the next step is to download the full set. Then I need to work on partitioning the data into train/test/validate sets, as per the paper. Once that's done, I'll be ready to start designing the neural nets.
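Here's roughly what I have in mind for the split, sketched in Python. Splitting at the dataset level and the 70/15/15 proportions are my own assumptions, not necessarily the paper's exact protocol:

```python
import random

def split_datasets(dataset_ids, train_frac=0.7, val_frac=0.15, seed=42):
    # Shuffle dataset IDs deterministically, then carve off contiguous
    # train/validate/test slices.
    ids = list(dataset_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "validate": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

splits = split_datasets(range(100))
print({name: len(ids) for name, ids in splits.items()})
```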

One open problem I ran into today is that it's not entirely clear how the authors divided up the OpenML datasets. They say they looked for features of the type string, language, or numeric:

We partition the OpenML data into disjoint groups of numeric, language and general-string datasets which are handled separately. This means we only consider matching numeric datasets with numeric datasets, language data with other language data, and strings with strings, as well as training separate embedding models for each data type.

However, these don't seem to match up with the data types offered by OpenML, which follow the ARFF standard and include numeric, nominal, string, and date. I need to figure out a mapping that will return partitions similar to the ones recorded in the paper.
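As a starting point, here's one guess at such a mapping. The language-vs-general-string heuristic (treat string columns whose values look like multi-word text as language data) is entirely mine; the authors may have done something different.

```python
def classify_column(arff_type, sample_values):
    """Map an ARFF attribute type (plus sample values) onto one of the
    paper's three groups, or None if it doesn't obviously fit any of them."""
    if arff_type == "numeric":
        return "numeric"
    if arff_type == "string":
        # Heuristic: columns whose values average several whitespace-separated
        # tokens look like natural language; otherwise treat them as general strings.
        avg_tokens = sum(len(v.split()) for v in sample_values) / max(len(sample_values), 1)
        return "language" if avg_tokens >= 3 else "general-string"
    # Not obvious where nominal and date columns belong, if anywhere.
    return None

print(classify_column("numeric", []))                                           # numeric
print(classify_column("string", ["the quick brown fox", "ran over the hill"]))  # language
print(classify_column("string", ["ID-001", "ID-002"]))                          # general-string
```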

jeancochrane commented 3 years ago

This would still be really fun, and I hope to pursue it someday. But I don't think it makes sense to keep this issue active beyond my tenure, so I'm closing it for now.