This epic lists the combinations of techniques that we want to explore for performing the FERC-EIA record linkage. The categories include:
Blocking Strategies
The blocking step dramatically reduces the number of pairs of records that need to be compared, making the problem computationally feasible. There are several parts:
String attribute embedding methods that are used to turn text like plant or utility names into numerical features (TF-IDF, word2vec, and fastText)
Tuple embedding methods that are used to combine distinct vectorized features into a single vector representing the whole tuple. One approach is to weight the individual feature vectors, with the relative weights (importance) either set a priori or learned. Another is to use a neural network to reduce the dimensionality of the combined feature vector. Options include seq2seq and autoencoders (TensorFlow example)
Choice of threshold: Once we've got the tuple embeddings, how do we pick the subset of records to compare to each other? E.g. each record's k-nearest neighbors (KNN), or every pair above some minimum similarity threshold, like cosine similarity >= 0.75.
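As a rough illustration of the embedding and threshold steps above, here is a minimal sketch in plain Python. It assumes character-trigram TF-IDF on a single name field, with no tuple embedding and no learned weights. The plant names, the function names (`char_ngrams`, `tfidf_vectors`, `candidate_pairs`) and the defaults are all made up for illustration and are not the project's implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lower-cased, space-padded string into overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per input string."""
    docs = [Counter(char_ngrams(s, n)) for s in strings]
    # Number of strings in which each n-gram appears at least once.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF, so n-grams that occur in every string keep a nonzero weight.
        vec = {
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            for gram, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


def candidate_pairs(left, right, threshold=0.75, k=None):
    """Return (left_index, right_index, similarity) candidate pairs.

    If k is given, only each left record's k most similar right records are
    considered. If threshold is given, pairs scoring below it are dropped.
    """
    vecs = tfidf_vectors(list(left) + list(right))
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for i, u in enumerate(left_vecs):
        scored = sorted(((cosine(u, v), j) for j, v in enumerate(right_vecs)), reverse=True)
        if k is not None:
            scored = scored[:k]
        pairs.extend(
            (i, j, score) for score, j in scored
            if threshold is None or score >= threshold
        )
    return pairs


# Made-up plant names, purely for illustration.
ferc_names = ["Pine Hollow Wind Farm", "Cedar Ridge Solar"]
eia_names = ["Cedar Ridge Solar Project", "Pine Hollow Wind", "Granite Falls Hydro"]

for i, j, score in candidate_pairs(ferc_names, eia_names, threshold=None, k=1):
    print(f"{ferc_names[i]!r} -> {eia_names[j]!r} (cosine={score:.2f})")
```

The brute-force loop compares every left record to every right record, which is fine for a toy example. At real scale the nearest-neighbor search would need an index or sparse matrix operations instead.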
There's also old-school rule-based blocking, where we pick heuristics that split up the records along reasonable lines (e.g. only compare records from the same report year or state). This can potentially be used in combination with the above strategies as a pre-filter.
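A minimal sketch of what such a rule-based pre-filter could look like is below. The field names (`id`, `report_year`, `state`) and the records are hypothetical and not taken from the actual FERC or EIA tables. The idea is only that candidate pairs are generated within groups that share the same blocking key.

```python
from collections import defaultdict


def block_by_key(left, right, key_fields=("report_year", "state")):
    """Yield (left_id, right_id) pairs whose blocking key fields all match."""
    # Index the right-hand records by blocking key so each left record
    # only has to look inside its own block.
    right_blocks = defaultdict(list)
    for rec in right:
        right_blocks[tuple(rec[f] for f in key_fields)].append(rec["id"])
    for rec in left:
        key = tuple(rec[f] for f in key_fields)
        for right_id in right_blocks.get(key, []):
            yield rec["id"], right_id


# Hypothetical records, for illustration only.
ferc = [
    {"id": "f1", "report_year": 2020, "state": "CO"},
    {"id": "f2", "report_year": 2020, "state": "TX"},
    {"id": "f3", "report_year": 2021, "state": "CO"},
]
eia = [
    {"id": "e1", "report_year": 2020, "state": "CO"},
    {"id": "e2", "report_year": 2020, "state": "CO"},
    {"id": "e3", "report_year": 2021, "state": "TX"},
    {"id": "e4", "report_year": 2021, "state": "CO"},
]

blocked_pairs = list(block_by_key(ferc, eia))
print(f"{len(blocked_pairs)} candidate pairs instead of {len(ferc) * len(eia)}")
```

One caveat with exact-match keys is that a typo or missing value in a blocking field silently excludes true matches. That is part of why this works better as a pre-filter alongside the embedding-based strategies than on its own.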
Record Linkage Models
These operate on the subset of record pairs that were identified as potential matches in the blocking step. The options we're exploring are:
Splink: a probabilistic record linkage library based on the Fellegi-Sunter model, which can be trained with or without labeled data.
Probabilistic Graphical Models (PGMs), which can be used to find a consensus among several noisy labeling functions (aka weak supervision).
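To make the weak supervision idea more concrete, here is a small sketch of combining noisy labeling functions with a naive Bayes style model, which is about the simplest graphical model for this job. Everything in it is an assumption for illustration: the labeling functions, the field names (`name`, `capacity_mw`, `fuel`), the records, and especially the per-function accuracies, which are hard-coded here. A real label model would estimate them from how the functions agree and disagree.

```python
import math

MATCH, NON_MATCH, ABSTAIN = 1, 0, -1


def lf_capacity_close(a, b):
    """Vote MATCH if capacities are within 5% of each other, else NON_MATCH."""
    cap_a, cap_b = a.get("capacity_mw"), b.get("capacity_mw")
    if cap_a is None or cap_b is None or max(cap_a, cap_b) == 0:
        return ABSTAIN
    rel_diff = abs(cap_a - cap_b) / max(cap_a, cap_b)
    return MATCH if rel_diff <= 0.05 else NON_MATCH


def lf_same_fuel(a, b):
    """Vote on whether the two records report the same fuel type."""
    if not a.get("fuel") or not b.get("fuel"):
        return ABSTAIN
    return MATCH if a["fuel"] == b["fuel"] else NON_MATCH


def lf_name_token_overlap(a, b):
    """Vote based on the Jaccard overlap of the name tokens."""
    tokens_a = set(a.get("name", "").lower().split())
    tokens_b = set(b.get("name", "").lower().split())
    if not tokens_a or not tokens_b:
        return ABSTAIN
    jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    if jaccard >= 0.5:
        return MATCH
    return NON_MATCH if jaccard == 0 else ABSTAIN


def match_probability(votes, accuracies, prior=0.5):
    """Posterior P(match | votes), assuming the labeling functions are
    conditionally independent given the true label, each is right with
    probability `accuracy`, and abstaining is unrelated to the true label."""
    log_odds = math.log(prior / (1 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1 - acc))
        log_odds += weight if vote == MATCH else -weight
    return 1 / (1 + math.exp(-log_odds))


LABELING_FUNCTIONS = [lf_capacity_close, lf_same_fuel, lf_name_token_overlap]
ASSUMED_ACCURACIES = [0.8, 0.6, 0.9]  # made-up values, not estimated from data

# Hypothetical candidate pair that survived blocking.
ferc_rec = {"name": "Pine Hollow Wind Farm", "capacity_mw": 150.0, "fuel": "wind"}
eia_rec = {"name": "Pine Hollow Wind", "capacity_mw": 148.5, "fuel": "wind"}

votes = [lf(ferc_rec, eia_rec) for lf in LABELING_FUNCTIONS]
print(votes, round(match_probability(votes, ASSUMED_ACCURACIES), 3))
```

The posterior is a weighted vote: more accurate labeling functions move the log-odds further, and abstentions leave it unchanged.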