The entity matching process breaks down into two steps: blocking and matching.
Blocking
After cleaning and standardizing the data in both the FERC and EIA datasets, we perform a process called blocking, in which we remove record pairs that are unlikely to be matched from the candidate set of record pairs. This reduces computational complexity and boosts model performance, as we no longer need to evaluate n x n candidate match pairs and instead only evaluate a set of record pairs that are more likely to be matched. The goal of blocking is to create a set of candidate record pairs that is as small as possible while still containing all correctly matched pairs.
Rule Based Blocking
The simplest way we tried to create blocks of candidate pairs is rule-based blocking. This involves creating a set of heuristics that, when applied disjunctively, create "blocks" of record pairs that together form a complete candidate set. This approach proved too simple for our problem: it was difficult to capture all of the training data matches without creating a very large candidate set.
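As a concrete sketch, disjunctive blocking rules can be implemented as a union of pandas merges. The column names and records below are hypothetical stand-ins for the cleaned FERC and EIA tables, and the two rules are illustrative only:

```python
import pandas as pd

# Hypothetical FERC and EIA records; the real tables have different columns.
ferc = pd.DataFrame({
    "record_id": ["f1", "f2", "f3"],
    "plant_name": ["barry", "big bend", "colstrip"],
    "utility_id": [195, 189, 144],
    "report_year": [2020, 2020, 2021],
})
eia = pd.DataFrame({
    "record_id": ["e1", "e2", "e3"],
    "plant_name": ["barry", "big bend", "comanche"],
    "utility_id": [195, 189, 160],
    "report_year": [2020, 2021, 2021],
})

def rule_based_blocks(ferc, eia):
    """Disjunction of two blocking rules: same utility and report year,
    OR same first three characters of the plant name."""
    ferc = ferc.assign(name_prefix=ferc.plant_name.str[:3])
    eia = eia.assign(name_prefix=eia.plant_name.str[:3])
    block_a = ferc.merge(eia, on=["utility_id", "report_year"],
                         suffixes=("_ferc", "_eia"))
    block_b = ferc.merge(eia, on="name_prefix", suffixes=("_ferc", "_eia"))
    # Concatenating the blocks and dropping duplicates takes the union,
    # which is what makes the rules disjunctive.
    pairs = pd.concat([block_a, block_b])[["record_id_ferc", "record_id_eia"]]
    return pairs.drop_duplicates().reset_index(drop=True)

candidates = rule_based_blocks(ferc, eia)
```

Here the candidate set has 2 pairs rather than the 9 pairs of the full cross product; with weak rules the union grows quickly, which is the failure mode described above.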
It's worth noting that the output of rule based blocking can be combined with the output of an embedding vector approach described below to increase recall, while increasing the blocking output size only modestly (Thirumuruganathan, Li).
Embedding Vectors for Blocking
Instead of creating heuristics for blocking, we can create embedding vectors that represent the tuples in the FERC and EIA datasets and find the most similar pairs of embedding vectors to create a candidate set. This process involves three main steps.
Attribute Embedding: For each tuple t in the FERC and EIA datasets, compute an embedding vector for each attribute (column) in t.
Tuple Embedding: Combine each attribute embedding vector into one embedding vector for the tuple t.
Vector Pairing: Find similar vector pairs from the FERC and EIA datasets using a similarity metric and add the tuple pairs represented by these embedding vectors to the candidate set.
Attribute Embedding
There are multiple methods for embedding the string value attributes of the tuples.
TF-IDF:
Word Embeddings (word2vec, GloVe):
can be trained on the domain instead of pre-trained
Character Level (fastText) or Sub-Word Embedding (bi-grams):
can handle similarities in words like "generator", "generation"
better handles typos
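As an illustration of the TF-IDF option, an attribute embedding over character n-grams can be built with scikit-learn's TfidfVectorizer. The plant names below are toy values, and fitting a single vectorizer over both datasets is an assumption made here so the two sides share one vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy plant-name attribute values; real data comes from the cleaned tables.
ferc_names = ["barry generator 1", "big bend steam", "colstrip unit 3"]
eia_names = ["barry gen 1", "big bend steam plant", "comanche 3"]

# Character n-grams (rather than whole words) also capture sub-word
# similarity ("generator" vs. "generation") and tolerate typos.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(ferc_names + eia_names)
ferc_vecs = vectorizer.transform(ferc_names)
eia_vecs = vectorizer.transform(eia_names)

# Sanity check: "barry generator 1" should be far more similar to
# "barry gen 1" than to "comanche 3".
sims = cosine_similarity(ferc_vecs, eia_vecs)
```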
The numeric attributes can be normalized within each column. (or should they go through the same embedding process as the string columns? in the case of TF-IDF does it matter if the numeric columns aren't on the same scale as the string columns?)
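One simple option for the numeric attributes is per-column min-max scaling before they are appended to the tuple embedding, so that no attribute dominates distance computations purely by virtue of its units. The values below are made up for illustration:

```python
import numpy as np

# Hypothetical numeric attributes: capacity (MW), net generation (MWh).
numeric = np.array([
    [55.0, 120_000.0],
    [445.0, 1_900_000.0],
    [778.0, 3_600_000.0],
])

# Min-max normalize each column independently to the [0, 1] range.
col_min = numeric.min(axis=0)
col_max = numeric.max(axis=0)
normalized = (numeric - col_min) / (col_max - col_min)
```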
Tuple Embedding
Equal Weight Aggregation: Attribute embeddings are averaged together into one tuple embedding.
Weighted Aggregation: A weighted average is used to combine the attribute embeddings together into one tuple embedding. The weights of the attribute embeddings can optionally be learned.
Note: With aggregation methods, order is not considered: "Generator 10" has the same embedding as "10 Generator" (could be good or bad)
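The two aggregation schemes amount to a few lines of NumPy. The attribute embeddings and weights below are hypothetical; in practice the weights could be learned rather than fixed by hand:

```python
import numpy as np

# Hypothetical attribute embeddings for one tuple: three attributes,
# each embedded into a 4-dimensional vector.
attr_embeddings = np.array([
    [0.2, 0.1, 0.0, 0.7],  # plant name
    [0.5, 0.5, 0.0, 0.0],  # utility name
    [0.0, 0.9, 0.1, 0.0],  # fuel type
])

# Equal weight aggregation: a simple mean over the attributes.
equal_weight = attr_embeddings.mean(axis=0)

# Weighted aggregation: here the weights are fixed by hand to
# emphasize the plant name; they could instead be learned.
weights = np.array([0.6, 0.3, 0.1])
weighted = weights @ attr_embeddings  # sum_i w_i * e_i
```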
Self-Reproduction (autoencoder or seq2seq): Roughly speaking, these methods take a tuple t, feed it into a neural network (NN) to output a compact embedding vector u<sub>t</sub>, such that if we feed u<sub>t</sub> into a second NN, we can recover the original tuple t (or a good approximation of t). If this happens, u<sub>t</sub> can be viewed as a good compact summary of tuple t, and can be used as the tuple embedding of t.
Vector Pairing
For all combinations of attribute and tuple embedding methods, we will use k-nearest neighbors (KNN) with cosine similarity to choose the vector pairs that enter the candidate set.
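The vector pairing step can be sketched with scikit-learn's NearestNeighbors, which supports cosine distance. The embeddings, dataset sizes, and choice of k below are all placeholders:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical tuple embeddings (one row per record).
rng = np.random.default_rng(0)
ferc_emb = rng.normal(size=(100, 16))
eia_emb = rng.normal(size=(120, 16))

# For each FERC tuple, retrieve its k most cosine-similar EIA tuples;
# the union of these pairs forms the candidate set.
k = 5
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(eia_emb)
distances, indices = nn.kneighbors(ferc_emb)

candidate_pairs = [(f, e) for f, row in enumerate(indices) for e in row]
```

Retaining only the top k pairs per record gives direct control over the candidate set size, unlike rule-based blocking.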
Evaluation Metric
Reduction Ratio (percentage that the candidate set has been reduced from n x n comparisons)
Pairs Completeness (percentage of matching record pairs contained within the reduced comparison space after blocking)
Harmonic Mean of Reduction Ratio and Pairs Completeness: (2 x RR x PC) / (RR + PC)
These metrics work best for a rule-based blocking method, where you can't adjust the size of the candidate set. Metrics should also be included for blocking runs where the Vector Pairing step retains only the k most similar vector pairs.
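For reference, all three metrics can be computed directly from the candidate set and the labeled matches. The counts in the toy example below are invented:

```python
def blocking_metrics(n_ferc, n_eia, candidate_pairs, true_pairs):
    """Reduction ratio (RR), pairs completeness (PC), and their harmonic mean."""
    candidate_pairs, true_pairs = set(candidate_pairs), set(true_pairs)
    # RR: fraction of the n x n comparison space eliminated by blocking.
    rr = 1 - len(candidate_pairs) / (n_ferc * n_eia)
    # PC: fraction of known matches that survive blocking.
    pc = len(candidate_pairs & true_pairs) / len(true_pairs)
    hm = 2 * rr * pc / (rr + pc) if rr + pc else 0.0
    return rr, pc, hm

# Toy example: 10 x 10 = 100 possible pairs, 6 candidates after
# blocking, covering 3 of the 4 known matches.
true_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
candidates = [(0, 0), (1, 1), (2, 2), (0, 5), (1, 7), (3, 9)]
rr, pc, hm = blocking_metrics(10, 10, candidates, true_pairs)
```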
Experiment Matrix
(Note: There could probably be more experimentation added with the way that numeric attributes are embedded and concatenated onto the record tuple embedding)