catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.
MIT License

FERC-EIA Record Linkage Experiments #34

Status: Closed (zaneselvans closed this issue 8 months ago)

zaneselvans commented 1 year ago

This epic lists the combinations of techniques that we want to explore for performing the FERC-EIA record linkage. The categories include:

## Blocking Strategies

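The blocking step dramatically reduces the number of record pairs that need to be compared, making the problem computationally feasible. There are several parts, which the experiments below vary: a text embedding (TF-IDF, word2vec, or fastText) and a method for aggregating those embeddings (equal weights, weighted aggregation, autoencoder, or seq2seq).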
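
To illustrate how blocking shrinks the comparison set, here is a minimal sketch in pure Python. It is not the project's implementation: the plant names are made up, the helper names (`char_ngrams`, `tfidf_vectors`, `block`) are hypothetical, and character-trigram TF-IDF with cosine similarity stands in for whichever embedding is actually used. Each left-hand record keeps only its `k` most similar right-hand records as candidates.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lowercased, space-padded string into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF, so n-grams that occur in every document still get weight.
        vec = {
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


def block(left, right, k=2):
    """Keep, for each left record, the indices of its k most similar right records."""
    vectors = tfidf_vectors(left + right)
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    candidates = []
    for i, lv in enumerate(left_vecs):
        ranked = sorted(
            range(len(right)),
            key=lambda j: cosine(lv, right_vecs[j]),
            reverse=True,
        )
        candidates.extend((i, j) for j in ranked[:k])
    return candidates


# Made-up plant names, for illustration only.
ferc_names = ["Comanche Unit 1", "Cherokee Station", "Big Sandy"]
eia_names = ["Comanche", "Cherokee", "Big Sandy Plant", "Valmont", "Pawnee", "Craig"]

candidates = block(ferc_names, eia_names, k=2)
total_pairs = len(ferc_names) * len(eia_names)
print(f"{len(candidates)} candidate pairs kept out of {total_pairs} possible pairs")
```

This brute-force version still scores every pair in order to rank them, so it only shows the reduction in what is handed to the linkage model. At realistic scale the ranking would typically be done with an approximate nearest-neighbor index instead.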

## Record Linkage Models

These operate on the subset of record pairs that were identified as potential matches in the blocking step. The options we're exploring are Splink and PGM.

## Experiments to Run
- [ ] https://github.com/catalyst-cooperative/ccai-entity-matching/issues/35
- [ ] https://github.com/catalyst-cooperative/ccai-entity-matching/issues/36
- [ ] TF-IDF + Splink + Weighted Aggregation
- [ ] TF-IDF + Splink + autoencoder
- [ ] TF-IDF + Splink + seq2seq
- [ ] TF-IDF + PGM + Equal Weights
- [ ] TF-IDF + PGM + Weighted Aggregation
- [ ] TF-IDF + PGM + autoencoder
- [ ] TF-IDF + PGM + seq2seq
- [ ] word2vec + Splink + Equal Weights
- [ ] word2vec + Splink + Weighted Aggregation
- [ ] word2vec + Splink + autoencoder
- [ ] word2vec + Splink + seq2seq
- [ ] word2vec + PGM + Equal Weights
- [ ] word2vec + PGM + Weighted Aggregation
- [ ] word2vec + PGM + autoencoder
- [ ] word2vec + PGM + seq2seq
- [ ] fastText + Splink + Equal Weights
- [ ] fastText + Splink + Weighted Aggregation
- [ ] fastText + Splink + autoencoder
- [ ] fastText + Splink + seq2seq
- [ ] fastText + PGM + Equal Weights
- [ ] fastText + PGM + Weighted Aggregation
- [ ] fastText + PGM + autoencoder
- [ ] fastText + PGM + seq2seq
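
The named experiments above follow a regular pattern: one embedding, one linkage model, and one aggregation method per run. As a sketch, assuming the intended design is the full cross product of those three choices (an assumption, since two entries are only linked by issue number), the grid can be generated rather than maintained by hand. The constant and function names here are hypothetical.

```python
from itertools import product

# Option labels copied from the task list above.
EMBEDDINGS = ["TF-IDF", "word2vec", "fastText"]
LINKAGE_MODELS = ["Splink", "PGM"]
AGGREGATIONS = ["Equal Weights", "Weighted Aggregation", "autoencoder", "seq2seq"]


def experiment_grid():
    """Return every embedding + linkage model + aggregation combination as a label."""
    return [
        " + ".join(combo)
        for combo in product(EMBEDDINGS, LINKAGE_MODELS, AGGREGATIONS)
    ]


experiments = experiment_grid()
for name in experiments:
    # Emit each combination in the same task-list format used in this issue.
    print(f"- [ ] {name}")
```

Under that assumption the grid has 3 × 2 × 4 = 24 entries, 23 of which appear by name in the list above.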