catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.

MIT License


TF-IDF + Splink + Equal Weights #35

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Run the FERC1-EIA record linkage process using TF-IDF for string feature vectorization with naive equal weighting of features, and Splink to do the record linkage.
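
One possible reading of "naive equal weighting" can be sketched as follows. This is only an illustration under stated assumptions, not the repository's implementation: each feature is encoded as its own vector block, every block is scaled to unit length, and the blocks are concatenated. Under that assumption, the cosine similarity of the concatenated vectors equals the plain mean of the per-feature cosine similarities. The helper names and the toy records are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine_equal_weights(feature_blocks):
    """Concatenate per-feature blocks after giving each block unit length."""
    combined = []
    for block in feature_blocks:
        combined.extend(l2_normalize(block))
    return combined


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical records with three feature blocks each:
# [plant name vector, utility name vector, one-hot fuel type].
# Note: a one-dimensional block would collapse to +/-1 under this scaling,
# so scalar numeric features would need a different encoding.
record_a = [[3.0, 4.0], [1.0, 1.0], [1.0, 0.0]]
record_b = [[4.0, 3.0], [1.0, 1.0], [0.0, 1.0]]

combined_similarity = cosine(
    combine_equal_weights(record_a), combine_equal_weights(record_b)
)
per_feature = [cosine(a, b) for a, b in zip(record_a, record_b)]
mean_similarity = sum(per_feature) / len(per_feature)
```

Here `combined_similarity` and `mean_similarity` agree, which is what makes the weighting "equal": no feature contributes more than any other to the overall score.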

Parameters to vary

Minimum and maximum n-gram lengths used by the TF-IDF vectorizer.
The value of k in KNN, or the minimum allowable cosine similarity, used in blocking.
Splink in supervised mode (based on our manual training data) versus unsupervised mode.
Is there any useful exploration to be done in how we encode the non-string (numerical and categorical) features?
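
To make the first two knobs concrete, here is a minimal standard-library sketch of character n-gram TF-IDF followed by top-k cosine-similarity blocking. It is an illustration, not the code used in this repository: the function names, the smoothed IDF formula, and the toy plant names are all assumptions. In practice scikit-learn's `TfidfVectorizer(analyzer="char", ngram_range=(min_n, max_n))` would likely do the vectorization.

```python
import math
from collections import Counter


def char_ngrams(text, min_n, max_n):
    """Return all character n-grams of text with min_n <= n <= max_n."""
    text = text.lower()
    return [
        text[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, min_n=2, max_n=3):
    """Build L2-normalized sparse TF-IDF vectors (dicts) over character n-grams."""
    counts = [Counter(char_ngrams(d, min_n, max_n)) for d in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF: one common variant, assumed here for illustration.
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def block_candidates(left, right, k=2, min_similarity=0.0, min_n=2, max_n=3):
    """For each left record, keep its k most similar right records above a threshold."""
    vectors = tfidf_vectors(left + right, min_n, max_n)
    left_vecs, right_vecs = vectors[: len(left)], vectors[len(left) :]
    candidates = {}
    for i, lv in enumerate(left_vecs):
        scored = sorted(
            ((cosine(lv, rv), j) for j, rv in enumerate(right_vecs)),
            key=lambda pair: (-pair[0], pair[1]),
        )
        candidates[i] = [
            (j, round(s, 3)) for s, j in scored[:k] if s >= min_similarity
        ]
    return candidates


# Hypothetical plant names standing in for FERC1 and EIA records.
ferc_names = ["comanche unit 1", "cherokee station"]
eia_names = ["comanche", "cherokee", "valmont"]
blocks = block_candidates(ferc_names, eia_names, k=1, min_similarity=0.1)
```

`blocks` maps each FERC row index to a list of `(eia_index, similarity)` candidates. Raising `k` or lowering `min_similarity` keeps more candidate pairs, which trades a smaller reduction in comparisons for a lower chance of excluding true matches.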

Evaluation criteria / outputs

Run time for the whole process.
Proportion of training data pairs excluded by the blocking strategy.
Reduction in the number of tuple pairs that need to be compared after blocking.
Proportion of identified matches that contradict known associations, e.g. manual plant_id_pudl assignments or the training data.
Fraction of the manually assigned training data matches recovered by the model.
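
The pair-based criteria above could be computed along these lines. This is a rough sketch under assumptions, not an agreed-upon definition: records are identified by hypothetical string IDs, pairs are `(ferc_id, eia_id)` tuples, and a predicted match counts as a violation when the training data links the same FERC record to a different EIA record (i.e. it assumes at most one true EIA match per FERC record). Run time could be measured separately, e.g. with `time.perf_counter()` around the whole pipeline.

```python
def evaluate_linkage(candidate_pairs, predicted_matches, training_matches,
                     n_left, n_right):
    """Compute blocking and matching metrics from sets of (left_id, right_id) pairs."""
    candidates = set(candidate_pairs)
    predicted = set(predicted_matches)
    training = set(training_matches)
    total_pairs = n_left * n_right

    # Training pairs that blocking removed can never be recovered downstream.
    excluded = training - candidates

    # Assumes each left record has at most one true match in the training data.
    training_by_left = dict(training)
    conflicts = {
        (left, right)
        for left, right in predicted
        if left in training_by_left and training_by_left[left] != right
    }

    return {
        "training_pairs_excluded": len(excluded) / len(training) if training else 0.0,
        "reduction_ratio": 1 - len(candidates) / total_pairs if total_pairs else 0.0,
        "conflict_rate": len(conflicts) / len(predicted) if predicted else 0.0,
        "training_recall": len(training & predicted) / len(training) if training else 0.0,
    }


# Toy example with hypothetical IDs: 3 FERC records, 4 EIA records.
metrics = evaluate_linkage(
    candidate_pairs=[("f1", "e1"), ("f1", "e2"), ("f2", "e2"), ("f3", "e4")],
    predicted_matches=[("f1", "e1"), ("f2", "e2"), ("f3", "e4")],
    training_matches=[("f1", "e1"), ("f2", "e2"), ("f3", "e3")],
    n_left=3,
    n_right=4,
)
```

In this toy case blocking keeps 4 of 12 possible pairs, drops one of three training pairs, and the one prediction for `f3` conflicts with the training data, so two of the three training matches are recovered.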