Run the FERC1-EIA record linkage process using TF-IDF for string feature vectorization with naive equal weighting of features, and Splink to do the record linkage.
Parameters to vary
Choice of min/max lengths for n-grams generated by TF-IDF.
Value of k in KNN, or the minimum allowable cosine similarity, used in blocking.
Splink run both supervised (based on our manual training data) and unsupervised.
Is there any useful exploration to be done in how we encode the non-string (numerical & categorical) features?
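As a rough illustration of the first two parameters, the sketch below builds character n-gram TF-IDF vectors and keeps, for each FERC record, the k most similar EIA records above a cosine similarity floor. It is a pure-Python stand-in written for this note, not the project's code: the function names, the toy plant names, and the grid values are all hypothetical, and a real run would presumably use a library vectorizer and an approximate nearest-neighbor index instead.

```python
import math
from collections import Counter
from itertools import product


def char_ngrams(text, min_n, max_n):
    """Return all character n-grams of text with min_n <= n <= max_n."""
    text = text.lower()
    return [
        text[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, min_n, max_n):
    """Build L2-normalized TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(d, min_n, max_n)) for d in docs]
    doc_freq = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF, so n-grams present in every document keep a nonzero weight.
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def block_candidates(ferc_names, eia_names, min_n, max_n, k, min_sim):
    """Keep, per FERC record, its k most similar EIA records with sim >= min_sim."""
    vecs = tfidf_vectors(ferc_names + eia_names, min_n, max_n)
    ferc_vecs, eia_vecs = vecs[: len(ferc_names)], vecs[len(ferc_names) :]
    pairs = set()
    for i, fv in enumerate(ferc_vecs):
        sims = sorted(
            ((cosine(fv, ev), j) for j, ev in enumerate(eia_vecs)), reverse=True
        )
        pairs.update((i, j) for sim, j in sims[:k] if sim >= min_sim)
    return pairs


# Hypothetical toy records, only to exercise the parameter grid.
ferc = ["Comanche Unit 1", "Cherokee Station", "Valmont"]
eia = ["Comanche", "Cherokee", "Valmont Station", "Pawnee"]

for (min_n, max_n), k, min_sim in product([(2, 3), (2, 5)], [1, 2], [0.0, 0.3]):
    pairs = block_candidates(ferc, eia, min_n, max_n, k, min_sim)
    print(f"ngrams=({min_n},{max_n}) k={k} min_sim={min_sim}: {len(pairs)} pairs")
```

Here k and the similarity floor are applied together; the experiment may instead treat them as alternative blocking rules, in which case one of the two would be disabled in each run.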
Evaluation criteria / outputs
Run time for the whole process.
Proportion of training data pairs excluded by the blocking strategy.
Reduction in the number of tuple pairs that need to be compared after blocking.
Proportion of identified matches that conflict with known constraints, e.g. manual plant_id_pudl assignments or the training data.
Fraction of the manually assigned training data matches recovered by the model.
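The criteria above could be computed along the following lines. This is a minimal sketch under stated assumptions: pairs are represented as (ferc_id, eia_id) tuples, each FERC record in the training data maps to exactly one EIA record, and the names `evaluate_linkage` and `timed` are hypothetical helpers invented for this note. Conflicts are checked against the training data only; a check against manual plant_id_pudl assignments would need its own lookup.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def evaluate_linkage(n_ferc, n_eia, candidate_pairs, predicted_matches, training_matches):
    """Compute blocking and matching diagnostics from sets of (ferc_id, eia_id) pairs."""
    candidate_pairs = set(candidate_pairs)
    predicted_matches = set(predicted_matches)
    training_matches = set(training_matches)
    if not training_matches or n_ferc * n_eia == 0:
        raise ValueError("need non-empty training data and record sets")

    # Known-good training pairs that blocking threw away before comparison.
    excluded = training_matches - candidate_pairs
    # Assumption: one EIA record per FERC record in the training data.
    truth = dict(training_matches)
    # Predicted matches that contradict a training label for the same FERC record.
    conflicts = {(f, e) for f, e in predicted_matches if f in truth and truth[f] != e}

    return {
        "training_excluded_by_blocking": len(excluded) / len(training_matches),
        "reduction_ratio": 1 - len(candidate_pairs) / (n_ferc * n_eia),
        "conflict_rate": len(conflicts) / len(predicted_matches)
        if predicted_matches
        else 0.0,
        "training_recall": len(training_matches & predicted_matches)
        / len(training_matches),
    }


# Hypothetical toy inputs: 4 FERC records, 5 EIA records.
candidates = {(1, "a"), (2, "b"), (3, "c"), (4, "d"), (1, "b")}
predicted = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
training = {(1, "a"), (2, "b"), (3, "x"), (4, "d")}

metrics, seconds = timed(evaluate_linkage, 4, 5, candidates, predicted, training)
print(metrics, f"elapsed={seconds:.6f}s")
```

In a real run, `timed` would wrap the whole linkage pipeline rather than the metric calculation, so that run time is reported alongside the quality metrics for each parameter combination.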