Closed katie-lamb closed 7 months ago
Current progress:
sklearn Agglomerative Clustering to group records
Remaining to do:
Agglomerative Clustering plant_id_ferc1.
sklearn pipeline, then add a layer on top to handle blocking keys/groups. Have a config file that works for both the FERC to EIA and FERC to FERC match.
Just updated the notebook to include some experiment tracking. I added the actual metric computation to src/ferc1_eia_match/metrics/ferc_to_ferc.py and it gets called from the notebook. Feel free to add new metrics as you see fit.
I also did a bit of research, and it looks like we can definitely add custom embedding functions to the pipelines if we'd like. I was also thinking about how to avoid matching multiple records from the same year. We can use custom distance functions for the clustering, so we could come up with a function that assigns an arbitrarily large distance to records from the same year and then just computes Euclidean distance (or something else) on the remaining features.
@zschira I changed the notebook so that I'm passing in a custom distance matrix, which places records with the same report year really far apart and uses Euclidean distance for everything else. Now no records with the same report year are grouped together, which is great! I also messed around with the parameters of the clustering model and decided that the "average" linkage method, where two clusters are merged if the average distance between their records is lower than the threshold, makes more sense.
Left to do:
.distances_ parameter that I think can be used to get this, but I got confused about what it represents. Needs more investigation.
Reworking the blocking module is sort of blocking (lol), but I think it's close to where we can start drafting a PUDL PR.
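On the `.distances_` question, a small sketch that may help with the investigation: in scikit-learn, `distances_[i]` is the linkage distance at which the merge recorded in `children_[i]` happened, and it is only populated when `distance_threshold` is set or `compute_distances=True`. The four one-dimensional points and the threshold are toy values chosen so the merges are easy to check by hand.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four toy points on a line.
X = np.array([[0.0], [1.0], [3.0], [10.0]])

model = AgglomerativeClustering(
    n_clusters=None,
    linkage="average",
    distance_threshold=2.0,  # setting a threshold makes sklearn fill distances_
).fit(X)

# Each row of children_ is one merge; ids >= len(X) refer to the cluster
# created by an earlier merge (id = len(X) + merge index).
for merge_index, (pair, dist) in enumerate(zip(model.children_, model.distances_)):
    new_id = len(X) + merge_index
    print(f"merge {merge_index}: nodes {tuple(pair)} -> node {new_id} at distance {dist:.3f}")

# Merges happen at 1.0, 2.5 and about 8.667; only the first is below the
# threshold, so three clusters remain.
print(model.n_clusters_)
```

Read this way, the entries of `distances_` below the threshold are the merges that were actually applied to form the final clusters.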
@zschira I changed the blocking module so that the user passes in a list of transformer functions that can be used in a sklearn ColumnTransformer. Here's the remaining to-do list before I think we can merge this PR in (and hopefully the PUDL PR too):
column_transformers argument that replaces embedding_map in ferc1_eia_match.package_data.blocking_config.json, since this argument now has tuples with column transformers in it. Maybe this file shouldn't be a JSON anymore?
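To make the shape of that argument concrete, here is a rough sketch of a `column_transformers` list in the (name, transformer, columns) form that sklearn's ColumnTransformer expects. The column names, the toy records, and the `embed_records` helper are hypothetical and not taken from the blocking module.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Made-up records with hypothetical column names.
records = pd.DataFrame(
    {
        "plant_name": ["alpha station", "alpha", "beta plant"],
        "capacity_mw": [100.0, 105.0, 900.0],
    }
)

# Each entry is a (name, transformer, columns) tuple. Text vectorizers take
# a single column name (a string), numeric scalers take a list of columns.
column_transformers = [
    ("name", TfidfVectorizer(analyzer="char", ngram_range=(2, 3)), "plant_name"),
    ("capacity", MinMaxScaler(), ["capacity_mw"]),
]


def embed_records(df, column_transformers):
    """Hypothetical helper: vectorize a dataframe with the given transformers."""
    embedder = ColumnTransformer(transformers=column_transformers)
    return embedder.fit_transform(df)


embeddings = embed_records(records, column_transformers)
print(embeddings.shape)
```

Because the tuples hold live transformer objects, they cannot be written to JSON directly, which is the tension behind the question about the config file format.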
I made some changes to fix the environment and added a notebook to do the FERC to FERC match. I originally wanted to just plug it into the blocking module, but then decided that when vectorizing just one dataset (not two, as in a record linkage problem) it makes more sense to use the sklearn pipeline that's already in the PUDL module. Also, the blocking module in this repo could probably be cleaned up into a similar pipeline structure.
As for the match, I was hoping that more faiss functionality could be used, but then realized that there isn't a set number of clusters (plants in this case) that we're grouping the FERC records into. So then I thought I could use the cosine similarity matrix generated from faiss and pass it into the sklearn Agglomerative Clustering function. I couldn't make this function work when passing the distance matrix from a faiss index into the connectivity argument, but maybe I was doing something wrong.
I think the results from the sklearn Agglomerative Clustering are looking pretty good, with not that many overlapping report years being grouped into the same cluster and the average distance between records in a cluster being really small. More cross validation of the parameters of the clustering model could be done.