catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.
MIT License

Messing around with the FERC to FERC match #114

Closed katie-lamb closed 7 months ago

katie-lamb commented 11 months ago

I made some changes to fix the environment and added a notebook to do the FERC to FERC match. I originally wanted to just plug it into the blocking module, but then decided that in the case of vectorizing just one dataset (not two, as in a record linkage problem) it makes more sense to use the sklearn pipeline that's already in the PUDL module. The blocking module in this repo could probably also be cleaned up into a similar pipeline structure.
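For context, a single-dataset vectorizer along these lines can be sketched directly with sklearn (the column name and parameters here are hypothetical placeholders, not the actual PUDL pipeline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical: TF-IDF one string column, then unit-normalize rows
# so inner products later behave like cosine similarities.
vectorize = Pipeline(
    [
        ("columns", ColumnTransformer(
            # a plain string selector hands the 1-D column to TfidfVectorizer
            [("plant_name", TfidfVectorizer(analyzer="char", ngram_range=(2, 3)), "plant_name")]
        )),
        ("norm", Normalizer()),
    ]
)

ferc = pd.DataFrame({"plant_name": ["barry", "barry steam plant", "gadsden"]})
X = vectorize.fit_transform(ferc)
```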

As for the match, I was hoping that more faiss functionality could be used, but then realized that there isn't a set number of clusters (plants, in this case) that we're grouping the FERC records into. So then I thought I could take the cosine similarity matrix generated from faiss and pass it into the sklearn AgglomerativeClustering function. I couldn't make this function work when passing the distance matrix from a faiss index into the connectivity argument, but maybe I was doing something wrong.

I think the results from sklearn AgglomerativeClustering are looking pretty good: not many overlapping report years get grouped into the same cluster, and the average distance between records in a cluster is really small. More cross-validation of the clustering model's parameters could be done.


katie-lamb commented 10 months ago

Current progress:

Remaining to do:

zschira commented 10 months ago

Just updated the notebook to include some experiment tracking. I added the actual metric computation to src/ferc1_eia_match/metrics/ferc_to_ferc.py and it gets called from the notebook. Feel free to add new metrics as you see fit.

I also did a bit of research, and it looks like we can definitely add custom embedding functions to the pipelines if we'd like. I was also thinking about how to handle the issue of not matching multiple records from the same year, and I saw that we can use custom distance functions for the clustering. So we could come up with a function that assigns an arbitrarily large distance to records from the same year and just computes Euclidean distance (or something else) on the remaining features.

katie-lamb commented 10 months ago

@zschira I changed the notebook so that I'm passing in a custom distance matrix, which pushes records with the same report year really far apart and uses Euclidean distance for everything else. Now no records with the same report year are grouped together, which is great! I also messed around with the parameters of the clustering model and decided that the "average" linkage method, where two clusters are merged if the average distance between their records is below the threshold, makes more sense.

Left to do:

Reworking the blocking module is sort of blocking (lol), but I think it's close to where we can start drafting a PUDL PR.

katie-lamb commented 10 months ago

@zschira I changed the blocking module so that the user passes in a list of transformer functions that can be used in a sklearn ColumnTransformer. Here's a remaining to-do list before I think we can merge this PR (and hopefully the PUDL PR too):
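As a sketch of that interface, the user-supplied list could be `(name, transformer, columns)` tuples assembled straight into a ColumnTransformer (all names and parameters here are hypothetical, not the actual module API):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical: the caller chooses how each column is embedded
column_transformers = [
    # string selector -> 1-D input, as TfidfVectorizer expects
    ("plant_name", TfidfVectorizer(analyzer="char", ngram_range=(2, 3)), "plant_name"),
    # list selector -> 2-D input, as MinMaxScaler expects
    ("capacity", MinMaxScaler(), ["capacity_mw"]),
]

# The blocking module would just wrap the list like this
vectorizer = ColumnTransformer(column_transformers)

df = pd.DataFrame(
    {
        "plant_name": ["barry", "barry steam", "big bend"],
        "capacity_mw": [153.1, 153.1, 1807.0],
    }
)
embeddings = vectorizer.fit_transform(df)
```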