catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.
MIT License

Messing around with the FERC to FERC match #114

Closed katie-lamb closed 7 months ago

katie-lamb commented 11 months ago

I made some changes to fix the environment and added a notebook to do the FERC to FERC match. I originally wanted to just plug it into the blocking module, but then decided that in the case of vectorizing just one dataset (not two, as in a record linkage problem) it makes more sense to use the sklearn pipeline that's already in the PUDL module. The blocking module in this repo could probably also be cleaned up into a similar pipeline structure.
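For context, a single-dataset vectorizer along these lines can be sketched directly with sklearn (the column name and parameters here are hypothetical placeholders, not the actual PUDL pipeline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical: TF-IDF one string column, then unit-normalize rows
# so inner products later behave like cosine similarities.
vectorize = Pipeline(
    [
        ("columns", ColumnTransformer(
            # a plain string selector hands the 1-D column to TfidfVectorizer
            [("plant_name", TfidfVectorizer(analyzer="char", ngram_range=(2, 3)), "plant_name")]
        )),
        ("norm", Normalizer()),
    ]
)

ferc = pd.DataFrame({"plant_name": ["barry", "barry steam plant", "gadsden"]})
X = vectorize.fit_transform(ferc)
```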

As for the match, I was hoping that more faiss functionality could be used, but then realized that there isn't a set number of clusters (plants, in this case) that we're grouping the FERC records into. So then I thought I could take the cosine similarity matrix generated from faiss and pass it into the sklearn AgglomerativeClustering function. I couldn't make this function work when passing the distance matrix from a faiss index into the connectivity argument, but maybe I was doing something wrong.

I think the results from sklearn AgglomerativeClustering are looking pretty good: not many overlapping report years get grouped into the same cluster, and the average distance between records in a cluster is really small. More cross-validation of the clustering model's parameters could be done.


katie-lamb commented 10 months ago

Current progress:

Remaining to do:

zschira commented 10 months ago

Just updated the notebook to include some experiment tracking. I added the actual metric computation to src/ferc1_eia_match/metrics/ferc_to_ferc.py and it gets called from the notebook. Feel free to add new metrics as you see fit.

I also did a bit of research, and it looks like we can definitely add custom embedding functions to the pipelines if we'd like. I was also thinking about how to handle the issue of not matching multiple records from the same year, and I saw that we can use custom distance functions for the clustering. So we could come up with a function that assigns an arbitrarily large distance to records from the same year and just computes Euclidean distance (or something else) on the remaining features.

katie-lamb commented 10 months ago

@zschira I changed the notebook so that I'm passing in a custom distance matrix, which pushes records with the same report year really far apart and uses Euclidean distance for everything else. Now no records with the same report year are grouped together, which is great! I also messed around with the parameters of the clustering model and decided that the "average" linkage method, where two clusters are merged if the average distance between their records is below the threshold, makes more sense.

Left to do:

Reworking the blocking module is sort of blocking (lol), but I think it's close to where we can start drafting a PUDL PR.

katie-lamb commented 10 months ago

@zschira I changed the blocking module so that the user passes in a list of transformer functions that can be used in a sklearn ColumnTransformer. Here's a remaining to-do list before I think we can merge this PR (and hopefully the PUDL PR too):
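As a sketch of that interface, the user-supplied list could be `(name, transformer, columns)` tuples assembled straight into a ColumnTransformer (all names and parameters here are hypothetical, not the actual module API):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical: the caller chooses how each column is embedded
column_transformers = [
    # string selector -> 1-D input, as TfidfVectorizer expects
    ("plant_name", TfidfVectorizer(analyzer="char", ngram_range=(2, 3)), "plant_name"),
    # list selector -> 2-D input, as MinMaxScaler expects
    ("capacity", MinMaxScaler(), ["capacity_mw"]),
]

# The blocking module would just wrap the list like this
vectorizer = ColumnTransformer(column_transformers)

df = pd.DataFrame(
    {
        "plant_name": ["barry", "barry steam", "big bend"],
        "capacity_mw": [153.1, 153.1, 1807.0],
    }
)
embeddings = vectorizer.fit_transform(df)
```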