Assign FERC Plant IDs - Githubissues

zaneselvans commented 6 years ago

Organizing several sub-issues related to the FERC Plant ID assignments:

- [x] #215
- [x] #218
- [x] #219
- [x] #220
- [x] #223
- [x] #240
- [x] #242
- [x] #222
- [ ] #221
- [ ] #303
- [ ] #304
- [ ] #216
- [ ] #217
- [ ] #475

Original Issue Description:

CPI has created a spreadsheet that does combines the following fields from the f1_steam table to create a candidate ID that persists between years for reported FERC plants:

plant_name
year_constructed
plant_kind (within a well defined taxonomy)
respondent_id

The problem of matching/clustering entity names which aren't exactly the same (but should be) is common. In order to be able to do it in a computationally efficient way, the problem is often vectorized. A popular approach appears to be using the TF-IDF (Term Frequency * Inverse Document Frequency) metric, with n-grams (either word or character based) as the "terms." A nice example of this process is outlined here.

In addition to the strings that are the reported plant names, we also have the other categorical features, which we can integrate separately as "one hot" features (that match, or not) in order to independently weight the different features, before applying a clustering algorithm. Here's how that works.

A common and computationally efficient metric of n-dimensional vector similarity is the cosine of the angle separating the vectors. Here's an explanation of using cosine similarity to compare vectorized documents.

Some other related resources:

Text based feature extraction w/ scikit-learn
Dictionary (one-hot) vectorization: w/ scikit-learn.
More complex & powerful natural language tokenization is possible using the NLTK package.

Once we have the FERC plant features identified, and have chosen a metric of similarity to compare them, there are a bunch of different clustering algorithms we can use to pull them together... or we can just look at the top match for each year -- which will usually be a perfect match (since many utilities do use well defined names for their plants.) Once we've identified the FERC records that are similar enough for us to consider them the "same" plant, we can assign unique numerical IDs to each of them on a record-by-record basis.

Also need to incorporate the other filters & tests that CPI put into the spreadsheet, which include removing records with missing data and records that require human categorization (but which get flagged at ingest).

zaneselvans commented 6 years ago

Okay, so this is a little more complicated than I thought (who is surprised by this?).

We have the plant name (a string) and the three categorical features (construction year, respondent ID, and plant kind), and for the purposes of clustering, we really need to keep all three of them associated with each other for any given FERC plant record. But we also need to be able to scale their resulting feature vectors independently, to allow the features to be weighted differently so we can tune the system to most effectively identify plants.

For each apparently valid record in the FERC Plants table, we need to split out the features, vectorize & scale them, and then re-combine (and re-scale) the vectors before calculating the matrix of pairwise cosine similarities between all records. That matrix of record-to-record similarity is the raw material on which we would want to run the clustering analysis.

The clustering has a lot of interesting constraints to consider, which are both complicating, and helpful. If what we're trying to do is identify individual plants, then we might leave out the respondent ID, and only play with the plant type, plant name, and construction year. But this would result in clusters where the same plants are being reported by different utilities -- in the same year when they have joint ownership, and in different years when the plants change hands.

Those true plant clusters will be interesting (and give a good indication as to whether this is working) but don't get us to having a clean time series for each plant, which will allow us to do regressions and estimate the variable & fixed operating expenses associated with each plant historically. Within each plant cluster, we then need to organize the records into a collection of time series -- choosing no more than one record from each report year for each responding utility.

There are several parameters that can be tuned to enhance the results, but doing so would require having a set of labeled test data -- plants that we have identified by hand as being associated with each other -- and I don't know how much we would need. Tuning parameters would include:

The length of n-grams used to vectorize the plant names
The relative weights of the four inputs.

Those parameters could be optimized using a grid search or other automated mechanism. The method of extracting features from the string input might be more fuzzy.

zaneselvans commented 6 years ago

A good post on several different ways to do categorical feature encoding.

zaneselvans commented 6 years ago

All the feature vectorizations seem to be working... but need to figure out the right weights & synthesize the time series.

Refining the above clustering to take explicit account of the fact that we're trying to create a time series for each plant... sounds a lot like the original CPI approach (duh):

Calculate the distance metric (cosine similarity, etc.) between all pairs of plant records, with some set of weights for name, respondent id, year constructed, and plant time, and some set of name vectorization parameters (n-gram lengths, etc.)
Now iterate over all of the years of data. For each record in each year, find the most similar record in each of the other years, above some threshold of similarity (so if there really isn't a good match, the best but still bad match gets discarded). This will generate a cluster of plants around each plant record for each year.
Any cluster of plant records that is identical for all the years in which it has a member is a matched (inter-year) plant!
Then we need to look at the leftovers to see what we are missing...

Possible refinement: We know that many plants report individual generation units. Often this information only shows up in the plant name, which ends in a number, e.g. "2", or "Unit 2" or "#2" or "(2)" -- this is a small amount of text, but very important information for the purpose of matching up time series, which ought to be weighted separately in an ideal world. We could split it off into a 5th categorical feature field (e.g. unit_id) and remove it from the plant names, if there are a small enough number of patterns to match.

zaneselvans commented 6 years ago

There are two main constraints that we have to work with in selecting records within a cluster:

Each cluster may contain at most one record from each year.
Each record may be part of at most one cluster.

An outline of a process that uses the above constraints:

Filter similarity matrix with a threshold for acceptably good match.
For each record, find the best match in each year, subject to some minimum threshold of similarity.
For each record, get the list of all other records whose best matches it appears in.
Take the intersection of those two sets of records -- this is a candidate cluster.
Use each record in the candidate cluster as a seed to generate a candidate cluster.
If all records in the candidate cluster generate the same cluster, assign them all an id, and remove them from the list of records that need to be matched.
Recalculate which records are most similar to each other, given the removal of the clustered records.
Optionally, relax the constraint that each record can appear in at most one cluster by not removing them or re-calculating the best matches, to evaluate how much duplication exists and the impact of using a greedy algorithm on the clustering.

zaneselvans commented 5 years ago

The FERC time series generation is now structurally working -- pulling out collections of records, at most one per year, which are similar to each other, based on:

respondent_id (as a categorical feature)
plant_name (using TF-IDF on substrings)
tot_capacity (as a numeric feature)
plant_kind (as a categorical feature, after the field is coded using a thesaurus of strings)
yr_const (year of plant construction, as a categorical feature)
const_type (construction type, as a categorical feature)

How well the model chooses records to put in a time series depends on:

the relative weightings of the above features
the minimum required similarity for a feature to be potentially included in a series
the range of substring lengths used in the plant name vectorization

cmgosnell commented 1 year ago

@zaneselvans can we close this? this is a "nothing is ever done" type of thing but we've been using these id's for a verrrry long time at this point and if we want to revamp them we can always come back to this issue to see what was not incorporated/what you were thinking of next steps.

zaneselvans commented 1 year ago

Yeah, I guess so. I pulled #475 and #216 into #2386 since the memory intensity and slowness are significant hangups in the DAG right now.

I hope that at some point we can revisit this system with our learnings from the other record linkage projects we've done, and maybe get @katie-lamb's eyes on it.

catalyst-cooperative / pudl

Assign FERC Plant IDs #144

Original Issue Description: