catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

First pass at record linkage #94

Open katie-lamb opened 3 days ago

katie-lamb commented 3 days ago

Overview

See record linkage design doc for diagram and more notes.

We want to link the SEC filers we've extracted to the companies (owner and operator utilities) that file with EIA (Forms 860 and 861). A few more steps are needed to make sure we have the correct data on both sides; then we can do a proof of concept to check that splink can effectively connect SEC to EIA.

Success Criteria

### Tasks
- [x] Check if there's utility address information reported in raw EIA 861 - see #96 
- [x] Check that all owner companies are reported in the EIA 860 utilities table
- [ ] Make a plan for reading in all SEC filer data
- [ ] Create a validation set of 50 manually mapped EIA utilities to SEC filers
- [ ] Deduplicate SEC Ex. 21 subsidiary companies, assign a unique ID for all SEC companies
- [ ] #93 
- [ ] Fuzzy match the Ex. 21 subsidiary companies to SEC filers
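
As a rough illustration of the deduplication and fuzzy-matching tasks above, here is a minimal sketch using only the Python standard library. It is a stand-in for discussion, not the splink-based approach this issue proposes: the suffix list, the `0.85` threshold, the helper names, and the example company names are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to strip before comparing names.
_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "lp", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def deduplicate(names):
    """Map each distinct normalized name to a sequential integer ID."""
    ids = {}
    for name in names:
        ids.setdefault(normalize_name(name), len(ids))
    return ids


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the most similar candidate, or (None, score) below threshold."""
    key = normalize_name(name)
    scored = [(SequenceMatcher(None, key, normalize_name(c)).ratio(), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return (match, score) if score >= threshold else (None, score)


# Made-up example data, not taken from Ex. 21 filings.
subsidiaries = ["Acme Power, Inc.", "ACME POWER INC", "Birch Energy LLC"]
filers = ["Acme Power Corporation", "Cedar Holdings Co."]

subsidiary_ids = deduplicate(subsidiaries)
matches = {s: best_match(s, filers) for s in subsidiaries}
```

Here the two spellings of "Acme Power" collapse to one ID and match the filer "Acme Power Corporation", while "Birch Energy LLC" stays unmatched. Stripping suffixes this aggressively can merge companies that are legally distinct, which is one reason to check results against the manually mapped validation set.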

katie-lamb commented 2 days ago

The EIA 861 mergers table (core_eia861__yearly_mergers) has address information for the new parent company, which can be joined on utility_id_eia. However, there are only a small number of utilities in this table (219 rows), and merger_company doesn't have an ID and doesn't look super clean (it's a string).
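
To make the join concrete, here is a small sketch using an in-memory SQLite database. Only `core_eia861__yearly_mergers`, `utility_id_eia`, and `merger_company` come from the comment above; the `utilities` table, the `utility_name`, `merger_city`, and `merger_state` columns, and all row values are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical utilities table standing in for the EIA utility entities.
    CREATE TABLE utilities (utility_id_eia INTEGER PRIMARY KEY, utility_name TEXT);
    -- Address column names below are assumed, not taken from the real schema.
    CREATE TABLE core_eia861__yearly_mergers (
        utility_id_eia INTEGER,
        merger_company TEXT,
        merger_city TEXT,
        merger_state TEXT
    );
    INSERT INTO utilities VALUES (1, 'Alpha Electric'), (2, 'Beta Power'), (3, 'Gamma Light');
    INSERT INTO core_eia861__yearly_mergers
        VALUES (2, '  BETA HOLDINGS,   inc. ', 'Springfield', 'IL');
    """
)

# Left join keeps every utility, with NULLs where no merger record exists.
rows = conn.execute(
    """
    SELECT u.utility_id_eia, u.utility_name,
           m.merger_company, m.merger_city, m.merger_state
    FROM utilities AS u
    LEFT JOIN core_eia861__yearly_mergers AS m
        ON u.utility_id_eia = m.utility_id_eia
    ORDER BY u.utility_id_eia
    """
).fetchall()

# Share of utilities that pick up parent-company address information.
coverage = sum(1 for r in rows if r[2] is not None) / len(rows)

# merger_company is a raw string, so at minimum collapse stray whitespace.
cleaned = [" ".join(r[2].split()) if r[2] is not None else None for r in rows]
```

The left join shows how sparse the coverage is: in this toy data only one of three utilities gains an address.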

Also, we aren't harvesting utilities from 861; we should probably do this in parallel with developing the match.