Now that we've finished a first pass at record linkage, here is a list of improvements to try and next steps for model development.
### Success Criteria
How will we know that we're done?
- [ ] Record linkage performs well in a notebook and meets validation thresholds
### Next steps
- [x] Assign a sec_company_id to all companies in SEC basic info and Ex. 21
- [x] Make validation data
- [x] Run the models with validation data to benchmark performance
- [x] Add in the Ex. 21 subsidiaries to the SEC side and perform the SEC to EIA match
- [x] Create a method for clustering duplicate company records
- [x] Don't block on report year?
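
For reference, one common way to cluster duplicate company records is to treat pairwise matches as edges in a graph and take connected components. The sketch below is a minimal illustration of that idea, not necessarily the method used here; `cluster_records`, the record IDs, and the matched pairs are all hypothetical.

```python
from collections import defaultdict


def cluster_records(record_ids, matched_pairs):
    """Group record IDs into clusters (connected components of the match graph).

    Assumes every ID in matched_pairs also appears in record_ids.
    """
    # Each record starts as the root of its own cluster.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical record IDs and pairwise matches, for illustration only.
records = ["a", "b", "c", "d", "e"]
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
clusters = cluster_records(records, pairs)
```

Note that connected components are transitive: "a" and "c" end up in the same cluster even though they were never matched directly, so a single bad match can merge two otherwise distinct companies.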
### SEC to EIA model improvement ideas
- [ ] Try a blocking rule that checks for overlap between array columns of company name metaphones
- [ ] Create an address array column and add an array intersection comparison
- [ ] Try adding company name metaphone to the company name comparison level
- [ ] Try filtering SEC data by sector: search for keywords in SIC (gas, electric, utility, etc.)
- [x] Refine Ex. 21 to SEC match now that the record ID is fixed
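
As a rough illustration of two of the ideas above (array-overlap blocking and SIC keyword filtering), here is a minimal plain-Python sketch. It is a set of assumptions rather than the model's actual implementation: lowercased name tokens stand in for metaphone codes, and the record fields (`id`, `company_name`, `sic_description`), function names, and sample data are hypothetical.

```python
import re

# Keyword pattern built from the sectors listed above; the \w* suffixes let
# "utilit" match both "utility" and "utilities".
SIC_PATTERN = re.compile(r"\b(gas|electric\w*|utilit\w*)\b")


def name_tokens(name):
    """Lowercased word tokens; a stand-in for per-word metaphone codes."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def is_energy_sector(sic_description):
    """True if the SIC description mentions one of the sector keywords."""
    return SIC_PATTERN.search(sic_description.lower()) is not None


def candidate_pairs(sec_records, eia_records):
    """Keep only pairs whose name-token arrays share at least one element."""
    pairs = []
    for sec in sec_records:
        sec_tokens = name_tokens(sec["company_name"])
        for eia in eia_records:
            if not sec_tokens.isdisjoint(name_tokens(eia["company_name"])):
                pairs.append((sec["id"], eia["id"]))
    return pairs


# Hypothetical sample records, for illustration only.
sec = [
    {"id": "sec_1", "company_name": "Acme Power Holdings",
     "sic_description": "Electric Services"},
    {"id": "sec_2", "company_name": "Widget Retail Corp",
     "sic_description": "Retail Stores"},
]
eia = [{"id": "eia_1", "company_name": "Acme Power Co"}]

# Filter the SEC side by sector first, then block on token overlap.
sec_filtered = [r for r in sec if is_energy_sector(r["sic_description"])]
pairs = candidate_pairs(sec_filtered, eia)
```

One caveat with overlap-based blocking: very common tokens (such as "corp" or "inc") would make almost every pair a candidate, so they would likely need to be stripped before building the arrays.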