cmgosnell opened this issue 3 years ago
@grgmiller I feel like I just saw a comment from you somewhere about tackling one or some of this missing data, but I can't find it now. Do you still need/want some guidance on the steps to bring a new table in?
Hi Zane, yes I just posted in the general slack channel about that. That would be helpful if the guidance exists!
Mainly just understanding where I need to add new table names, field names, metadata, etc. would be helpful. I could take a trial-and-error approach, using the `harvesting_debug.ipynb` notebook to try to load a new table specified in the settings YAML file until it fails and then fixing each issue, but that seems like an inefficient approach.
Okay @grgmiller just so we have this somewhere relevant and can migrate it to the docs if it turns out to be correct, here's what I think needs to happen to get a new table from the EIA spreadsheets integrated. It kind of trails off in the details at the end but this should be enough to get you started!
Say the table you want to bring in is `fuel_receipts_costs_eia923`.

- Tell the extraction process where to find the new page in the original spreadsheets. Each data source has its own mapping CSVs; for EIA 860 these are `src/pudl/package_data/eia860/file_map.csv`, `src/pudl/package_data/eia860/page_map.csv`, `src/pudl/package_data/eia860/skiprows.csv`, and `src/pudl/package_data/eia860/skipfooter.csv`.
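As a rough illustration only (I haven't verified the exact layout of these files, and the file name values below are invented), `file_map.csv` pairs each page name with the spreadsheet file that contains it in each reporting year:

```csv
page,2019,2020
fuel_receipts_costs,EIA923_Schedule_2019.xlsx,EIA923_Schedule_2020.xlsx
```

The other three CSVs follow the same page-per-row pattern, holding the tab name, rows to skip at the top, and rows to skip at the bottom for each year.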
- Map the original column names (converted to `snake_case` first) to the canonical column names that will be used within PUDL. This is done on a page-by-page basis in the CSVs under `src/pudl/package_data/{data source}/column_maps/{page name}.csv`.
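The `snake_case` conversion of a raw spreadsheet header might look like this standalone sketch (`to_snake_case` is an illustrative helper, not a PUDL function):

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a raw spreadsheet column header to snake_case."""
    name = name.strip().lower()
    # Collapse any run of spaces/punctuation into a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


print(to_snake_case("Net Generation (MWh)"))  # net_generation_mwh
```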
- Make sure every column is defined in `src/pudl/metadata/fields.py`. A few columns exist only transiently: e.g. we might have a `net_generation_kwh` column for extraction, which isn't defined in the DB, because by the time it's loaded into the DB the same data is now in a column named `net_generation_mwh`. But most columns do appear in both places. We try to avoid unnecessary renames, but also to ensure that the meaning described by the column name represents the contents of the column, and we change them both simultaneously in the Transform step.

- Follow the column naming conventions: `name`, `id`, and `code` have specific meanings. Data source specific fields end with a data source suffix, e.g. `plant_id_eia` is a plant identifier that only makes sense if you look it up in its home table (where it will typically be a or the primary key). It will probably be an integer (though `generator_id` is a string that's often but not always an integer), and the ID will be consistent across all EIA data sources. All column names are `snake_case`, and we use units as suffixes for data columns, e.g. `grid_voltage_kv` or `net_generation_mwh`.
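The simultaneous rename-and-unit-conversion described above could be sketched like this (pure Python for illustration; this is not PUDL's actual transform code):

```python
def convert_kwh_to_mwh(record: dict) -> dict:
    """Rename net_generation_kwh -> net_generation_mwh, converting units.

    Doing the rename and the unit conversion in a single step ensures the
    column name always matches the column's actual contents.
    """
    out = dict(record)
    if "net_generation_kwh" in out:
        out["net_generation_mwh"] = out.pop("net_generation_kwh") / 1000
    return out


print(convert_kwh_to_mwh({"plant_id_eia": 3, "net_generation_kwh": 4500.0}))
# {'plant_id_eia': 3, 'net_generation_mwh': 4.5}
```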
- Hook the new page into the extraction code. There are `ExcelExtractor` subclasses that exist for each EIA form, and `DataSource` classes that define the years of available data; I don't think you should have to modify those. You would need to update e.g. the `Eia860Settings` class to recognize the newly named table. The best notebook to use for interactive development here is probably `devtools/eia-etl-debug.ipynb`. You can define an `Eia860Settings` object manually, and hand that in to the `pudl.extract.eia860.Extractor.extract()` method to see what happens. You should get back a dictionary of dataframes where the keys are the page names, including your newly defined page, and the values are dataframes containing the concatenated data across all years available.

- Write the transform step for the new table. Look at the `transform()` function and the other functions in any of the sub-packages named by data source under `src/pudl/transform` to see how your new table / function should be integrated.
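The extraction contract described above (page names as keys, multi-year concatenated data as values) can be mocked without PUDL at all. Everything here is illustrative; `fake_extract` and the inline data are stand-ins, not the real `Extractor` API:

```python
def fake_extract(years, pages):
    """Mimic the shape of extract(): {page_name: rows concatenated across years}."""
    raw = {  # stand-in for the per-year spreadsheet contents
        2019: {"generator": [{"generator_id": "1", "report_year": 2019}]},
        2020: {"generator": [{"generator_id": "1", "report_year": 2020}]},
    }
    result = {page: [] for page in pages}
    for year in years:
        for page in pages:
            result[page].extend(raw[year][page])  # concatenate across years
    return result


frames = fake_extract([2019, 2020], ["generator"])
print(sorted(r["report_year"] for r in frames["generator"]))  # [2019, 2020]
```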
- Understand how harvesting will treat the new table. Attributes that are reported in more than one place get deduplicated: all of the values reported for a given entity (keyed by columns like `report_date`, `plant_id_eia`, and `generator_id`) are compiled together, and the most likely to be correct value is selected.
- The entity structure is defined in `src/pudl/metadata/resources/__init__.py`. It indicates which attributes belong to what entities (utilities, plants, generators, boilers...) and whether they are fixed or annual.

- New fields need to be defined in `src/pudl/metadata/fields.py`. They are sorted alphabetically. At the very least they need a globally unique name, a data type (see the examples in there for available types; they're based on the tabular data package standard, and converted to SQLite / Pandas / Arrow types by the code as appropriate), and a description explaining what the column means.
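A hypothetical field entry following that pattern (the exact dict layout in PUDL's `fields.py` may differ; this just shows the minimum required pieces: unique name, type, description):

```python
# Illustrative only; the field name and description are invented.
FIELD_METADATA = {
    "fuel_received_units": {
        "type": "number",  # tabular data package type, mapped to SQLite/Pandas/Arrow
        "description": "Quantity of fuel received, in the units reported by EIA.",
    },
}

print(sorted(FIELD_METADATA))  # fields are kept sorted alphabetically
```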
- Finally, the new table itself needs to be described in `src/pudl/metadata/resources/{data source}.py`, with a schema that references the field names defined in `fields.py`.
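A sketch of what such a table definition might look like; the structure, description, and primary key below are assumptions for illustration, not PUDL's actual metadata:

```python
# Hypothetical resource entry; field names must exist in fields.py.
RESOURCE_METADATA = {
    "fuel_receipts_costs_eia923": {
        "description": "Monthly fuel deliveries reported on EIA Form 923.",
        "schema": {
            "fields": ["plant_id_eia", "report_date", "fuel_received_units"],
            "primary_key": ["plant_id_eia", "report_date"],
        },
    },
}
```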
We have integrated most of the tables from EIA 860 and 923, but we're still missing several. This issue collects all tables that are still missing, so we can keep track of our progress towards complete data integration.