cmgosnell opened this issue 3 years ago
@grgmiller I feel like I just saw a comment from you somewhere about tackling one or some of this missing data, but I can't find it now. Do you still need/want some guidance on the steps to bring a new table in?
Hi Zane, yes I just posted in the general slack channel about that. That would be helpful if the guidance exists!
Mainly just understanding where I need to add new table names, field names, metadata, etc. would be helpful. I could take a trial-and-error approach, using the `harvesting_debug.ipynb` notebook to try to load a new table specified in the settings YAML file until it fails and then fixing each issue, but that seems like an inefficient approach.
Okay @grgmiller just so we have this somewhere relevant and can migrate it to the docs if it turns out to be correct, here's what I think needs to happen to get a new table from the EIA spreadsheets integrated. It kind of trails off in the details at the end but this should be enough to get you started!
Say the table you want to bring in is `fuel_receipts_costs_eia923`.

- Tell the extraction process where to find the new page in the original spreadsheets. Each data source has its own mapping CSVs; for EIA 860 these are `src/pudl/package_data/eia860/file_map.csv`, `src/pudl/package_data/eia860/page_map.csv`, `src/pudl/package_data/eia860/skiprows.csv`, and `src/pudl/package_data/eia860/skipfooter.csv`.
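As a rough illustration only (I haven't verified the exact layout of these files, and the file name values below are invented), `file_map.csv` pairs each page name with the spreadsheet file that contains it in each reporting year:

```csv
page,2019,2020
fuel_receipts_costs,EIA923_Schedule_2019.xlsx,EIA923_Schedule_2020.xlsx
```

The other three CSVs follow the same page-per-row pattern, holding the tab name, rows to skip at the top, and rows to skip at the bottom for each year.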
- Map the original column names (converted to `snake_case` first) to the canonical column names that will be used within PUDL. This is done on a page-by-page basis in the CSVs under `src/pudl/package_data/{data source}/column_maps/{page name}.csv`.
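The `snake_case` conversion of a raw spreadsheet header might look like this standalone sketch (`to_snake_case` is an illustrative helper, not a PUDL function):

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a raw spreadsheet column header to snake_case."""
    name = name.strip().lower()
    # Collapse any run of spaces/punctuation into a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


print(to_snake_case("Net Generation (MWh)"))  # net_generation_mwh
```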
- Make sure every column is defined in `src/pudl/metadata/fields.py`. A few columns exist only transiently: e.g. we might have a `net_generation_kwh` column for extraction, which isn't defined in the DB, because by the time it's loaded into the DB the same data is now in a column named `net_generation_mwh`. But most columns do appear in both places. We try to avoid unnecessary renames, but also to ensure that the meaning described by the column name represents the contents of the column, and we change them both simultaneously in the Transform step.

- Follow the column naming conventions: `name`, `id`, and `code` have specific meanings. Data source specific fields end with a data source suffix, e.g. `plant_id_eia` is a plant identifier that only makes sense if you look it up in its home table (where it will typically be a or the primary key). It will probably be an integer (though `generator_id` is a string that's often but not always an integer), and the ID will be consistent across all EIA data sources. All column names are `snake_case`, and we use units as suffixes for data columns, e.g. `grid_voltage_kv` or `net_generation_mwh`.
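The simultaneous rename-and-unit-conversion described above could be sketched like this (pure Python for illustration; this is not PUDL's actual transform code):

```python
def convert_kwh_to_mwh(record: dict) -> dict:
    """Rename net_generation_kwh -> net_generation_mwh, converting units.

    Doing the rename and the unit conversion in a single step ensures the
    column name always matches the column's actual contents.
    """
    out = dict(record)
    if "net_generation_kwh" in out:
        out["net_generation_mwh"] = out.pop("net_generation_kwh") / 1000
    return out


print(convert_kwh_to_mwh({"plant_id_eia": 3, "net_generation_kwh": 4500.0}))
# {'plant_id_eia': 3, 'net_generation_mwh': 4.5}
```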
- Hook the new page into the extraction code. There are `ExcelExtractor` subclasses that exist for each EIA form, and `DataSource` classes that define the years of available data; I don't think you should have to modify those. You would need to update e.g. the `Eia860Settings` class to recognize the newly named table. The best notebook to use for interactive development here is probably `devtools/eia-etl-debug.ipynb`. You can define an `Eia860Settings` object manually, and hand that in to the `pudl.extract.eia860.Extractor.extract()` method to see what happens. You should get back a dictionary of dataframes where the keys are the page names, including your newly defined page, and the values are dataframes containing the concatenated data across all years available.

- Write the transform step for the new table. Look at the `transform()` function and the other functions in any of the sub-packages named by data source under `src/pudl/transform` to see how your new table / function should be integrated.
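The extraction contract described above (page names as keys, multi-year concatenated data as values) can be mocked without PUDL at all. Everything here is illustrative; `fake_extract` and the inline data are stand-ins, not the real `Extractor` API:

```python
def fake_extract(years, pages):
    """Mimic the shape of extract(): {page_name: rows concatenated across years}."""
    raw = {  # stand-in for the per-year spreadsheet contents
        2019: {"generator": [{"generator_id": "1", "report_year": 2019}]},
        2020: {"generator": [{"generator_id": "1", "report_year": 2020}]},
    }
    result = {page: [] for page in pages}
    for year in years:
        for page in pages:
            result[page].extend(raw[year][page])  # concatenate across years
    return result


frames = fake_extract([2019, 2020], ["generator"])
print(sorted(r["report_year"] for r in frames["generator"]))  # [2019, 2020]
```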
- Understand how harvesting will treat the new table. Attributes that are reported in more than one place get deduplicated: all of the values reported for a given entity (keyed by columns like `report_date`, `plant_id_eia`, and `generator_id`) are compiled together, and the most likely to be correct value is selected.
- The entity structure is defined in `src/pudl/metadata/resources/__init__.py`. It indicates which attributes belong to what entities (utilities, plants, generators, boilers...) and whether they are fixed or annual.

- New fields need to be defined in `src/pudl/metadata/fields.py`. They are sorted alphabetically. At the very least they need a globally unique name, a data type (see the examples in there for available types; they're based on the tabular data package standard, and converted to SQLite / Pandas / Arrow types by the code as appropriate), and a description explaining what the column means.
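A hypothetical field entry following that pattern (the exact dict layout in PUDL's `fields.py` may differ; this just shows the minimum required pieces: unique name, type, description):

```python
# Illustrative only; the field name and description are invented.
FIELD_METADATA = {
    "fuel_received_units": {
        "type": "number",  # tabular data package type, mapped to SQLite/Pandas/Arrow
        "description": "Quantity of fuel received, in the units reported by EIA.",
    },
}

print(sorted(FIELD_METADATA))  # fields are kept sorted alphabetically
```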
- Finally, the new table itself needs to be described in `src/pudl/metadata/resources/{data source}.py`, with a schema that references the field names defined in `fields.py`.
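A sketch of what such a table definition might look like; the structure, description, and primary key below are assumptions for illustration, not PUDL's actual metadata:

```python
# Hypothetical resource entry; field names must exist in fields.py.
RESOURCE_METADATA = {
    "fuel_receipts_costs_eia923": {
        "description": "Monthly fuel deliveries reported on EIA Form 923.",
        "schema": {
            "fields": ["plant_id_eia", "report_date", "fuel_received_units"],
            "primary_key": ["plant_id_eia", "report_date"],
        },
    },
}
```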
We have integrated most of the tables from EIA 860 and 923, but we're still missing several. This issue collects all tables that are still missing, so we can keep track of our progress towards complete data integration.