Migrate powerplants_usa project to new `build_powerplants` task

jpvelez commented 1 month ago

Feature Request

PyPSA-USA currently uses the powerplants-usa project to generate its powerplants table, by pulling in data from EIA860, EGRID, and WECC ADS.

We need to migrate this code to a new build_powerplants task, and pull whatever data we can from PUDL. EIA860 docs are here. (See #311)

Suggested Solution

[ ] Port code from build_plant_data.ipynb notebook in powerplant_usa to new build_powerplant.py module in workflow/scripts
[ ] Call module from new build_powerplants rule
[ ] Identify which EIA860 tables to use from PUDL database, grab table names
- PUDL db which will be made available by new retrieve_pudl task implemented in #311
- start by investigating out_eia860_* tables here
[ ] Query those tables from build_powerplants.py
[ ] Determine if PUDL contains EGRID or WECC ADS data, and port those too?

jpvelez commented 1 month ago

Implementation question: should the new build_powerplants rule go in a new build_powerplants.smk file in workflow/rules, should it go in an existing .smk file?

ktehranchi commented 1 month ago

Awesome, thanks for making this.

build_powerplants rule should go in the build_electricity.smk, and the retrieve_pudl rule should go in retrieve.smk

Determine if PUDL contains EGRID or WECC ADS data, and port those too?

I know PDL has the WECC ADS in the GridPath RA toolkit. However I am not sure if it has EGRID values. We can ask @catalyst-cooperative if PUDL has unit level heat-rates calculated from either e-grid or another source.

If pudl doesn't have that we can add another retrieve rule for e-grid or we can switch to using this attached file which a member of my lab created using CEMS data.

CEMS_derive_heat_rates.xlsx

jpvelez commented 1 month ago

Makes sense! Whenever possible, we should think "data pipelines" instead of one-off, hand-processed datasets, IMO. (And ideally those can be part of PUDL, assuming those folks agree.)

That way other projects can benefit, the data gets updated, you can audit the data cleaning logic, etc.

ktehranchi commented 1 month ago

Totally agree- maybe catalyst folks could recommend best way to store single-point heat-rate values calculated from CEMS/egrid within PUDL. Maybe they have this already.

zaneselvans commented 1 month ago

We have a very wide output table full of generator attributes, including unit heat rates: out_eia__monthly_generators. These are derived from the EIA-923 rather than EPA CEMS, but we've definitely thought about creating an aggregated heat rate output from CEMS. If you've already got a method you like, maybe that could be turned into a new PUDL table that's dowstream of CEMS?

ktehranchi commented 1 month ago

Oooo nice! Should these monthly heat rates and fuel costs be interpreted as the HR/FC for the month of X report date? or are these calculated as HR/FC for the last year of reported data for a given plant?

Noting this could be a fix for #241 as well- and help improve our build_fuel_costs module.

If you've already got a method you like, maybe that could be turned into a new PUDL table that's dowstream of CEMS?

I will check with the person who wrote that CEMS code if he has compared the results with 923. He did mention CEMS is preferred to egrid for heat rates....

zaneselvans commented 1 month ago

@ktehranchi these should be interpreted as the average heat and fuel costs for the month, though that's not always 100% true. There are some plants that only report annual totals, and those have to be distributed across all of the months somehow. Additionally, the fuel costs are based on the fuel deliveries in a given month, which may or may not correspond directly to the fuel that was actually burned in that month (especially in the case of solid and liquid fuels that can be stored)

Additionally, there are some significant fuel cost redactions, particularly for natural gas in the northeast, which is unfortunate since the prices there have a big seasonal variability. We use state/region average fuel prices from the bulk EIA aggregated data to fill in as many of those gaps as we can, but it's not complete or perfect.

stephendeyoung commented 1 month ago

FYI: I've started to work on this here: https://github.com/stephendeyoung/pypsa-usa/commit/470ce5c209c654367d6cd8265092b9db126a6181

stephendeyoung commented 1 month ago

I've continued porting the code from the powerplants_usa github repo to this one. I've created a notebook so I can compare the results of the ported code with the results in the build_plant_data.ipynb notebook. @ktehranchi can you check that this is looking ok so far? I noticed that the saved results in the build_plant_data.ipynb github repo are not the same as when I run it locally. When I run locally I only get results for storage generators. Just wanted to make sure that was expected.

A few other issues I came across:

The stable build of the pudl sqlite db does not have the core_eia860__scd_generators_energy_storage table so I've had to use a nightly version
Later report entries in the generators tables have null values that used to be non-null. I started a discussion about this here.

ktehranchi commented 1 month ago

Replacing cell 7 with the code below fixes the issue with only storage devices presenting after the join. Issue is with the INNER join as opposed to LEFT JOIN since SCD_gens contains all gens but storage will only contain storage devices.

eia_data_operable = pd.read_sql_query(
    """
    SELECT
        core_eia860__scd_generators.plant_id_eia,
        _out_eia__yearly_generators.plant_name_eia,
        core_eia860__scd_generators.generator_id,
        core_eia860__scd_generators.capacity_mw,
        core_eia860__scd_generators.summer_capacity_mw,
        core_eia860__scd_generators.winter_capacity_mw,
        core_eia860__scd_generators.minimum_load_mw,
        core_eia860__scd_generators.energy_source_code_1,
        core_eia860__scd_generators.technology_description,
        core_eia860__scd_generators.operational_status_code,
        core_eia860__scd_generators.prime_mover_code,
        core_eia860__scd_generators.planned_generator_retirement_date,
        core_eia860__scd_generators.energy_storage_capacity_mwh,
        _out_eia__yearly_generators.generator_operating_date,
        core_eia860__scd_generators_energy_storage.max_charge_rate_mw,
        core_eia860__scd_generators_energy_storage.max_discharge_rate_mw,
        core_eia860__scd_generators_energy_storage.storage_technology_code_1,
        core_eia860__scd_plants.nerc_region,
        core_eia860__scd_plants.balancing_authority_code_eia,
        _out_eia__yearly_generators.state,
        _out_eia__yearly_generators.latitude,
        _out_eia__yearly_generators.longitude
    FROM core_eia860__scd_generators
    LEFT JOIN _out_eia__yearly_generators ON core_eia860__scd_generators.plant_id_eia = _out_eia__yearly_generators.plant_id_eia AND core_eia860__scd_generators.generator_id = _out_eia__yearly_generators.generator_id
    LEFT JOIN core_eia860__scd_generators_energy_storage ON core_eia860__scd_generators.plant_id_eia = core_eia860__scd_generators_energy_storage.plant_id_eia AND core_eia860__scd_generators.generator_id = core_eia860__scd_generators_energy_storage.generator_id
    LEFT JOIN core_eia860__scd_plants ON core_eia860__scd_generators.plant_id_eia = core_eia860__scd_plants.plant_id_eia
    GROUP BY core_eia860__scd_generators.plant_id_eia, core_eia860__scd_generators.generator_id
    HAVING MAX(core_eia860__scd_generators.report_date)
    ORDER BY core_eia860__scd_generators.report_date DESC;
    """,
    con,
)

ktehranchi commented 1 month ago

After making this change it still seems like the pudl (left) has more capacity than my previous notebook (right). Looking into it now. I'm guessing its because we don't actually filter out the operable plants from the planned / retired plants.

Screenshot 2024-05-22 at 9 48 08 AM

ktehranchi commented 1 month ago

eia_data_operable = eia_data_operable.query("operational_status == 'existing'")

Will filter out the non-operable plants from the dataframe

stephendeyoung commented 1 month ago

Thanks @ktehranchi. I mentioned previously that later report entries in the generators tables have null values that used to be non-null. Do we need to make sure that if a previous report entry had a non-null value it's still included in the data processing e.g. minimum_load_mw?

stephendeyoung commented 1 month ago

Awesome, thanks for making this.

build_powerplants rule should go in the build_electricity.smk, and the retrieve_pudl rule should go in retrieve.smk

Determine if PUDL contains EGRID or WECC ADS data, and port those too?

I know PDL has the WECC ADS in the GridPath RA toolkit. However I am not sure if it has EGRID values. We can ask @catalyst-cooperative if PUDL has unit level heat-rates calculated from either e-grid or another source.

If pudl doesn't have that we can add another retrieve rule for e-grid or we can switch to using this attached file which a member of my lab created using CEMS data.

CEMS_derive_heat_rates.xlsx

Following my previous comment I need clarification on these points:

Do we need to make sure that if a previous report entry had a non-null value it's still included in the data processing e.g. minimum_load_mw?
WECC ADS - the only gridpath table in the pudl sqlite db is core_gridpathratoolkit__assn_generator_aggregation_group but it doesn't have all the columns referenced in @ktehranchi's notebook. I'm not sure where to get this data
EGRID - can we do anything about this now given the previous discussion in this thread?
What should the output of the build_powerplant.py script be? CSVs of the transformed EIA860, EGRID and WECC ADS data?

zaneselvans commented 1 month ago

The hourly tables were getting too large for the SQLite DB, so now they are being distributed as Apache Parquet files. There's one that contains hourly available capacity factors at the BA level. However we did not integrate the WECC ADS data because it was unclear whether it could be legally redistributed or used by parties other than the WECC membership for the purposes specified on the WECC ADS download page.

ktehranchi commented 1 month ago

Sorry for the delay! I was traveling the last few days.

Do we need to make sure that if a previous report entry had a non-null value it's still included in the data processing e.g. minimum_load_mw?

Yes, we should fillforward any of the null values which were previously non-null. Zane mentioned the "slowly changing dimension" (SCD) table which we can use to fill forward.

WECC ADS - the only gridpath table in the pudl sqlite db is core_gridpathratoolkit__assn_generator_aggregation_group but it doesn't have all the columns referenced in @ktehranchi's notebook. I'm not sure where to get this data

I previously had a retrieve rule to download the public version of the WECC ADS which you don't need NDA for. We can use this or just use the csv file I have in the powerplants_usa repository by moving it into the workflow/repo_data folder. We only need one file from the ADS not the whole thing.

EGRID - can we do anything about this now given the previous discussion in this thread?

Instead of EGRID, I would opt to use the out_eia__monthly_generators table that zane mentioned. I would like to start with an annual average heat-rate from that table. Down the road we can explore using the monthly data. If you want to leave this work for me to complete, I'm happy to do so.

What should the output of the build_powerplant.py script be? CSVs of the transformed EIA860, EGRID and WECC ADS data?

The output should be a single CSV of all generators/storage devices in the EIA860, imputed with heat-rate from PUDL and ramp rate data from the ADS.

stephendeyoung commented 3 weeks ago

@ktehranchi I've ported the code over to the pypsa-usa project now which is in a notebook for now. A few things to point out:

I couldn't find a way of doing the fillforward with sqlite and loading all the data in pandas takes too much memory. I ended up using DuckDB instead to query the data in sqlite and do the fillforward. I find DuckDB quite useful as an alternative to pandas because you can write data manipulation logic in SQL. Are you happy with this approach? I've added duckdb as a dependency.
Regarding the WECC ADS data, you mentioned a rule you had to download a public version of this data. I've removed the if statement here. Is that sufficient?
Regarding EGRID, I've included the unit_heat_rate_mmbtu_per_mwh from out_eia__monthly_generators. But there might be more work here like you were describing. If you look at the end of my notebook you'll see I'm getting different results for the hr_diff column. Should I keep the unit_heat_rate_mmbtu_per_mwh and leave the data manipulation to you?
In the powerplants_usa project you were using eia_ads_generator_mapping_updated.csv to map the WECC ADS data to the EIA generators. I've copied this over as a csv and put it in the repo_data for now. Is that ok or can we download this mapping from somewhere?

If you're happy with all of that I can put it into a script and finalise this work.

ktehranchi commented 3 weeks ago

Excellent, thanks so much for your work on this, Stephen! I'll answer below, but yes, feel free to submit a PR with these, and I can complete the remaining tasks and merge.

I couldn't find a way of doing the fillforward with sqlite and loading all the data in pandas takes too much memory. I ended up using DuckDB instead to query the data in sqlite and do the fillforward. I find DuckDB quite useful as an alternative to pandas because you can write data manipulation logic in SQL. Are you happy with this approach? I've added duckdb as a dependency.

Yep, that works great for me!

Regarding the WECC ADS data, you mentioned a rule you had to download a public version of this data. I've removed the if statement here. Is that sufficient?

Yep, this works!

Regarding EGRID, I've included the unit_heat_rate_mmbtu_per_mwh from out_eia__monthly_generators. But there might be more work here like you were describing. If you look at the end of my notebook you'll see I'm getting different results for the hr_diff column. Should I keep the unit_heat_rate_mmbtu_per_mwh and leave the data manipulation to you?

Yep- leaving the unit_heat_rate_mmbtu_per_mwh data sounds good, i can complete the manipulation from there!

In the powerplants_usa project you were using eia_ads_generator_mapping_updated.csv to map the WECC ADS data to the EIA generators. I've copied this over as a csv and put it in the repo_data for now. Is that ok or can we download this mapping from somewhere?

I made this mapping myself, with some manual matching for plants whose IDs didn't match well. So, as you have already done, we'll keep this in repo_data.

stephendeyoung commented 3 weeks ago

@ktehranchi should be good to go now.

PyPSA / pypsa-usa

Migrate powerplants_usa project to new `build_powerplants` task #312

Feature Request

Suggested Solution