Open jpvelez opened 1 month ago
Implementation question: should the new build_powerplants
rule go in a new build_powerplants.smk
file in workflow/rules
, should it go in an existing .smk
file?
Awesome, thanks for making this.
build_powerplants
rule should go in the build_electricity.smk
, and the retrieve_pudl
rule should go in retrieve.smk
Determine if PUDL contains EGRID or WECC ADS data, and port those too?
I know PDL has the WECC ADS in the GridPath RA toolkit. However I am not sure if it has EGRID values. We can ask @catalyst-cooperative if PUDL has unit level heat-rates calculated from either e-grid or another source.
If pudl doesn't have that we can add another retrieve rule for e-grid or we can switch to using this attached file which a member of my lab created using CEMS data.
Makes sense! Whenever possible, we should think "data pipelines" instead of one-off, hand-processed datasets, IMO. (And ideally those can be part of PUDL, assuming those folks agree.)
That way other projects can benefit, the data gets updated, you can audit the data cleaning logic, etc.
Totally agree- maybe catalyst folks could recommend best way to store single-point heat-rate values calculated from CEMS/egrid within PUDL. Maybe they have this already.
We have a very wide output table full of generator attributes, including unit heat rates: out_eia__monthly_generators. These are derived from the EIA-923 rather than EPA CEMS, but we've definitely thought about creating an aggregated heat rate output from CEMS. If you've already got a method you like, maybe that could be turned into a new PUDL table that's dowstream of CEMS?
Oooo nice! Should these monthly heat rates and fuel costs be interpreted as the HR/FC for the month of X report date? or are these calculated as HR/FC for the last year of reported data for a given plant?
Noting this could be a fix for #241 as well- and help improve our build_fuel_costs
module.
If you've already got a method you like, maybe that could be turned into a new PUDL table that's dowstream of CEMS?
I will check with the person who wrote that CEMS code if he has compared the results with 923. He did mention CEMS is preferred to egrid for heat rates....
@ktehranchi these should be interpreted as the average heat and fuel costs for the month, though that's not always 100% true. There are some plants that only report annual totals, and those have to be distributed across all of the months somehow. Additionally, the fuel costs are based on the fuel deliveries in a given month, which may or may not correspond directly to the fuel that was actually burned in that month (especially in the case of solid and liquid fuels that can be stored)
Additionally, there are some significant fuel cost redactions, particularly for natural gas in the northeast, which is unfortunate since the prices there have a big seasonal variability. We use state/region average fuel prices from the bulk EIA aggregated data to fill in as many of those gaps as we can, but it's not complete or perfect.
FYI: I've started to work on this here: https://github.com/stephendeyoung/pypsa-usa/commit/470ce5c209c654367d6cd8265092b9db126a6181
I've continued porting the code from the powerplants_usa github repo to this one. I've created a notebook so I can compare the results of the ported code with the results in the build_plant_data.ipynb notebook. @ktehranchi can you check that this is looking ok so far? I noticed that the saved results in the build_plant_data.ipynb github repo are not the same as when I run it locally. When I run locally I only get results for storage generators. Just wanted to make sure that was expected.
A few other issues I came across:
core_eia860__scd_generators_energy_storage
table so I've had to use a nightly versionReplacing cell 7 with the code below fixes the issue with only storage devices presenting after the join. Issue is with the INNER
join as opposed to LEFT JOIN
since SCD_gens contains all gens but storage will only contain storage devices.
eia_data_operable = pd.read_sql_query(
"""
SELECT
core_eia860__scd_generators.plant_id_eia,
_out_eia__yearly_generators.plant_name_eia,
core_eia860__scd_generators.generator_id,
core_eia860__scd_generators.capacity_mw,
core_eia860__scd_generators.summer_capacity_mw,
core_eia860__scd_generators.winter_capacity_mw,
core_eia860__scd_generators.minimum_load_mw,
core_eia860__scd_generators.energy_source_code_1,
core_eia860__scd_generators.technology_description,
core_eia860__scd_generators.operational_status_code,
core_eia860__scd_generators.prime_mover_code,
core_eia860__scd_generators.planned_generator_retirement_date,
core_eia860__scd_generators.energy_storage_capacity_mwh,
_out_eia__yearly_generators.generator_operating_date,
core_eia860__scd_generators_energy_storage.max_charge_rate_mw,
core_eia860__scd_generators_energy_storage.max_discharge_rate_mw,
core_eia860__scd_generators_energy_storage.storage_technology_code_1,
core_eia860__scd_plants.nerc_region,
core_eia860__scd_plants.balancing_authority_code_eia,
_out_eia__yearly_generators.state,
_out_eia__yearly_generators.latitude,
_out_eia__yearly_generators.longitude
FROM core_eia860__scd_generators
LEFT JOIN _out_eia__yearly_generators ON core_eia860__scd_generators.plant_id_eia = _out_eia__yearly_generators.plant_id_eia AND core_eia860__scd_generators.generator_id = _out_eia__yearly_generators.generator_id
LEFT JOIN core_eia860__scd_generators_energy_storage ON core_eia860__scd_generators.plant_id_eia = core_eia860__scd_generators_energy_storage.plant_id_eia AND core_eia860__scd_generators.generator_id = core_eia860__scd_generators_energy_storage.generator_id
LEFT JOIN core_eia860__scd_plants ON core_eia860__scd_generators.plant_id_eia = core_eia860__scd_plants.plant_id_eia
GROUP BY core_eia860__scd_generators.plant_id_eia, core_eia860__scd_generators.generator_id
HAVING MAX(core_eia860__scd_generators.report_date)
ORDER BY core_eia860__scd_generators.report_date DESC;
""",
con,
)
After making this change it still seems like the pudl (left) has more capacity than my previous notebook (right). Looking into it now. I'm guessing its because we don't actually filter out the operable plants from the planned / retired plants.
eia_data_operable = eia_data_operable.query("operational_status == 'existing'")
Will filter out the non-operable plants from the dataframe
Thanks @ktehranchi. I mentioned previously that later report entries in the generators tables have null values that used to be non-null. Do we need to make sure that if a previous report entry had a non-null value it's still included in the data processing e.g. minimum_load_mw
?
Awesome, thanks for making this.
build_powerplants
rule should go in thebuild_electricity.smk
, and theretrieve_pudl
rule should go inretrieve.smk
Determine if PUDL contains EGRID or WECC ADS data, and port those too?
I know PDL has the WECC ADS in the GridPath RA toolkit. However I am not sure if it has EGRID values. We can ask @catalyst-cooperative if PUDL has unit level heat-rates calculated from either e-grid or another source.
If pudl doesn't have that we can add another retrieve rule for e-grid or we can switch to using this attached file which a member of my lab created using CEMS data.
Following my previous comment I need clarification on these points:
minimum_load_mw
?core_gridpathratoolkit__assn_generator_aggregation_group
but it doesn't have all the columns referenced in @ktehranchi's notebook. I'm not sure where to get this databuild_powerplant.py
script be? CSVs of the transformed EIA860, EGRID and WECC ADS data?The hourly tables were getting too large for the SQLite DB, so now they are being distributed as Apache Parquet files. There's one that contains hourly available capacity factors at the BA level. However we did not integrate the WECC ADS data because it was unclear whether it could be legally redistributed or used by parties other than the WECC membership for the purposes specified on the WECC ADS download page.
Sorry for the delay! I was traveling the last few days.
Do we need to make sure that if a previous report entry had a non-null value it's still included in the data processing e.g. minimum_load_mw?
Yes, we should fillforward any of the null values which were previously non-null. Zane mentioned the "slowly changing dimension" (SCD) table which we can use to fill forward.
WECC ADS - the only gridpath table in the pudl sqlite db is core_gridpathratoolkit__assn_generator_aggregation_group but it doesn't have all the columns referenced in @ktehranchi's notebook. I'm not sure where to get this data
I previously had a retrieve rule to download the public version of the WECC ADS which you don't need NDA for. We can use this or just use the csv file I have in the powerplants_usa
repository by moving it into the workflow/repo_data
folder. We only need one file from the ADS not the whole thing.
EGRID - can we do anything about this now given the previous discussion in this thread?
Instead of EGRID, I would opt to use the out_eia__monthly_generators table that zane mentioned. I would like to start with an annual average heat-rate from that table. Down the road we can explore using the monthly data. If you want to leave this work for me to complete, I'm happy to do so.
What should the output of the build_powerplant.py script be? CSVs of the transformed EIA860, EGRID and WECC ADS data?
The output should be a single CSV of all generators/storage devices in the EIA860, imputed with heat-rate from PUDL and ramp rate data from the ADS.
@ktehranchi I've ported the code over to the pypsa-usa project now which is in a notebook for now. A few things to point out:
if
statement here. Is that sufficient?unit_heat_rate_mmbtu_per_mwh
from out_eia__monthly_generators
. But there might be more work here like you were describing. If you look at the end of my notebook you'll see I'm getting different results for the hr_diff
column. Should I keep the unit_heat_rate_mmbtu_per_mwh
and leave the data manipulation to you?eia_ads_generator_mapping_updated.csv
to map the WECC ADS data to the EIA generators. I've copied this over as a csv and put it in the repo_data
for now. Is that ok or can we download this mapping from somewhere?If you're happy with all of that I can put it into a script and finalise this work.
Excellent, thanks so much for your work on this, Stephen! I'll answer below, but yes, feel free to submit a PR with these, and I can complete the remaining tasks and merge.
I couldn't find a way of doing the fillforward with sqlite and loading all the data in pandas takes too much memory. I ended up using DuckDB instead to query the data in sqlite and do the fillforward. I find DuckDB quite useful as an alternative to pandas because you can write data manipulation logic in SQL. Are you happy with this approach? I've added duckdb as a dependency.
Yep, that works great for me!
Regarding the WECC ADS data, you mentioned a rule you had to download a public version of this data. I've removed the if statement here. Is that sufficient?
Yep, this works!
Regarding EGRID, I've included the unit_heat_rate_mmbtu_per_mwh from out_eia__monthly_generators. But there might be more work here like you were describing. If you look at the end of my notebook you'll see I'm getting different results for the hr_diff column. Should I keep the unit_heat_rate_mmbtu_per_mwh and leave the data manipulation to you?
Yep- leaving the unit_heat_rate_mmbtu_per_mwh data sounds good, i can complete the manipulation from there!
In the powerplants_usa project you were using eia_ads_generator_mapping_updated.csv to map the WECC ADS data to the EIA generators. I've copied this over as a csv and put it in the repo_data for now. Is that ok or can we download this mapping from somewhere?
I made this mapping myself, with some manual matching for plants whose IDs didn't match well. So, as you have already done, we'll keep this in repo_data.
@ktehranchi should be good to go now.
Feature Request
PyPSA-USA currently uses the powerplants-usa project to generate its powerplants table, by pulling in data from EIA860, EGRID, and WECC ADS.
We need to migrate this code to a new build_powerplants task, and pull whatever data we can from PUDL. EIA860 docs are here. (See #311)
Suggested Solution
build_powerplant.py
module inworkflow/scripts
build_powerplants
ruleretrieve_pudl
task implemented in #311out_eia860_*
tables herebuild_powerplants.py