catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

utc_offset missing from CEMS plants when running pudl_etl #467

Closed grgmiller closed 4 years ago

grgmiller commented 4 years ago

Describe the bug

When running pudl_etl to create a database of CEMS data from 2017, I get the following error: ValueError: utc_offset should never be missing for CEMS plants, but was missing for these: [55422]

Bug Severity

High: This bug is preventing me from using PUDL.

To Reproduce

In Anaconda Prompt, I enter pudl_etl settings/2017-data.yml. The text of the settings file and the console output are attached: issue_output.txt, settings file text.txt

The process gets through the ETL for FL-2017-12, takes a dramatic pause, and as soon as it resumes number crunching, it throws the error.

Expected behavior

I expected pudl_etl to create a complete datastore of CEMS data.

Software Environment?

Additional context

I have only downloaded eia860, eia923, and epacems, not ferc1 or epaipm

grgmiller commented 4 years ago

This potentially seems related to https://github.com/catalyst-cooperative/pudl/issues/351

karldw commented 4 years ago

Hey @grgmiller, the code is using the EIA 860 data to calculate the utc_offset. A workaround here is to use a wider range of eia860_years in your settings (e.g. [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]).

Ref #178 (plant matching) and #250 (original issue about UTC)
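To illustrate why the EIA 860 years matter here, the following is a minimal sketch of the kind of check that produces this error. It is not PUDL's actual code: the function names, the dictionary-based lookup, and the sample offset values are all assumptions made for illustration. Only the error message text comes from the report above.

```python
from datetime import timedelta
from typing import Dict, Iterable, List, Optional


def find_missing_offsets(
    cems_plant_ids: Iterable[int],
    plant_utc_offset: Dict[int, Optional[timedelta]],
) -> List[int]:
    """Return the sorted, de-duplicated CEMS plant IDs with no usable UTC offset."""
    return sorted(
        {pid for pid in cems_plant_ids if plant_utc_offset.get(pid) is None}
    )


def check_offsets(
    cems_plant_ids: Iterable[int],
    plant_utc_offset: Dict[int, Optional[timedelta]],
) -> None:
    """Raise ValueError if any CEMS plant lacks a UTC offset."""
    missing = find_missing_offsets(cems_plant_ids, plant_utc_offset)
    if missing:
        raise ValueError(
            "utc_offset should never be missing for CEMS plants, "
            f"but was missing for these: {missing}"
        )


# Hypothetical lookup table built from plant location data:
# plant 1 has an offset, plant 55422 does not.
offsets = {1: timedelta(hours=-5), 55422: None}

try:
    check_offsets([1, 55422, 55422], offsets)
except ValueError as err:
    print(err)
```

In this sketch, a plant ends up in the missing list either because it has no entry in the lookup table or because its entry is empty, which is why giving the ETL more EIA 860 years to draw location data from can make the error go away.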

grgmiller commented 4 years ago

Thanks for the suggestion @karldw

I ran the same code again, but this time edited the settings so that eia923_years, eia860_years, and epacems_years are all set to [2016,2017].

At the same point in the process, this time while performing ETL for the 2016 CEMS file, I get the same ValueError:

2019-11-05 18:58:40 [    INFO] pudl.extract.epacems:64 Performing ETL for EPA CEMS hourly FL-2016-12
2019-11-05 18:58:46 [    INFO] pudl.load.csv:97 ===================== Dramatic Pause ====================
2019-11-05 18:58:46 [    INFO] pudl.load.csv:99     Loading 5,597,592 records (790 MB) into PUDL.
2019-11-05 19:00:52 [    INFO] pudl.load.csv:106 ================ Resume Number Crunching ================
Traceback (most recent call last):
  File "C:\Users\gmiller7\anaconda3\envs\pudl\Scripts\pudl_etl-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\cli.py", line 99, in main
    clobber=args.clobber)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 790, in generate_data_packages
    pkg_tables = etl_pkg(pkg_settings, pudl_settings, pkg_bundle_dir)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 722, in etl_pkg
    pkg_dir=pkg_dir
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 398, in _etl_epacems_pkg
    pkg_dir))
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 363, in _etl_epacems_part
    for transformed_df_dict in epacems_transformed_dfs:
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\transform\epacems.py", line 252, in transform
    .pipe(fix_up_dates, plant_utc_offset=plant_utc_offset)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\generic.py", line 5028, in pipe
    return com._pipe(self, func, *args, **kwargs)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\common.py", line 483, in _pipe
    return func(obj, *args, **kwargs)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\transform\epacems.py", line 55, in fix_up_dates
    f"utc_offset should never be missing for CEMS plants, but was "
ValueError: utc_offset should never be missing for CEMS plants, but was missing for these: [55422]

Do I need to try expanding my year range even more, or is there another workaround available?

zaneselvans commented 4 years ago

I would use all the available years of EIA 860: 2011-2017. Many of the static entities (plants, generators, utilities) have inconsistently reported values across years, and we set a consistency threshold for those values, below which they get set to NaN. So if one of the location fields that's being used to infer the timezone (and thus the UTC offset) is too inconsistent, the offset will be unavailable.

We need to make this all more robust. The expectation is that, once we're publishing it on a regular basis, most users will just download the pre-compiled data rather than run the whole involved ETL process themselves. But we're not quite there yet.

Also note that you can expand the set of EIA 860 years w/o expanding the years for 923 or CEMS if you don't want to.
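As a rough illustration of the consistency threshold described above, here is a minimal sketch, not PUDL's actual implementation. The function name harmonize, the 0.7 threshold, and the sample reports are all assumptions. The idea is that a field keeps its most common reported value only if that value accounts for a large enough share of the yearly reports; otherwise it is treated as missing.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def harmonize(
    values: Iterable[Optional[Hashable]],
    threshold: float = 0.7,  # assumed value, for illustration only
) -> Optional[Hashable]:
    """Return the most common reported value, or None if it is too inconsistent."""
    reported = [v for v in values if v is not None]
    if not reported:
        return None
    value, count = Counter(reported).most_common(1)[0]
    # Keep the value only if its share of the reports meets the threshold.
    return value if count / len(reported) >= threshold else None


# Hypothetical yearly reports of one plant's state.
two_years = ["FL", "GA"]
seven_years = ["FL", "FL", "FL", "GA", "FL", "FL", "FL"]

print(harmonize(two_years))    # None: "FL" is only 1 of 2 reports
print(harmonize(seven_years))  # FL: 6 of 7 reports agree
```

With only two years of data, a single conflicting report is enough to drop the field, while the same conflict is outvoted when seven years are available. That is one way a wider eia860_years range can recover a location field, and with it the UTC offset.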

grgmiller commented 4 years ago

Thank you @zaneselvans, that seems to have done the trick! I edited my settings file to include the following settings:

eia923_years: [2017]
eia860_years: [2011,2012,2013,2014,2015,2016,2017]
epacems_years: [2017]

I now have a valid data package for all epacems data for 2017.

briannacote commented 4 years ago

Just to note, this chain helped me as I ran into this issue too. :) Thank you!