catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

transform/epacems.py to_timedelta has no "box" argument #533

Closed grgmiller closed 4 years ago

grgmiller commented 4 years ago

Describe the bug

When running pudl_etl on 2015 data, I get the following error:

2020-01-30 17:47:24 [    INFO] pudl.extract.epacems:63 Performing ETL for EPA CEMS hourly AL-2015-12
Traceback (most recent call last):
  File "C:\Users\gmiller7\anaconda3\envs\pudl\Scripts\pudl_etl-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\cli.py", line 94, in main
    pudl.etl.generate_data_packages(
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 790, in generate_data_packages
    pkg_tables = etl_pkg(pkg_settings, pudl_settings, pkg_bundle_dir)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 719, in etl_pkg
    tbls = _etl_epacems_pkg(
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 395, in _etl_epacems_pkg
    epacems_tables.append(_etl_epacems_part(part, epacems_years,
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 363, in _etl_epacems_part
    for transformed_df_dict in epacems_transformed_dfs:
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\transform\epacems.py", line 250, in transform
    raw_df.fillna(pc.epacems_columns_fill_na_dict)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\generic.py", line 5118, in pipe
    return com.pipe(self, func, *args, **kwargs)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\common.py", line 466, in pipe
    return func(obj, *args, **kwargs)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\transform\epacems.py", line 46, in fix_up_dates
    + pd.to_timedelta(df["op_hour"], unit="h", box=False)
TypeError: to_timedelta() got an unexpected keyword argument 'box'

It appears that box is a to_datetime argument, not a to_timedelta argument.

To Reproduce

pudl_etl settings/2015_data.yml

# This file controls the PUDL ETL process, and is used as input to pudl_etl

pkg_bundle_name: 2015_data
pkg_bundle_settings:
  - name: epacems
    title: EPA Continuous Emissions Monitoring System Hourly
    description: Hourly emissions, power output, heat rates, and other data for most US fossil fuel plants.
    datasets:
      - eia:
          eia923_tables:
            - generation_fuel_eia923
            - boiler_fuel_eia923
            - generation_eia923
            - coalmine_eia923
            - fuel_receipts_costs_eia923
          eia923_years: [2011,2012,2013,2014,2015,2016,2017]
          eia860_tables:
            - boiler_generator_assn_eia860
            - utilities_eia860
            - plants_eia860
            - generators_eia860
            - ownership_eia860
          eia860_years: [2011,2012,2013,2014,2015,2016,2017]
      - epacems:
          epacems_years: [2015]
          epacems_states: [ALL]
          partition:
            hourly_emissions_epacems: epacems_years

Curiously, the ETL worked for the first 11 months of AL-2015 data, but this error popped up when it reached AL-2015-12.

zaneselvans commented 4 years ago

Hi Greg, this input file looks like it's from the 0.2.0 version of pudl... are you sure you're using 0.3.0? The giveaway is that in 0.3.0 the first two elements need to be datapkg... rather than pkg...

I've seen this box thing or something like it come up before, but it's only ever been a warning in the past, not something that actually crashed the process. I don't suppose you've upgraded to pandas 1.0 have you? They deprecated a bunch of things and we haven't changed our process to work with it yet.

karldw commented 4 years ago

This is a change in pandas -- the box argument was deprecated before, and is removed in pandas v1.0. Quoting the release notes:

Removed the previously deprecated keyword “box” from to_datetime() and to_timedelta(); in addition these now always returns DatetimeIndex, TimedeltaIndex, Index, Series, or DataFrame (GH24486)
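
As a rough illustration of that change, here is a minimal sketch of the expression from the traceback with the `box` keyword dropped. The sample frame, the `op_date` column, its date format, and the `timestamps` name are assumptions made for the example; only `op_hour` and the `pd.to_timedelta(..., unit="h")` call come from the traceback. It is not the actual PUDL fix.

```python
import pandas as pd

# Hypothetical sample rows. "op_date" and its format are assumptions;
# only "op_hour" appears in the traceback.
df = pd.DataFrame(
    {"op_date": ["12-31-2015", "12-31-2015"], "op_hour": [0, 23]}
)

# Same kind of addition as in fix_up_dates, but without box=False,
# which pandas 1.0 no longer accepts.
timestamps = (
    pd.to_datetime(df["op_date"], format="%m-%d-%Y")
    + pd.to_timedelta(df["op_hour"], unit="h")
)
print(timestamps)
```

Because the input here is a Series, the result is a Series whether or not `box` is passed, so omitting the keyword should behave the same on pandas 0.25 and on 1.0.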

zaneselvans commented 4 years ago

Arrrgh, dangit, I commented out the version pinning in setup.py to allow pandas 1.0 as a test -- just to see how it would break -- and forgot to uncomment it before the 0.3.0 release. Maybe I should do a 0.3.1 to fix that.

zaneselvans commented 4 years ago

Oh wait a minute, no I did not -- I was thinking of accidentally allowing it to install on Python 3.8 in the setup.py. So... it shouldn't have allowed you to try and use pandas 1.0 alongside PUDL 0.3.0.

grgmiller commented 4 years ago

Ah thanks for the thoughts here. When I check my pudl environment in anaconda navigator, it looks like pandas 1.0.0 is installed, but pandas is not an updatable package, so it looks like I cannot roll it back to 0.25.3 ... not sure why it allowed me to update to 1.0.0 - I don't even have 1.0 on my base environment.

It looks like setup.py does include 'pandas>=0.25,<1.0',

Do you think I can fix this by just updating pudl using the following?

conda update conda
conda env update pudl

Or will I need to uninstall and reinstall pudl completely?

On a related note - will the datapackages that I created with the previous version of pudl be the same as the datapackages created with 0.3.0, or would you recommend re-ETLing each datapackage using 0.3.0?

grgmiller commented 4 years ago

Actually, digging into this deeper, I also want to confirm that I updated pudl to 0.3.0 correctly. I had 0.2.0 installed, and to update to 0.3.0 I just opened anaconda prompt and ran:

conda update conda
conda env update pudl

Is this all I had to do, or did I miss a critical step here? It looks like my environment.yml file might not have been updated by this command. Currently, it contains:

name: pudl
channels:
  - conda-forge
  - defaults
dependencies:
  - catalystcoop.pudl
  - dask
  - jupyter
  - jupyterlab
  - pip
  - python>=3.7

briannacote commented 4 years ago

I have a similar question to what Greg just posted. What's the best way to update things completely with the new release?

zaneselvans commented 4 years ago

I think that what @grgmiller did should work, but to be totally sure I would wipe the old conda environment, and re-create it like...

conda env remove --name pudl
conda env create --name pudl --file environment.yml

or something like that. You could also explicitly set catalystcoop.pudl=0.3.0 if you wanted to inside the environment file. To check and see what version of everything you have installed within the environment you can do conda list with the environment activated, and it'll show you all the packages installed there and their versions.

The environment.yml file won't get updated (unless you go in and change it) -- it says which packages to install, and may or may not specify their versions. Though when you run conda env update pudl it should try to upgrade the packages in there to the most recent compatible versions.

grgmiller commented 4 years ago

Thank you @zaneselvans. I ran conda env remove --name pudl, added catalystcoop.pudl=0.3.0 to my environment.yml file and then imported that yml file as a new environment in anaconda navigator. The new pudl environment now has pandas set to 0.25.3. I'll try re-running my ETL and see what happens.

grgmiller commented 4 years ago

One note is that following this process did not actually update any of the files in my pudl workspace. So for me, the etl_example.yml was not updated to the newer version with datapkg... instead of pkg..., and none of the example notebooks in my notebooks folder were updated. How would we actually go about updating the files in our workspace?

zaneselvans commented 4 years ago

Yes, if you want it to overwrite your existing files, you'll need to use the --clobber flag -- and it'll wipe them all out, which you might not want to do if you've been editing them. But you can also run pudl_setup in another directory and it should create new copies of the settings files, notebooks, etc. there.

grgmiller commented 4 years ago

This issue was resolved when I reinstalled PUDL v0.3.0 and made sure that my pudl environment was using pandas v0.25.3 instead of v1.0.0.
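
For anyone who wants to confirm that an environment ended up inside the `pandas>=0.25,<1.0` pin quoted from setup.py, a minimal sketch of such a check follows. The helper names `parse_version` and `within_pin` are hypothetical, and the parser is deliberately simplified: it compares numeric components only and ignores pre-release suffixes, so it is not a substitute for a full version-specifier library.

```python
def parse_version(version):
    """Turn '0.25.3' into (0, 25, 3); anything after the leading digits
    of a component (e.g. 'rc0') is ignored in this simplified sketch."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def within_pin(version, lower="0.25", upper="1.0"):
    """Return True if lower <= version < upper, mirroring the
    'pandas>=0.25,<1.0' requirement (bounds are inclusive/exclusive)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


for candidate in ("0.25.3", "1.0.0"):
    print(candidate, within_pin(candidate))
```

In a real environment the string to test could come from `pandas.__version__`; here the two versions mentioned in this thread are used directly.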