EPAIPM has to be downloaded for pudl_etl settings/etl_example.yml to run

briannacote commented 4 years ago

Hello!

I don't know if I would call this a bug, just something nice to know.

During my first download of data, I only did the following: pudl_data --sources eia860 eia923 ferc1

When I went to run the example file to create the CSVs (pudl_etl settings/etl_example.yml), I ran into an error.

The error:

(pudl2) admins-MacBook-Pro:pudl2 briannacote$ pudl_etl settings/etl_example.yml
2019-09-19 17:24:09 [ INFO] pudl:84 verifying that the data we need exists in the data store
2019-09-19 17:24:09 [ INFO] pudl.etl:643 reading and validating etl settings
2019-09-19 17:24:09 [ INFO] pudl.load.csv:243 Loading Static IPM Tables regions_entity_epaipm dataframe into CSV
2019-09-19 17:24:09 [ INFO] pudl.extract.epaipm:136 Beginning ETL for EPA IPM.
2019-09-19 17:24:09 [ INFO] pudl.extract.epaipm:64 Extracting data from EPA IPM transmission_single_epaipm spreadsheet.
Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/envs/pudl2/bin/pudl_etl", line 10, in <module>
    sys.exit(main())
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/cli.py", line 99, in main
    clobber=args.clobber)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/etl.py", line 790, in generate_data_packages
    pkg_tables = etl_pkg(pkg_settings, pudl_settings, pkg_bundle_dir)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/etl.py", line 733, in etl_pkg
    pkg_dir=pkg_dir
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/etl.py", line 473, in _etl_epaipm
    epaipm_tables, data_dir=data_dir)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/epaipm.py", line 143, in extract
    epaipm_raw_dfs = create_dfs_epaipm(files=epaipm_tables, data_dir=data_dir)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/epaipm.py", line 115, in create_dfs_epaipm
    data_dir
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/epaipm.py", line 66, in get_epaipm_file
    full_filename = get_epaipm_name(filename, data_dir)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/epaipm.py", line 42, in get_epaipm_name
    name = sorted(epaipm_dir.glob(pattern))[0]
IndexError: list index out of range

This makes sense because I didn't download the data. It just caught me off guard at first because I didn't see a place in the etl_example.yml to turn off that data.
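
For readers wondering why missing data shows up as an IndexError: the last frame of the traceback indexes element 0 of a sorted glob result, and a glob over a directory with no matching files yields an empty list. The snippet below is a minimal sketch of that failure mode and of one way to turn it into a clearer message. The function name `first_match` and the pattern used are hypothetical illustrations, not PUDL's actual code.

```python
from pathlib import Path
import tempfile


def first_match(directory: Path, pattern: str) -> Path:
    """Return the first file (in sorted order) matching pattern.

    Hypothetical helper for illustration only; not part of PUDL.
    """
    matches = sorted(directory.glob(pattern))
    if not matches:
        # Without this guard, matches[0] would raise
        # "IndexError: list index out of range", as in the traceback above.
        raise FileNotFoundError(
            f"No file matching {pattern!r} in {directory}; "
            "has this data source been downloaded?"
        )
    return matches[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        try:
            # The directory is empty, so the guard fires.
            first_match(Path(tmp), "*transmission_single*")
        except FileNotFoundError as err:
            print(err)
```

With a guard like this, the error names the missing file pattern and directory instead of pointing at a list index.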

Again, I don't think this is a bug. But it might be good to note that by default you do have to have this data downloaded. I didn't see it in the documentation, but may have missed it too. Nothing big, just helps with setup workflow.

Best, Bri

zaneselvans commented 4 years ago

Yes, the example ETL assumes that you have a little bit of data from each of the possible sources. It's meant to go along with the example ETL instructions in the README (which include downloading the EPAIPM data), so this failure is expected. However, it should have failed in the initial check for whether all the required files are present, so as not to get too far into the ETL process before realizing that it's going to fail.
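
The kind of up-front check described here could look roughly like the sketch below. It assumes, purely for illustration, that each source's files live in a subdirectory of the data directory named after the source, and that a glob pattern per source is enough to tell whether anything was downloaded. The names `missing_sources` and `check_before_etl` are hypothetical and do not correspond to PUDL functions.

```python
from pathlib import Path
from typing import Dict, List


def missing_sources(data_dir: Path, required: Dict[str, str]) -> List[str]:
    """Return the sources with no file matching their pattern.

    `required` maps a source name to a glob pattern. The layout
    data_dir/<source>/ is an assumption made for this sketch.
    """
    missing = []
    for source, pattern in required.items():
        source_dir = data_dir / source
        if not source_dir.is_dir() or not any(source_dir.glob(pattern)):
            missing.append(source)
    return sorted(missing)


def check_before_etl(data_dir: Path, required: Dict[str, str]) -> None:
    """Fail fast, before any extraction starts, if input data is absent."""
    missing = missing_sources(data_dir, required)
    if missing:
        raise FileNotFoundError(
            "Missing input data for: " + ", ".join(missing)
        )
```

Running such a check for every source named in the settings file, before any table is extracted, would report all missing sources at once.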

If you want to turn off the EPA IPM ETL, you might just go ahead and copy the YAML file, and replace the list of epaipm_tables with an empty list, and then use that new YAML file. Or if you never want that data you can remove that package specification from the list of what's processed entirely.

zaneselvans commented 4 years ago

@cmgosnell Is the epaipm source checking whether its data is present before running ETL? It seems like it skipped that step here and got into the actual ETL process.

briannacote commented 4 years ago

How do you comment out the epaipm_tables in the file? This is what I see:

  ###########################################################################
  # EPA IPM SETTINGS
  ###########################################################################
  - name: epaipm-example
    title: EPA Integrated Planning Model Example Package
    description: Transmission, load, and other data from the EPA's Integrated Planning Model.
    datasets:
      - epaipm:
          epaipm_tables:
            - transmission_single_epaipm
            - transmission_joint_epaipm
            - load_curves_epaipm
            - plant_region_map_epaipm

Do you just comment out this entire section? Or remove the table lines?

Thanks again.

zaneselvans commented 4 years ago

The # is the comment character. Anything that comes after the # on a line will be ignored. So you'd put # characters before the epaipm_tables line and the subsequent list items, and replace them with

epaipm_tables: []

which is an empty list. (YAML files can either have lists inside square brackets with commas, or they can have the bulleted lists like you see there, depending on which one is more readable). I.e.

epaipm:
  epaipm_tables: []
# epaipm_tables:
#  - transmission_single_epaipm
#  - transmission_joint_epaipm
#  - load_curves_epaipm
#  - plant_region_map_epaipm
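
To make the effect of the empty list concrete, here is a minimal sketch of how a pipeline might decide which sources to process once the settings have been loaded into Python data structures. It assumes the `datasets` entry is a list of single-key mappings, as in the excerpt quoted earlier in this thread, and that each source's table list is stored under a `<source>_tables` key. The `eia923_tables` key and the table name are hypothetical placeholders, and `active_sources` is not a PUDL function.

```python
from typing import Any, Dict, List


def active_sources(datasets: List[Dict[str, Dict[str, Any]]]) -> List[str]:
    """Return the sources that request at least one table."""
    active = []
    for dataset in datasets:
        for source, params in dataset.items():
            # An empty or absent table list means "skip this source".
            if params.get(f"{source}_tables", []):
                active.append(source)
    return active


# Parsed equivalent of a settings file with EPA IPM switched off.
example = [
    {"epaipm": {"epaipm_tables": []}},
    {"eia923": {"eia923_tables": ["some_table"]}},  # hypothetical entry
]
print(active_sources(example))  # ['eia923']
```

Under these assumptions, an empty `epaipm_tables` list and a missing EPA IPM section lead to the same outcome: the source is never extracted.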

briannacote commented 4 years ago

Forgive my bad git editing previously. You got the nice text boxes going.

That makes sense. Since this line "epaipm_tables: []" was not there like it was for other datasets, I assumed it couldn't be empty.

All of this makes sense. It just caught me off guard since I deviated a bit from the examples. Nothing big. Just wanted to note it.

zaneselvans commented 4 years ago

Triple backticks are your friend! And you can tell it to use language-specific syntax highlighting too.

You could also just remove the entire EPAIPM section, or comment it all out, and you'd get the same results. Our goal is to allow any combination of different data sources to be processed -- you should be able to leave any one of them out and still have everything work -- however we're not going to try and support all possible combinations of years / states etc. Right now we're testing to make sure it's possible to bring in everything from each data source and also just the most recent year by itself. But in the data packages we're going to publish, all of the data will be present (which is important, since as you've noticed, in some cases what values you find in the data depends on which collection of years went into processing it).

zaneselvans commented 4 years ago

Edit docs to make it clear that all commands must be run exactly as they are in the example for it to work.

cmgosnell commented 4 years ago

Hey @briannacote! I'm going to close this issue. I'm editing the main settings file so hopefully commenting out the epaipm_tables will be more clear. Plus Issue #370 will hopefully negate this issue in the future.

briannacote commented 4 years ago

Thank you!!

catalyst-cooperative / pudl

EPAIPM has to be downloaded for pudl_etl settings/etl_example.yml to run #417