catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Downloading 2018 EIA 860 and EIA 923 Data #532

Closed briannacote closed 4 years ago

briannacote commented 4 years ago

Describe the bug

This may not be a bug but a user error. It looks like from a recent update to catalystcoop.pudl I can now perform the normal ETL process for 2018 for EIA 860 and EIA 923?

I did the following within the environment I have setup for pudl:

  1. pip install catalystcoop.pudl --upgrade
  2. In settings/etl_bri.yml (my version of etl_example.yml) I set: eia860_years: [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018] eia923_years: [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
  3. I ran pudl_etl settings/etl_bri.yml
  4. I got the following error: Screen Shot 2020-01-30 at 4 46 34 PM

I think #396 is what I was looking for help on.

Perhaps there is something else I also need to update?

Thank you! Brianna

zaneselvans commented 4 years ago

Hey there, yes a couple of the names of the attributes in the settings file have changed. If you replace your top level pkg_* items with datapkg_* it should work. You can also see the example file that is distributed with the new release, which should be using the correct settings.

briannacote commented 4 years ago

Ahh, cool. I just did a pip install catalystcoop.pudl --upgrade.

Are there other suggested steps to getting the latest example files as well? I just want to make sure I update things appropriately for future changes. For example, this was the next thing I ran into around the ferc settings which might also be in the updated example file: Screen Shot 2020-01-30 at 5 22 06 PM

zaneselvans commented 4 years ago

That table has been removed for the moment since with the inclusion of the older years, the row numbers changed, making the multi-year DB invalid -- we have a way to fix it, but need to map all those old rows before it'll be correct. You can remove it from the list of tables that's loaded into the PUDL DB for FERC Form 1 -- run pudl_setup and check out the new distributed example settings file as a basis for the current ETL process -- I'm not sure if there might be other things that have changed in there... not thinking of any off the top of my head though.

briannacote commented 4 years ago

I think Greg just asked the same question. What's the best way to update to the latest version and make sure the latest is being used in all ways? That's what I'm also curious about.

I'll try out the FERC Form 1. But I'm not sure how to check out the latest example settings file. I'm not sure if you mean through GitHub?

zaneselvans commented 4 years ago

Oh sorry, no I didn't mean that kind of "check out" -- I just meant take a look at.

There's a script included with the package called pudl_setup If you run it, it deploys a few files that are part of the package, including an example settings file. That's what I was suggesting you look at.

zaneselvans commented 4 years ago

(though you can certainly also look at the one that's part of the repo, it's in src/pudl/package_data/settings/)

briannacote commented 4 years ago

Hi Zane,

Thanks for the link to the settings files. That helped me get through the FERC issue I shared above. I tried this again and now ran into this issue:

Screen Shot 2020-02-06 at 4 02 49 PM

Is there any information I can provide to help troubleshoot?

zaneselvans commented 4 years ago

We usually see this error when EIA changes the headers / tabs or other formatting in the spreadsheets.

You might want to use the pudl_data script to pull down a fresh copy of the EIA 923 data, at least for 2018 (there was a late addition of some tabs to it even after the supposedly "final" release). That would look something like:

pudl_data --clobber -s eia923 -y 2018

Also, when you use screenshots to post information here, it's not searchable, so anybody else on the internet who is attempting to figure out the same problem won't come across this post. You can cut-and-paste text here and have it show up exactly as it does in the terminal by enclosing it in triple backticks. If it's a language that Github knows about, you can also tell it how the code should be highlighted. Some examples here..

briannacote commented 4 years ago

Sure thing. Here's that last one written out.

Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/envs/pudl2/bin/pudl_etl", line 10, in <module>
    sys.exit(main())
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/cli.py", line 110, in main
    clobber=args.clobber)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/etl.py", line 866, in generate_datapkg_bundle
    datapkg_resources = etl(datapkg_settings, output_dir, pudl_settings)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/etl.py", line 804, in etl
    dataset_dict[dataset], output_dir, pudl_settings)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/etl.py", line 205, in _etl_eia
    eia923_years=eia923_years, data_dir=pudl_settings["data_dir"])
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/eia923.py", line 241, in extract
    years=eia923_years)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/eia923.py", line 161, in get_eia923_page
    to_drop = [c for c in newdata.columns if c[:8] == 'reserved']
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/extract/eia923.py", line 161, in <listcomp>
    to_drop = [c for c in newdata.columns if c[:8] == 'reserved']
TypeError: 'float' object is not subscriptable
zaneselvans commented 4 years ago

Did pulling a new version of the EIA 923 for 2018 fix this?

briannacote commented 4 years ago

Almost! It finished the ETL process. It looks like it created a new directory called datapkg. Before this directory was datapackage. I have python scripts that hook up to the sqlite setup. I have the following:

pudl_settings = pudl.workspace.setup.get_defaults()
print("Attaching to database:",pudl_settings["pudl_db"])
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

It looks like that still directs to the old datapackage directory because when I update my scripts to pull data for 2018, it doesn't happen. Is there a way to update this side as well?

It's super close.

zaneselvans commented 4 years ago

Hmm. That suggests to me that there's another older version of PUDL floating around somewhere that's being used to generate the paths. How are you controlling the installed versions in your other scriptyland?

zaneselvans commented 4 years ago

(v0.3.1 is the most recent release.)

briannacote commented 4 years ago

Hang tight,...it might have just been me being silly. I created the datapackage, but not the sqlite database. I'm running that now.

zaneselvans commented 4 years ago

Yeaaaaaah, I know it's all chopped up into different steps. The data release includes some scripts that just do all the parts. We want to integrate it into a single thing.

briannacote commented 4 years ago

Ok, so that was definitely that last error I ran into. I just needed to follow directions 😄

Here's the next one. Did some table names change? I know you guys provided at some point a quick way to go through an sqlite file with either a Jupyter notebook or something like that, but I can't seem to find it. Am I remembering correctly?

Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such column: p.plant_name

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./pull_pudl_data_sqlite.py", line 97, in <module>
    df_860 = pd.read_sql(statement_860, pudl_engine)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 436, in read_sql
    chunksize=chunksize,
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 1218, in read_query
    result = self.execute(*args)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/pandas/io/sql.py", line 1087, in execute
    return self.connectable.execute(*args, **kwargs)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2169, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 982, in execute
    return self._execute_text(object_, multiparams, params)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1155, in _execute_text
    parameters,
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/Users/briannacote/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: p.plant_name
[SQL:  SELECT
        g.plant_id_eia,
        g.generator_id,
        g.technology_description,
        p.plant_name,
        p.county,
        p.state,
        p.latitude,
        p.longitude,
        p.balancing_authority_code,
        g.capacity_mw,
        g.nameplate_power_factor,
        g.summer_capacity_mw,
        g.winter_capacity_mw,
        g.minimum_load_mw,
        pm.operating_date,
        g.retirement_date,
        g.time_cold_shutdown_full_load_code,
        pm.prime_mover_code,
        g.energy_source_code_1
    from generators_eia860 g
inner join plants_entity_eia p on g.plant_id_eia =p.plant_id_eia
join generators_entity_eia pm on (g.plant_id_eia = pm.plant_id_eia and g.generator_id =pm.generator_id)
where
g.report_date = '2017-01-01'
and (p.state is null or p.state not in ('AK', 'HI'))
and g.operational_status_code in ('existing','OP','OS','SB','OA')
order by g.plant_id_eia, g.capacity_mw;
 ]
(Background on this error at: http://sqlalche.me/e/e3q8)
zaneselvans commented 4 years ago

Yes, the column names and coding are definitely still evolving. One thing that changed is because we're preserving plant names from both FERC and EIA and their different we wanted to make it explicit which one was being referred to -- so instead of plant_name now you'll get plant_name_eia or plant_name_ferc1 depending where you're selecting. Not sure if the operational_status_code values changed, but possibly.

There's the pudl_intro.ipynb notebook which pulls some tables as an example and might be helpful. It should get deployed into the notebooks directory in your PUDL data management workspace. Is that what you're thinking of?

zaneselvans commented 4 years ago

We also need to make an easy to read version of all the table metadata and integrate it into the documentation on Read The Docs.

briannacote commented 4 years ago

That is exactly the notebook, thank you! It had been a while since I used it and couldn't remember where it was. I got the new column name set. I'll check out the operational_status_code as I have some work by hand I did with the 2018 data that I can back check with.

2018 is a go! Thank you for the help, Zane!