catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

LMP data availability #413

Closed briannacote closed 4 years ago

briannacote commented 4 years ago

Hello there!

We have a few quick questions around the LMP data that you are working on wrangling.

Can you share the sources of that data for the various ISOs? It would be great to know that. I know it can be quite complicated or different depending on the ISO.

Do you know off-hand what the historical availability is for this data? For instance, for PJM through Data Miner 2 at least, it seems LMP data only goes back to 2018 for certain granularities. Having some idea of what might be available in the future will be helpful.

Roughly when might the LMP data start to be available through PUDL? I am sure this information is somewhere on GitHub and I just missed it. Basically, we are curious whether it is a 2019 thing or beyond. We aren't looking for a set day or anything.

We'll be checking out the new setup soon (RIP Postgres). I'll probably have questions. But in the meantime, congrats on getting through (or almost through) that big change.

Thank you very much! Bri

zaneselvans commented 4 years ago

Hey there. We haven't yet started integrating the LMP data, but it is part of the Sloan grant that we are currently working on, which runs through April of 2020, so we have to have some version of it integrated by then.

We'll start by looking at the work that Patrick Brown at MIT has done, as well as the PyISO project from WattTime. I believe Patrick was able to get close to a decade of data compiled. We've been assuming that we'll need to access the data directly through whatever web interface or API the various ISOs provide.

If you did need to stick with PostgreSQL, that should also be doable -- the script which takes the data packages and populates the SQLite database ultimately relies on SQLAlchemy, which uses relatively similar interfaces for a variety of database back ends. I think PostgreSQL's SQL is close to a superset of SQLite's, so it might take some minor tweaks, but you could probably create a datapkg_to_postgres script based on the datapkg_to_sqlite script without too much pain.
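To illustrate the idea (this is a sketch, not PUDL code -- both URLs below are placeholders, and make_engine is a hypothetical helper): with SQLAlchemy, the back end is chosen entirely by the connection URL, so a datapkg_to_postgres script would mostly differ from datapkg_to_sqlite in the URL it passes in.

```python
from sqlalchemy import create_engine, text

# Placeholder URLs -- substitute your own paths and credentials.
SQLITE_URL = "sqlite:///pudl.sqlite"
POSTGRES_URL = "postgresql://user:password@localhost:5432/pudl"

def make_engine(url):
    """The back end is selected by the URL; the calling code is unchanged."""
    return create_engine(url)

# Quick sanity check against an in-memory SQLite database:
engine = make_engine("sqlite://")
with engine.connect() as conn:
    assert conn.execute(text("SELECT 1")).scalar() == 1
```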

briannacote commented 4 years ago

Hi Zane, thank you for this information and the fast response. It's good to know for planning and usage.

I will check out the data package you are talking about. Since we already have things built up around the Postgres setup, that would make things easiest. But we will take the time to evaluate what is best and decide whether we should change for any reason as well.

Much appreciated! I'll go ahead and close this out.

briannacote commented 4 years ago

As a follow-up from yesterday, I took a stab at creating a "datapkg_to_postgres" script.

In pudl_settings I added the final entry below (pudl_db_postgres):

{'pudl_in': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2',
 'data_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/data',
 'settings_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/settings',
 'pudl_out': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2',
 'sqlite_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/sqlite',
 'parquet_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/parquet',
 'datapackage_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/datapackage',
 'notebook_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/notebook',
 'ferc1_db': 'sqlite:////Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/sqlite/ferc1.sqlite',
 'pudl_db': 'sqlite:////Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/sqlite/pudl.sqlite',
 'pudl_db_postgres': 'postgresql://user:password@localhost:5432/postgres'} 

I had to do a pip install psycopg2.

I ran the following script: datapkg_to_sqlite --pkg_bundle_name pudl-example.

And received the following error:

(pudl2) admins-MacBook-Pro:pudl2 briannacote$ datapkg_to_sqlite --pkg_bundle_name pudl-example
2019-09-20 14:44:22 [    INFO] pudl:113 Determining PUDL data management environment.
2019-09-20 14:44:22 [    INFO] pudl:116 pudl_in=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 14:44:22 [    INFO] pudl:117 pudl_out=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 14:44:22 [    INFO] pudl:121 Flattening datapackages within pudl-example.
2019-09-20 14:44:22 [    INFO] pudl.convert.flatten_datapkgs:190 Checking for matching ETL parameters across data packages
2019-09-20 14:44:23 [    INFO] pudl.load.metadata:524 Validating the data package...
2019-09-20 14:44:27 [    INFO] pudl.load.metadata:530 Congrats! You made a valid data package!
2019-09-20 14:44:27 [    INFO] pudl:128 Converting flattened datapackage into an SQLite database.
2019-09-20 14:44:27 [    INFO] pudl.convert.datapkg_to_sqlite:49 Dropping the current PUDL DB, if it exists.
Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.ActiveSqlTransaction: VACUUM cannot run inside a transaction block

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/envs/pudl2/bin/datapkg_to_sqlite", line 10, in <module>
    sys.exit(main())
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 132, in main
    pkg_name='pudl-all')
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 52, in pkg_to_sqlite_db
    pudl.helpers.drop_tables(pudl_engine)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/helpers.py", line 746, in drop_tables
    conn.execute("VACUUM")
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 982, in execute
    return self._execute_text(object_, multiparams, params)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1155, in _execute_text
    parameters,
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
    e, statement, parameters, cursor, context
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InternalError: (psycopg2.errors.ActiveSqlTransaction) VACUUM cannot run inside a transaction block

[SQL: VACUUM]
(Background on this error at: http://sqlalche.me/e/2j85)

I do have a screenshot attached pointing out the changes made. I can send any file diffs if that is helpful.

Some basic googling points to the transaction isolation level used when dropping the database. I tried a few things around this (setting it to zero, for instance) but didn't get much further. Am I on the right path so far? Do you have any insight into the error above?

[Screenshot: Screen Shot 2019-09-20 at 3 18 20 PM]

briannacote commented 4 years ago

I got a little farther on this to let you know. I added:

pudl_engine = sa.create_engine(pudl_settings['pudl_db_postgres'],isolation_level="AUTOCOMMIT")

I also updated this back to storage='sql':

        pkg.save(storage='sql', engine=pudl_engine, merge_groups=True,
                 autoincrement=autoincrement)
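For context on why the AUTOCOMMIT setting above helps: VACUUM must run outside of a transaction in both PostgreSQL and SQLite, so the restriction can be demonstrated with nothing but the standard library (a sketch, unrelated to PUDL's actual code):

```python
import sqlite3

# With the default isolation level, an INSERT opens an implicit transaction,
# and VACUUM is then rejected -- the same restriction psycopg2 reports.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")  # opens a transaction
try:
    conn.execute("VACUUM")
    vacuum_in_txn_ok = True
except sqlite3.OperationalError:
    vacuum_in_txn_ok = False
assert not vacuum_in_txn_ok

# In autocommit mode (isolation_level=None), VACUUM runs fine.
auto = sqlite3.connect(":memory:", isolation_level=None)
auto.execute("CREATE TABLE t (x INTEGER)")
auto.execute("VACUUM")
```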

The new error I received is here:

(pudl2) admins-MacBook-Pro:pudl2 briannacote$ datapkg_to_sqlite --pkg_bundle_name pudl-example
2019-09-20 16:59:25 [    INFO] pudl:110 Determining PUDL data management environment.
2019-09-20 16:59:25 [    INFO] pudl:113 pudl_in=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 16:59:25 [    INFO] pudl:114 pudl_out=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 16:59:25 [    INFO] pudl:118 Flattening datapackages within pudl-example.
2019-09-20 16:59:25 [    INFO] pudl.convert.flatten_datapkgs:190 Checking for matching ETL parameters across data packages
2019-09-20 16:59:26 [    INFO] pudl.load.metadata:524 Validating the data package...
2019-09-20 16:59:30 [    INFO] pudl.load.metadata:530 Congrats! You made a valid data package!
2019-09-20 16:59:30 [    INFO] pudl:125 Converting flattened datapackage into an SQLite database.
2019-09-20 16:59:30 [    INFO] pudl.convert.datapkg_to_sqlite:48 Dropping the current PUDL DB, if it exists.
2019-09-20 16:59:30 [    INFO] pudl.convert.datapkg_to_sqlite:70 Loading the data package into SQLite.
2019-09-20 16:59:30 [    INFO] pudl.convert.datapkg_to_sqlite:71 If you're loading EPA CEMS, this could take a while.
2019-09-20 16:59:30 [    INFO] pudl.convert.datapkg_to_sqlite:72 It might be a good time to get lunch, or go for a bike ride.
Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.InvalidForeignKey: there is no unique constraint matching given keys for referenced table "generators_entity_eia"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/briannacote/anaconda3/envs/pudl2/bin/datapkg_to_sqlite", line 10, in <module>
    sys.exit(main())
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 129, in main
    pkg_name='pudl-all')
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 76, in pkg_to_sqlite_db
    autoincrement=autoincrement)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/datapackage/package.py", line 268, in save
    storage.create(buckets, schemas, force=True)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/tableschema_sql/storage.py", line 109, in create
    self.__metadata.create_all()
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 4294, in create_all
    ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1615, in _run_visitor
    visitorcallable(self.dialect, self, **kwargs).traverse_single(element)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 132, in traverse_single
    return meth(obj, **kw)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/ddl.py", line 781, in visit_metadata
    _is_metadata_operation=True,
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 132, in traverse_single
    return meth(obj, **kw)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/ddl.py", line 826, in visit_table
    include_foreign_key_constraints,
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/ddl.py", line 72, in _execute_on_connection
    return connection._execute_ddl(self, multiparams, params)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1050, in _execute_ddl
    compiled,
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
    e, statement, parameters, cursor, context
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InvalidForeignKey) there is no unique constraint matching given keys for referenced table "generators_entity_eia"

[SQL: 
CREATE TABLE ownership_eia860 (
    id SERIAL NOT NULL, 
    report_date DATE, 
    utility_id_eia INTEGER, 
    plant_id_eia INTEGER, 
    generator_id TEXT, 
    operational_status_code TEXT, 
    owner_utility_id_eia INTEGER, 
    owner_name TEXT, 
    owner_state TEXT, 
    owner_city TEXT, 
    owner_street_address TEXT, 
    owner_zip_code TEXT, 
    fraction_owned NUMERIC, 
    PRIMARY KEY (id), 
    FOREIGN KEY(utility_id_eia) REFERENCES utilities_entity_eia (utility_id_eia), 
    FOREIGN KEY(plant_id_eia) REFERENCES generators_entity_eia (plant_id_eia), 
    FOREIGN KEY(generator_id) REFERENCES generators_entity_eia (generator_id)
)

]
(Background on this error at: http://sqlalche.me/e/f405)

Steps closer. It does seem to create the tables in my Postgres database, and it loads data into some of them. But other tables have no data yet, because of the error above.
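For what it's worth, one plausible reading of the failing DDL above (an editorial aside, not something the thread confirmed): if generators_entity_eia has a composite primary key on (plant_id_eia, generator_id), PostgreSQL will reject the two separate single-column foreign keys, since each references only part of a unique constraint. A single composite foreign key does match; a sketch using the stdlib SQLite driver, with the table names reused purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# Parent table with a composite primary key.
conn.execute("""
    CREATE TABLE generators_entity_eia (
        plant_id_eia INTEGER,
        generator_id TEXT,
        PRIMARY KEY (plant_id_eia, generator_id)
    )
""")

# One composite foreign key matches the parent's unique constraint;
# two separate single-column keys (as in the failing DDL) would not.
conn.execute("""
    CREATE TABLE ownership_eia860 (
        plant_id_eia INTEGER,
        generator_id TEXT,
        fraction_owned NUMERIC,
        FOREIGN KEY (plant_id_eia, generator_id)
            REFERENCES generators_entity_eia (plant_id_eia, generator_id)
    )
""")

conn.execute("INSERT INTO generators_entity_eia VALUES (1, 'G1')")
conn.execute("INSERT INTO ownership_eia860 VALUES (1, 'G1', 0.5)")  # parent row exists
assert conn.execute("SELECT COUNT(*) FROM ownership_eia860").fetchone()[0] == 1
```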

zaneselvans commented 4 years ago

Without looking into it deeply, my guess is this has to do with the autoincrement parameter -- PostgreSQL and SQLite create autoincrementing surrogate keys slightly differently (for tables where there's no naturally occurring collection of fields that will uniquely identify every record). We had to give SQLite a list of autoincrementing keys; that's what the autoincrement parameter in the call to pkg.save() is. You might try just not including that parameter and see what happens? PostgreSQL is smarter about autoincrementing keys in general. Or it might not work without additional re-jiggering of the datapackage library and/or the metadata in the datapackage, at which point it could be less trouble to see if the SQLite DB works for you.
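The surrogate-key difference is concrete: the Postgres DDL above uses a SERIAL column, while SQLite treats a plain INTEGER PRIMARY KEY as an alias for its auto-assigned rowid, which is why the SQLite path needs the explicit autoincrement list. A minimal stdlib illustration of the SQLite side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite, an INTEGER PRIMARY KEY column is auto-assigned when omitted
# from the INSERT -- no SERIAL (or explicit sequence) required.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO t (val) VALUES ('a')")
conn.execute("INSERT INTO t (val) VALUES ('b')")
ids = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")]
assert ids == [1, 2]
```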

zaneselvans commented 4 years ago

You might also want to check out this page on how to format and highlight code on Github.

briannacote commented 4 years ago

Removing autoincrement did not make a difference, unfortunately. Worth a try, though.

It was certain foreignKeys entries in the datapackage.json files that were causing issues for certain tables. Once those were removed, the process worked.

My understanding of the foreign key restriction is that nothing will get added to certain tables if it isn't referenceable in other tables. But this concept is quite new to me and I may be off base.

As a specific example, all the foreignKeys for generators_eia860 were removed from the datapackage.json file. This included removal of:

               "foreignKeys": [
                    {
                        "fields": "plant_id_eia",
                        "reference": {
                            "resource": "generators_entity_eia",
                            "fields": "plant_id_eia"
                        }
                    },
                    {
                        "fields": "generator_id",
                        "reference": {
                            "resource": "generators_entity_eia",
                            "fields": "generator_id"
                        }
                    }
                ],

zaneselvans commented 4 years ago

Without the foreign key relationships, you won't have constraints relating the different tables to each other -- in this case you won't know whether the plant and generator ids in the generators_eia860 table are valid, and tables that rely on the entity tables to relate to each other may end up disconnected.
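A tiny sketch of what that constraint buys you, using stdlib SQLite (note that SQLite only enforces foreign keys when the pragma is switched on; the table names here are just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE generators_entity_eia (plant_id_eia INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE generators_eia860 (
        plant_id_eia INTEGER REFERENCES generators_entity_eia (plant_id_eia)
    )
""")
conn.execute("INSERT INTO generators_entity_eia VALUES (1)")

# With the constraint in place, a plant id that has no entity row is rejected:
try:
    conn.execute("INSERT INTO generators_eia860 VALUES (99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
assert orphan_rejected
```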

briannacote commented 4 years ago

Why would this not work with Postgres?


zaneselvans commented 4 years ago

I'm not sure why this isn't working, but it seems like it's not a quick fix, so you'll probably want to work with the SQLite DB. Right now we don't have any plans to support using PostgreSQL. Too many users were unable to set up a separate database server, and SQLite is supported directly through the Python standard library, so it doesn't depend on any external software setup.

briannacote commented 4 years ago

Unfortunately, from our side, it's not a simple switch to do this either, given the way we built things up around Postgres. That's why I was pushing in this direction, to see if a Postgres database could still be created. Usually workflow upgrades offer some way to accommodate previous setups for early adopters, but that is not the case here. Thanks for the help that was provided. If there is a way to support Postgres in the near future, I'll put a flag in for that.

zaneselvans commented 4 years ago

It may also be an issue with the way the datapackage-to-DB loading works differently on SQLite and PostgreSQL -- in theory it's certainly doable, we just don't have the resources to try and manage both systems.

@roll do you have any insight into why this isn't "just working"? Is the SQLite / PostgreSQL handling of foreign keys different in the datapackage storage backend?

It may also be possible to dump an entire SQLite database into Postgres directly, but that's getting pretty indirect.
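One such indirect route, sketched with the standard library: sqlite3's iterdump() emits the whole database as SQL statements, though the SQLite-flavored DDL generally needs editing (or a migration tool such as pgloader) before PostgreSQL will accept it.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE plants (id INTEGER PRIMARY KEY, name TEXT)")
src.execute("INSERT INTO plants (name) VALUES ('Test Plant')")
src.commit()

# iterdump() yields the schema and data as SQL, one statement at a time.
dump = "\n".join(src.iterdump())
assert "CREATE TABLE plants" in dump
assert "Test Plant" in dump
```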

zaneselvans commented 4 years ago

@briannacote is it possible that the old postgresql database is still hanging around, partially populated, and it's finding duplicate entries because there's more than one copy of the data in the table the foreign key is referring to?