Hey there. We haven't yet started integrating the LMP data, but it is part of the Sloan grant that we are currently working on, which runs through April of 2020, so we have to have some version of it integrated by then.
We'll start by looking at the work that Patrick Brown at MIT has done, as well as the PyISO project from WattTime. I believe Patrick was able to get close to a decade of data compiled. We've been assuming that we'll need to access the data directly through whatever web interface or API the various ISOs provide.
If you did need to stick with PostgreSQL, that should also be doable -- the script which takes the data packages and populates the SQLite database ultimately relies on SQLAlchemy, which uses relatively similar interfaces for a variety of database back ends, and I think PostgreSQL's SQL is close to a superset of SQLite's... so it might take some minor tweaks, but you could probably create a `datapkg_to_postgres` script based on the `datapkg_to_sqlite` script without too much pain.
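To be clear about what I mean, here's a minimal, untested sketch of such a script, using the `datapackage` library the same way `datapkg_to_sqlite` does -- the connection URL and the package path are placeholders:

```python
# Hypothetical datapkg_to_postgres sketch -- untested; URL and paths are placeholders.
import datapackage
import sqlalchemy as sa

# Point at a PostgreSQL server instead of an SQLite file:
pg_engine = sa.create_engine("postgresql://user:password@localhost:5432/pudl")

# Load the flattened data package produced by the PUDL ETL:
pkg = datapackage.Package("/path/to/datapackage/pudl-all/datapackage.json")

# The same save call datapkg_to_sqlite makes, just with a PostgreSQL engine:
pkg.save(storage="sql", engine=pg_engine)
```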
Hi Zane, thank you for this information and the fast response -- it's good to know for our planning.

I will check out the data packages you're talking about. Since we already have things built on top of a Postgres setup, that route would be easiest for us, but we will take the time to evaluate what's best and decide whether we should change anything as well.
Much appreciated! I'll go ahead and close this out.
As a follow-up from yesterday, I did take a stab at creating a `datapkg_to_postgres` script.
Under `pudl_settings`, I created the last entry below (`pudl_db_postgres`):
```python
{'pudl_in': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2',
'data_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/data',
'settings_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/settings',
'pudl_out': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2',
'sqlite_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/sqlite',
'parquet_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/parquet',
'datapackage_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/datapackage',
'notebook_dir': '/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/notebook',
'ferc1_db': 'sqlite:////Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/sqlite/ferc1.sqlite',
'pudl_db': 'sqlite:////Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2/sqlite/pudl.sqlite',
'pudl_db_postgres': 'postgresql://user:password@localhost:5432/postgres'}
```
I had to do a `pip install psycopg2`.
I ran the following command: `datapkg_to_sqlite --pkg_bundle_name pudl-example`.
And received the following error:
```
(pudl2) admins-MacBook-Pro:pudl2 briannacote$ datapkg_to_sqlite --pkg_bundle_name pudl-example
2019-09-20 14:44:22 [ INFO] pudl:113 Determining PUDL data management environment.
2019-09-20 14:44:22 [ INFO] pudl:116 pudl_in=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 14:44:22 [ INFO] pudl:117 pudl_out=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 14:44:22 [ INFO] pudl:121 Flattening datapackages within pudl-example.
2019-09-20 14:44:22 [ INFO] pudl.convert.flatten_datapkgs:190 Checking for matching ETL parameters across data packages
2019-09-20 14:44:23 [ INFO] pudl.load.metadata:524 Validating the data package...
2019-09-20 14:44:27 [ INFO] pudl.load.metadata:530 Congrats! You made a valid data package!
2019-09-20 14:44:27 [ INFO] pudl:128 Converting flattened datapackage into an SQLite database.
2019-09-20 14:44:27 [ INFO] pudl.convert.datapkg_to_sqlite:49 Dropping the current PUDL DB, if it exists.
Traceback (most recent call last):
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
cursor, statement, parameters, context
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.ActiveSqlTransaction: VACUUM cannot run inside a transaction block
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/briannacote/anaconda3/envs/pudl2/bin/datapkg_to_sqlite", line 10, in <module>
sys.exit(main())
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 132, in main
pkg_name='pudl-all')
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 52, in pkg_to_sqlite_db
pudl.helpers.drop_tables(pudl_engine)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/helpers.py", line 746, in drop_tables
conn.execute("VACUUM")
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 982, in execute
return self._execute_text(object_, multiparams, params)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1155, in _execute_text
parameters,
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
e, statement, parameters, cursor, context
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
raise value.with_traceback(tb)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
cursor, statement, parameters, context
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.InternalError: (psycopg2.errors.ActiveSqlTransaction) VACUUM cannot run inside a transaction block
[SQL: VACUUM]
(Background on this error at: http://sqlalche.me/e/2j85)
```
I do have a screenshot attached pointing out the changes made. I can send any file diffs if that is helpful.
Some basic googling seems to point to the transaction isolation level being the issue when dropping a database. I did try a few things around this (setting it to zero, for instance) but didn't get much further. Am I on the right path here so far? Do you have any insight into the error above as well?
I got a little farther on this to let you know. I added:
```python
pudl_engine = sa.create_engine(
    pudl_settings['pudl_db_postgres'],
    isolation_level="AUTOCOMMIT",
)
```
I also updated this back to `storage='sql'`:

```python
pkg.save(storage='sql', engine=pudl_engine, merge_groups=True,
         autoincrement=autoincrement)
```
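For what it's worth, I believe the same thing can also be done on a single connection rather than the whole engine, if autocommit everywhere turns out to be too blunt (untested sketch):

```python
# Untested alternative: autocommit on just the connection that runs VACUUM,
# leaving the engine's default transactional behavior alone.
with pudl_engine.connect() as conn:
    conn = conn.execution_options(isolation_level="AUTOCOMMIT")
    conn.execute("VACUUM")
```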
The new error I received is here:
```
(pudl2) admins-MacBook-Pro:pudl2 briannacote$ datapkg_to_sqlite --pkg_bundle_name pudl-example
2019-09-20 16:59:25 [ INFO] pudl:110 Determining PUDL data management environment.
2019-09-20 16:59:25 [ INFO] pudl:113 pudl_in=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 16:59:25 [ INFO] pudl:114 pudl_out=/Users/briannacote/Documents/GeneratorDataForWisdomInput/catalyst/pudl2
2019-09-20 16:59:25 [ INFO] pudl:118 Flattening datapackages within pudl-example.
2019-09-20 16:59:25 [ INFO] pudl.convert.flatten_datapkgs:190 Checking for matching ETL parameters across data packages
2019-09-20 16:59:26 [ INFO] pudl.load.metadata:524 Validating the data package...
2019-09-20 16:59:30 [ INFO] pudl.load.metadata:530 Congrats! You made a valid data package!
2019-09-20 16:59:30 [ INFO] pudl:125 Converting flattened datapackage into an SQLite database.
2019-09-20 16:59:30 [ INFO] pudl.convert.datapkg_to_sqlite:48 Dropping the current PUDL DB, if it exists.
2019-09-20 16:59:30 [ INFO] pudl.convert.datapkg_to_sqlite:70 Loading the data package into SQLite.
2019-09-20 16:59:30 [ INFO] pudl.convert.datapkg_to_sqlite:71 If you're loading EPA CEMS, this could take a while.
2019-09-20 16:59:30 [ INFO] pudl.convert.datapkg_to_sqlite:72 It might be a good time to get lunch, or go for a bike ride.
Traceback (most recent call last):
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
cursor, statement, parameters, context
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.InvalidForeignKey: there is no unique constraint matching given keys for referenced table "generators_entity_eia"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/briannacote/anaconda3/envs/pudl2/bin/datapkg_to_sqlite", line 10, in <module>
sys.exit(main())
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 129, in main
pkg_name='pudl-all')
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/pudl/convert/datapkg_to_sqlite.py", line 76, in pkg_to_sqlite_db
autoincrement=autoincrement)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/datapackage/package.py", line 268, in save
storage.create(buckets, schemas, force=True)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/tableschema_sql/storage.py", line 109, in create
self.__metadata.create_all()
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 4294, in create_all
ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1615, in _run_visitor
visitorcallable(self.dialect, self, **kwargs).traverse_single(element)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 132, in traverse_single
return meth(obj, **kw)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/ddl.py", line 781, in visit_metadata
_is_metadata_operation=True,
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 132, in traverse_single
return meth(obj, **kw)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/ddl.py", line 826, in visit_table
include_foreign_key_constraints,
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/sql/ddl.py", line 72, in _execute_on_connection
return connection._execute_ddl(self, multiparams, params)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1050, in _execute_ddl
compiled,
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
e, statement, parameters, cursor, context
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
raise value.with_traceback(tb)
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1249, in _execute_context
cursor, statement, parameters, context
File "/Users/briannacote/anaconda3/envs/pudl2/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InvalidForeignKey) there is no unique constraint matching given keys for referenced table "generators_entity_eia"
[SQL:
CREATE TABLE ownership_eia860 (
id SERIAL NOT NULL,
report_date DATE,
utility_id_eia INTEGER,
plant_id_eia INTEGER,
generator_id TEXT,
operational_status_code TEXT,
owner_utility_id_eia INTEGER,
owner_name TEXT,
owner_state TEXT,
owner_city TEXT,
owner_street_address TEXT,
owner_zip_code TEXT,
fraction_owned NUMERIC,
PRIMARY KEY (id),
FOREIGN KEY(utility_id_eia) REFERENCES utilities_entity_eia (utility_id_eia),
FOREIGN KEY(plant_id_eia) REFERENCES generators_entity_eia (plant_id_eia),
FOREIGN KEY(generator_id) REFERENCES generators_entity_eia (generator_id)
)
]
(Background on this error at: http://sqlalche.me/e/f405)
```
Steps closer. It does create the tables in my Postgres database, and it does load data into some of them. But for other tables you can see there is no data yet (because of the error above).
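To see which tables actually got data, I've just been counting rows; a quick sketch (the connection URL is the one from my settings above):

```python
# Count rows in every table to see which ones actually loaded.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost:5432/postgres")
inspector = sa.inspect(engine)
with engine.connect() as conn:
    for table in inspector.get_table_names():
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').scalar()
        print(f"{table}: {count} rows")
```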
Without looking into it deeply, my guess is this has to do with the `autoincrement` parameter -- PostgreSQL and SQLite create autoincrementing surrogate keys slightly differently (for tables where there's no naturally occurring collection of fields that will uniquely identify every record). We had to give SQLite a list of autoincrementing keys; that's what the `autoincrement` parameter is in the call to `pkg.save()`.

You might try just not including that parameter and see what happens? PostgreSQL is smarter about autoincrementing keys in general. Or it might not work without additional re-jiggering of the datapackage library and/or the metadata in the data package, at which point it could be less trouble to see if the SQLite DB works for you.
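Concretely, that would just mean dropping the keyword from the call you modified, assuming nothing downstream depends on it:

```python
# The same save call, minus the autoincrement argument, letting the
# backend decide how to generate surrogate keys.
pkg.save(storage='sql', engine=pudl_engine, merge_groups=True)
```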
You might also want to check out this page on how to format and highlight code on GitHub.
Removing the `autoincrement` parameter did not make a difference, unfortunately. Worth a try though.
It was certain `foreignKeys` entries in the `datapackage.json` files, for certain tables, that were causing issues. Once those were removed, the process works.

My understanding of the foreign key restriction is that nothing can be inserted into a table unless it references rows that actually exist in the related tables. But this concept is quite new to me and I may be off base.

As a specific example, all of the `foreignKeys` for `generators_eia860` were removed from the `datapackage.json` file. This included removal of:
"foreignKeys": [
{
"fields": "plant_id_eia",
"reference": {
"resource": "generators_entity_eia",
"fields": "plant_id_eia"
}
},
{
"fields": "generator_id",
"reference": {
"resource": "generators_entity_eia",
"fields": "generator_id"
}
}
],
```
Without the foreign key relationships, you won't have constraints which relate the different tables to each other -- in this case you won't know whether the plant and generator IDs in the `generators_eia860` table are valid, and tables that rely on the entity tables to relate to each other may end up disconnected from each other.
Why would this not work with Postgres?
I'm not sure why this isn't working, but it seems like it's not a quick fix, so you'll probably want to work with the SQLite DB. Right now we don't have any plans to support using PostgreSQL. Too many users were unable to set up a separate database server, and SQLite is supported directly through the Python standard library, so it doesn't depend on any external software setup.
Unfortunately, it's not a simple switch on our side either, given the way we built things up around Postgres. That's why I was pushing in this direction, to see whether a Postgres database could still be created. Usually, with workflow upgrades, there are ways to accommodate the previous setups of early adopters, but that's not the case here. Thanks for the help that was provided. If there's a way to support Postgres in the near future... I'll put a flag in for that.
It may also be an issue with the way the datapackage-to-DB loading works differently on SQLite and PostgreSQL -- in theory it's certainly doable, we just don't have the resources to try and manage both systems.
@roll do you have any insight into why this isn't "just working"? Is the SQLite / PostgreSQL handling of foreign keys different on the datapackage storage backend?
It may also be possible to dump an entire SQLite database into Postgres directly, but that's getting pretty indirect.
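A rough, untested sketch of that indirect route with pandas (URLs and paths are placeholders; note this would copy the data only, not the keys, constraints, or indexes):

```python
# Copy every table from the PUDL SQLite DB into PostgreSQL via pandas.
import pandas as pd
import sqlalchemy as sa

sqlite_engine = sa.create_engine("sqlite:////path/to/pudl.sqlite")
pg_engine = sa.create_engine("postgresql://user:password@localhost:5432/pudl")

for name in sa.inspect(sqlite_engine).get_table_names():
    df = pd.read_sql_table(name, sqlite_engine)
    # Data only -- primary/foreign keys and indexes are not carried over.
    df.to_sql(name, pg_engine, if_exists="replace", index=False, chunksize=10000)
```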
@briannacote is it possible that the old postgresql database is still hanging around, partially populated, and it's finding duplicate entries because there's more than one copy of the data in the table the foreign key is referring to?
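One more guess, from the DDL in your traceback: the generated `ownership_eia860` table gets two separate single-column foreign keys into `generators_entity_eia`, but PostgreSQL only lets a foreign key reference columns covered by a unique constraint -- and if that entity table is keyed on the combination of `plant_id_eia` and `generator_id`, neither column is unique on its own. If that's the problem, what PostgreSQL would want is a single composite foreign key. A sketch of the difference in SQLAlchemy terms (my assumption about the keys, not actual PUDL code):

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Assumption: generators_entity_eia is uniquely identified by BOTH columns together.
generators_entity_eia = sa.Table(
    "generators_entity_eia", metadata,
    sa.Column("plant_id_eia", sa.Integer, primary_key=True),
    sa.Column("generator_id", sa.Text, primary_key=True),
)

# What PostgreSQL would accept: one composite foreign key spanning both
# columns, rather than the two single-column FOREIGN KEYs in the generated DDL.
ownership_eia860 = sa.Table(
    "ownership_eia860", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("plant_id_eia", sa.Integer),
    sa.Column("generator_id", sa.Text),
    sa.ForeignKeyConstraint(
        ["plant_id_eia", "generator_id"],
        ["generators_entity_eia.plant_id_eia",
         "generators_entity_eia.generator_id"],
    ),
)
```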
Hello there!
We have a few quick questions about the LMP data you are working on wrangling.

1. Can you share the sources of that data for the various ISOs? It would be great to know. I know it can be quite complicated and different depending on the ISO.
2. Do you know off-hand what the historical availability is for this data? For instance, for PJM through Data Miner 2 at least, it seems LMP data only goes back to 2018 for certain granularities. Having some idea of what might be available will be helpful.
3. Roughly when might the LMP data start to be available through PUDL? I'm sure this information is somewhere on GitHub and I just missed it. Basically we're curious whether it's a 2019 thing or beyond; we aren't looking for a set date or anything.
We'll be checking out the new setup soon (RIP Postgres). I'll probably have questions. But in the meantime, congrats on getting through (or almost through) that big change.
Thank you very much! Bri