Closed ezwelty closed 2 years ago
There's a subset of the tables which end up with an auto-incrementing surrogate key because they don't have natural primary keys. The one that jumps to mind right now is fuel_receipts_costs_eia923
which I notice isn't on this list, so that's interesting.
The autoincrementing surrogate keys are currently implemented inelegantly, in the process of loading the data packages into SQLite. Previously it was handled automatically by PostgreSQL. We briefly considered adding the autoincrementing surrogate keys inside of the ETL process, but didn't go down that road. Maybe it makes sense to revisit that now?
We need to decide what we're doing with all the IPM tables going forward. We've been talking about deprecating, and this would probably be the time to commit to that.
ludington
or sometimes lundington
is fucking it up for everybody.Side note: there are many foreign key relationships which are not enumerated in the current version of the metadata, which we should absolutely add. They include but probably aren't limited to the following:
the tables which we currently use an auto-incremented id
column for should are the ones in the autoincrement
dictionary in the meta/datapkg/datapackage.json
. I agree w/ zane that a few of these tables should be converted to have true PKs, but for the rest of them their PK should just be the id
column. And truly if you needed a quick implementation for these tables that need some munging to use natural PKs, the PK right now is effectively the id
column
There's a subset of the tables which end up with an auto-incrementing surrogate key because they don't have natural primary keys. The one that jumps to mind right now is
fuel_receipts_costs_eia923
which I notice isn't on this list, so that's interesting.
fuel_receipts_costs_eia923
is unique in that it shows up with "fields": [ "id", ...] and "primaryKey": ["id"]
in datapackage.json
. I presume this was a mistake?
@cmgosnell My understanding is that the id
surrogate column only comes into existence upon database import, and thus after Python ETL and datapackage creation?
No natural primary key and we need a surrogate key
Primary keys are not required. What is the purpose, for PUDL, of using auto-incrementing surrogates as primary keys? They neither ensure row uniqueness nor can they be used for table joins. I feel like the default should either be no primary key or a primary key composed of every field in the table. SQLITE uses ROWID in the absence of an explicit primary key (https://www.sqlite.org/lang_createtable.html#rowid).
If auto-incrementing keys are needed, it seems worth populating these in the ETL so that the Python, data package, and database outputs all match in form and content (at the expense of a larger dataset). This also gives you the option of using the surrogates in foreign keys if you ever dynamically deduce relationships involving these tables.
@zaneselvans I've added the keys you listed in "EIA/EPA Tables that actually do have natural primary keys". I'll address foreign keys in a separate issue.
right now the id
cols get added in the load step. But we could certainly move that to happen in the transform step instead.
It does this by looking at the metadata, which really doesn't happen in the E/T steps right now so the load step felt like a logical place to do it... but that is going to change with this new harvesting process.
@cmgosnell It seems cleaner to export data packages with id
as column and primary key (and dropping the autoincrement
attribute from the package's datapackage.json
), if that is what is also being used in database form. I can amend the harvest process to set the primary key fields as index or, if not provided, rename the default pandas integer index to id
. Then, either way, you just df.to_csv()
.
@zaneselvans All the primary keys in your "EIA/EPA Tables that actually do have natural primary keys" are listed in autoincrement
. @cmgosnell Can you confirm that these tables are in fact ready for their natural primary key?
IIRC @ezwelty I think the fuel_receipts_costs_eia923
table was a special case, where we had to add the surrogate key during the transform step because of the way that we were ripping the coalmines_eia923
table out of it. I could be mistaken on that though.
When I made that list of tables with (apparent) natural keys I was pulling the tables from the DB directly and verifying their uniqueness so... hopefully they really do work.
Again IIRC, a lot of these shenanigans came up because PostgreSQL and SQLite deal with tables lacking explicit primary keys differently, and we were trying to preserve the ability for the datapackages to get loaded into either DB, when we were transitioning away from loading into PostgreSQL. Postgres automatically added the surrogate keys, but SQLite did something different, and it broke stuff. But I don't recall what stuff.
@zaneselvans What about transmission_single_epaipm
and transmission_joint_epaipm
? Do they have a natural primary key?
A review of the resource metadata in
src/pudl/package_data/meta/datapkg/datapackage.json
reveals the following resources are missing primary keys, an attribute required by the new harvest process (#806).Do additional resource primary keys exist elsewhere than
src/pudl/package_data/meta/datapkg/datapackage.json
?Our task is to assign a primary key for each of these resources, and if that isn't possible, determine if and how the harvest process needs to be amended.