Closed zaneselvans closed 2 years ago
Looking at the schemata for generation_fuel_eia923
and boiler_fuel_eia923
using
sqlite-utils schema pudl.sqlite <table>
I really don't see a difference with respect to how their primary keys are set up... i.e. there are none, so they should both be getting rowid
columns, but only boiler_fuel_eia923
is based on my SELECT
statements.
CREATE TABLE generation_fuel_eia923 (
plant_id_eia INTEGER,
report_date DATE,
nuclear_unit_id INTEGER,
fuel_type TEXT,
fuel_type_code_pudl TEXT,
fuel_type_code_aer TEXT,
prime_mover_code TEXT,
fuel_consumed_units FLOAT,
fuel_consumed_for_electricity_units FLOAT,
fuel_mmbtu_per_unit FLOAT,
fuel_consumed_mmbtu FLOAT,
fuel_consumed_for_electricity_mmbtu FLOAT,
net_generation_mwh FLOAT,
FOREIGN KEY(fuel_type_code_aer) REFERENCES fuel_type_aer_eia923 (abbr),
FOREIGN KEY(fuel_type) REFERENCES fuel_type_eia923 (abbr),
FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date),
FOREIGN KEY(prime_mover_code) REFERENCES prime_movers_eia923 (abbr)
)
CREATE TABLE boiler_fuel_eia923 (
plant_id_eia INTEGER,
boiler_id TEXT,
fuel_type_code TEXT,
fuel_type_code_pudl TEXT,
report_date DATE,
fuel_consumed_units FLOAT,
fuel_mmbtu_per_unit FLOAT,
sulfur_content_pct FLOAT,
ash_content_pct FLOAT,
FOREIGN KEY(plant_id_eia, boiler_id) REFERENCES boilers_entity_eia (plant_id_eia, boiler_id),
FOREIGN KEY(fuel_type_code) REFERENCES fuel_type_eia923 (abbr),
FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date)
)
Doing something similar but pulling from the SQLAlchemy MetaData object that was used to create the database, boy they sure do look the same. Why is only one of them getting a rowid
?
from pudl.metadata.classes import Resource, Package
from pudl.metadata.resources import RESOURCE_METADATA
check_table = "generation_fuel_eia923"
pudl_meta = Package(name="pudl", resources=[Resource.from_id(r) for r in RESOURCE_METADATA]).to_sql()
check_meta = pudl_meta.tables[check_table]
print(sa.schema.CreateTable(pudl_meta.tables["generation_fuel_eia923"]).compile(pudl_engine))
CREATE TABLE generation_fuel_eia923 (
plant_id_eia INTEGER,
report_date DATE,
nuclear_unit_id INTEGER,
fuel_type TEXT,
fuel_type_code_pudl TEXT,
fuel_type_code_aer TEXT,
prime_mover_code TEXT,
fuel_consumed_units FLOAT,
fuel_consumed_for_electricity_units FLOAT,
fuel_mmbtu_per_unit FLOAT,
fuel_consumed_mmbtu FLOAT,
fuel_consumed_for_electricity_mmbtu FLOAT,
net_generation_mwh FLOAT,
FOREIGN KEY(fuel_type_code_aer) REFERENCES fuel_type_aer_eia923 (abbr),
FOREIGN KEY(fuel_type) REFERENCES fuel_type_eia923 (abbr),
FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date),
FOREIGN KEY(prime_mover_code) REFERENCES prime_movers_eia923 (abbr)
)
print(sa.schema.CreateTable(pudl_meta.tables["boiler_fuel_eia923"]).compile(pudl_engine))
CREATE TABLE boiler_fuel_eia923 (
plant_id_eia INTEGER,
boiler_id TEXT,
fuel_type_code TEXT,
fuel_type_code_pudl TEXT,
report_date DATE,
fuel_consumed_units FLOAT,
fuel_mmbtu_per_unit FLOAT,
sulfur_content_pct FLOAT,
ash_content_pct FLOAT,
FOREIGN KEY(plant_id_eia, boiler_id) REFERENCES boilers_entity_eia (plant_id_eia, boiler_id),
FOREIGN KEY(fuel_type_code) REFERENCES fuel_type_eia923 (abbr),
FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date)
)
Maybe I am missing something but looking at the WITHOUT ROWID
docs from SQLite it sounds like it's not supposed to be possible to create a table that has no rowid
and also has no PRIMARY KEY
sooooo... still very confused.
I also note that the new DB that is having this problem is exactly the same size as the previous DB, which didn't have this issue so the data is all in there somewhere, lost, and impossible to query apparently!
-rw-r--r-- 1 zane zane 616669184 Aug 14 20:04 pudl-2021-08-14.sqlite
-rw-r--r-- 1 zane zane 616669184 Sep 7 00:34 pudl.sqlite
Now if I cut-and-paste the CREATE
statements into SQLite, I get empty tables from which I can't select any rowid
values but maybe that's just because they're empty. Creating these tables using the above schemas and inserting a row into each of them interactively works fine. They both end up with an automatically generated rowid
column which serves as a pseudo key for the table.
The empty PrimaryKeyConstraint
seems to be the sqlalchemy
default:
import sqlalchemy as sa
t = sa.Table('table', sa.MetaData(), sa.Column('x'))
t.constraints
# {PrimaryKeyConstraint()}
As for this bizarre and different behavior between different tables lacking primary keys, I'm really not sure. You say that creating and loading data into SQLite is working interactively. So perhaps the problem is in the connection between Python and the database. Can you provide example code that reproduces the diverging behaviors you are describing? (i.e. "When data is inserted into them it seems to vanish")
It turns out this was happening because no data was being written to the tables at all. I messed up the parsing of the ETL parameters and was accidentally excluding the empty tables from ever being written to. And the rowid
field is only created once you've actually inserted rows into the table so... that's why it wasn't showing up. I've fixed the parameter parsing and now it seems to be acting normally again!
The "vanishing data" thing still seems odd to me though, like why was this DB exactly the same size as a DB I had output 3 weeks ago which worked fine? Seems very strange. I must be misunderstanding something.
PUDL DB tables which lack a natural primary key, and thus have no primary key declared in the new metadata structures, end up not having an implicit primary key either when they are created in SQLite. SQLite's normal behavior for a table with no declared primary key is to create a column called
rowid
which is an auto-incrementing integer pseudokey. For some reason that's not happening. I notice that in the table metadata for tables with no declared primary key there's still an explicitPrimaryKey()
(Null args):When data is inserted into them it seems to vanish, so several tables end up empty after the ETL. There are several different observed behaviors, including:
Has a real
PrimaryKeyConstraint
, and the table is full (as expected 😀):generation_eia923
Empty
PrimaryKeyConstraint()
, but the table is full (weird! 👽):boiler_fuel_eia923
(hasrowid
)Empty
PrimaryKeyConstraint()
, and no data (expected but... 😭):fuel_receipts_costs_eia923
(norowid
)generation_fuel_eia923
(norowid
)To see whether a table has a
rowid
I've just been doing:So it seems like we need to make all of the tables that lack an explicit primary key behave like
boiler_fuel_eia923
. Not seeing what's different about it immediately though. In particular it seems identical togeneration_fuel_eia923
in that I commented out their explicit natural primary keys in thepudl.metadata.resources
module because we need to update the transform step to make them unique (See: #851 and #852) so I assume the issue is not in theRESOURCE_METADATA
Looking at the definition of
Resource.to_sql()
I'm also not seeing what would be different between these two resources. @ezwelty is there anything that's obvious to you here?