I don't think there's any harvesting that needs to be done on these tables.
There are some extremely useful FERC 714 output tables which we should load into the DB alongside the `PudlTabl` outputs; see `pudl.output.ferc714.Respondents` and maybe also `pudl.analysis.service_territory`. Right now only `respondent_id_ferc714` and `dhpa_ferc714` (and maybe a table linking FERC 714 to EIA) are getting transformed, so to load any of the other tables into the DB in their extracted state, we'll need to define metadata for them. IIRC there are not a crazy number of columns though.
Not sure what the issue is with `utc_datetime` failing its `CHECK` constraints... I've verified it's `NOT NULL`, and there are also no duplicate primary key values. The table `CREATE` statement is below (minus a very long list of timezones...):
```sql
CREATE TABLE demand_hourly_pa_ferc714 (
    respondent_id_ferc714 INTEGER NOT NULL CHECK (TYPEOF("respondent_id_ferc714") = 'integer'),
    report_date DATE CHECK ("report_date" IS DATE("report_date")),
    utc_datetime DATETIME NOT NULL CHECK ("utc_datetime" IS DATETIME("utc_datetime")),
    timezone VARCHAR(32) CHECK ("timezone" IS NULL OR TYPEOF("timezone") = 'text') CHECK ("timezone" IN (...)),
    demand_mwh FLOAT CHECK ("demand_mwh" IS NULL OR TYPEOF("demand_mwh") = 'real'),
    PRIMARY KEY (respondent_id_ferc714, utc_datetime),
    FOREIGN KEY(respondent_id_ferc714) REFERENCES respondent_id_ferc714 (respondent_id_ferc714)
);
```
I think this might be the first time we've tried to load a `DATETIME` column into the DB (since EPA CEMS goes straight to Parquet), so maybe there's an issue with the check? I think this check formats the `utc_datetime` (which is stored as `TEXT` by SQLite, using ISO-8601) and compares it to the contents of the `utc_datetime` column, requiring them to be the same? Maybe there's a rounding error / different number of sigfigs? `DATETIME()` formats without fractional seconds.
CHECK ("utc_datetime" IS DATETIME("utc_datetime"))
All other date and time functions can be expressed in terms of `strftime()`:

| Function | Equivalent (or nearly) `strftime()` |
|---|---|
| `date(...)` | `strftime('%Y-%m-%d', ...)` |
| `time(...)` | `strftime('%H:%M:%S', ...)` |
| `datetime(...)` | `strftime('%Y-%m-%d %H:%M:%S', ...)` |
| `julianday(...)` | `strftime('%J', ...)` (numeric return) |
| `unixepoch(...)` | `strftime('%s', ...)` (numeric return) |
I commented out the bit of code that adds:

```sql
CHECK ("utc_datetime" IS DATETIME("utc_datetime"))
```

and it was able to load into the DB, so I think that was the issue. Looking at the error message from SQLite, it shows 6 trailing zeroes after the decimal in the string representation of the datetime, which wouldn't show up in the `DATETIME()` formatted string, so they aren't equal. I tried changing the type from `datetime64[ns]` to `datetime64[s]` but it didn't change the behavior.
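One possible workaround (just a sketch, not tested against the real tables): render the values as whole-second ISO-8601 strings on the pandas side, so the `TEXT` that SQLite stores round-trips through `DATETIME()` unchanged:

```python
import pandas as pd

# Stand-in frame for the real FERC 714 demand table.
df = pd.DataFrame(
    {"utc_datetime": pd.to_datetime(["2020-01-01 00:00:00.000000"])}
)

# Since switching to datetime64[s] didn't help, format the column as a
# whole-second string explicitly; this matches DATETIME()'s output format.
df["utc_datetime"] = (
    df["utc_datetime"].dt.floor("s").dt.strftime("%Y-%m-%d %H:%M:%S")
)
```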
Also... it took 20 minutes to load ~15M records, which doesn't seem right. The DB grew by 1.7 GB, which also seems like kind of a lot. There are only a few columns in this table!
It looks like the extreme slowness was due to the inclusion of all 500+ recognized timezones in the ENUM constraint on the `timezone` column. If it's restricted to just the 6 that show up in `ferc714`, the load time drops to a little under 3 minutes, which is much more reasonable. But it also blows up memory usage to like 15 GB because it's trying to load all the data in one go. (We should fix this for the `plants_eia860` table too!)
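For reference, one way the restricted constraint could be derived from the data itself instead of the full timezone list (function and column names here are illustrative):

```python
def restricted_timezone_check(df, col="timezone"):
    """Build a CHECK clause covering only the timezones actually observed."""
    observed = sorted(df[col].dropna().unique())
    values = ", ".join(f"'{tz}'" for tz in observed)
    return f'CHECK ("{col}" IN ({values}))'
```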
I added `chunksize=1_000_000` to the `df.to_sql()` call and while it didn't speed the loading up, it kept memory usage down to ~3 GB.
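For context, a minimal sketch of the chunked write (the in-memory DB and tiny frame are stand-ins for `pudl.sqlite` and the real ~15M-row table):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"respondent_id_ferc714": [1], "demand_mwh": [42.0]})
con = sqlite3.connect(":memory:")

# chunksize bounds peak memory: pandas issues INSERTs in 1M-row batches
# rather than materializing parameters for every row at once.
df.to_sql(
    "demand_hourly_pa_ferc714",
    con,
    if_exists="append",
    index=False,
    chunksize=1_000_000,
)
```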
Even with the reduced set of timezones to check, the new table still takes up ~1.7 GB of space.
Having one table that takes 2-3 minutes to load also pretty much guarantees that many other assets will encounter a locked SQLite DB in every run, so addressing #2417 seems urgent.
Maybe we should add a type conversion layer into the `SQLiteIOManager` that can do things like: convert `datetime64[ns]` columns into an appropriately formatted string for SQLite (whole seconds).
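A rough sketch of what that conversion layer might look like (the function name and hook point are hypothetical, not the actual `SQLiteIOManager` API):

```python
import pandas as pd

def coerce_for_sqlite(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pre-load hook coercing dtypes SQLite can't round-trip."""
    out = df.copy()
    for col in out.columns:
        # Render datetime columns as whole-second ISO-8601 strings so they
        # satisfy CHECK ("col" IS DATETIME("col")) after loading.
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime("%Y-%m-%d %H:%M:%S")
    return out
```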
Create dagster assets for the raw and transformed FERC 714 data so the partially cleaned tables can be accessed in the `pudl.sqlite` DB. Once the ETL has been converted to dagster (#2104) we can start to persist partially cleaned tables from new data sources.
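For illustration, a minimal dagster asset pair for one of these tables might look like this (names and bodies are hypothetical stand-ins, not PUDL's actual extract/transform code):

```python
import pandas as pd
from dagster import asset

@asset
def raw_demand_hourly_pa_ferc714() -> pd.DataFrame:
    """Hypothetical raw asset; in reality this would call the FERC 714 extractor."""
    return pd.DataFrame({"respondent_id_ferc714": [1], "demand_mwh": [42.0]})

@asset
def demand_hourly_pa_ferc714(raw_demand_hourly_pa_ferc714: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformed asset, persisted to pudl.sqlite by the IO manager."""
    return raw_demand_hourly_pa_ferc714
```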
Out of Scope

To load the raw `ferc714` tables in the database we'll have to create resource metadata for the raw tables.

Resulting Issues

- #2431