catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Convert FERC-714 ETL to use Dagster #2266

Closed bendnorman closed 1 year ago

bendnorman commented 1 year ago

Create Dagster assets for the raw and transformed FERC-714 data so the partially cleaned tables can be accessed in the `pudl.sqlite` DB. Once the ETL has been converted to Dagster (#2104), we can start persisting partially cleaned tables from new data sources. (A hedged sketch of the asset structure appears after the checklist below.)

- [x] Create a `@multi_asset` in `pudl.extract.ferc714` that extracts the raw data.
- [x] Convert `pudl.transform.ferc714.respondent_id_ferc714()` into an `@asset`
- [x] Convert `pudl.transform.ferc714.demand_hourly_pa_ferc714()` into an `@asset`
- [x] Ensure `respondent_id_ferc714` loads into SQLite
- [x] Ensure `demand_hourly_pa_ferc714` loads into SQLite.
- [x] Remove the `PudlTabl.etl_ferc714()` method.
- [x] Remove the `PudlTabl` accessor methods for the truly raw tables.
- [x] Update `PudlTabl` methods for finished tables to read from the DB.
- [x] Verify that `pudl.output.ferc714` methods still work.
- [x] Verify that `state_demand` still works.
- [x] Update FERC-714 output tests to use the DB.
- [x] Integrate FERC-714 into our overall system of ETL settings.
- [x] Update README and RTD documentation to reflect availability of FERC-714.
- [x] Update release notes.
- [x] Remove "(preliminary)" qualifiers next to EIA-861/FERC-714 in docs Intro.
- [x] Add description of Respondent ID table in FERC-714 data source docs page.
- [x] Remove `extract_ferc714()` warning about experimental status.
- [x] Check whether `convert_dtypes()` is doing anything useful in `PudlTabl` outputs.
- [x] Reduce memory usage for reading `demand_hourly_pa_ferc714` to allow CI to run.
- [x] Sort outs and returned Outputs by key in FERC-714 extract multi-asset.
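
A minimal sketch of what this conversion might look like, using placeholder extraction logic and an assumed IO manager key (the real implementations live in `pudl.extract.ferc714` and `pudl.transform.ferc714`):

```python
import pandas as pd
from dagster import AssetOut, Output, asset, multi_asset

RAW_TABLES = ["respondent_id_ferc714", "demand_hourly_pa_ferc714"]

@multi_asset(outs={f"raw_{name}": AssetOut() for name in sorted(RAW_TABLES)})
def extract_ferc714():
    """Extract the raw FERC-714 tables, yielding one Output per table."""
    raw_dfs = {name: pd.DataFrame() for name in RAW_TABLES}  # placeholder extraction
    # Yield Outputs sorted by key, per the last checklist item above.
    for name in sorted(raw_dfs):
        yield Output(raw_dfs[name], output_name=f"raw_{name}")

@asset(io_manager_key="pudl_sqlite_io_manager")  # key name is an assumption
def respondent_id_ferc714(raw_respondent_id_ferc714: pd.DataFrame) -> pd.DataFrame:
    """Transform the respondent table; the IO manager persists it to pudl.sqlite."""
    return raw_respondent_id_ferc714  # real cleaning logic elided
```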

Out of Scope

Resulting Issues

zaneselvans commented 1 year ago

I don't think there's any harvesting that needs to be done on these tables.

There are some extremely useful FERC 714 output tables which we should load into the DB alongside the `PudlTabl` outputs; see `pudl.output.ferc714.Respondents` and maybe also `pudl.analysis.service_territory`.

Right now only `respondent_id_ferc714` and `demand_hourly_pa_ferc714` (and maybe a table linking FERC 714 to EIA) are getting transformed, so to load any of the other tables into the DB in their extracted state, we'll need to define metadata for them. IIRC there aren't a crazy number of columns, though.
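
For illustration, a hedged sketch of the shape such a metadata entry might take, loosely following PUDL's resource-metadata pattern; the table name and fields here are placeholders, not actual FERC-714 columns:

```python
# Illustrative resource metadata entry; the real entries live in
# pudl.metadata.resources, and these field names are placeholders.
RESOURCE_METADATA = {
    "some_raw_ferc714_table": {
        "description": "A FERC-714 table loaded in its extracted state.",
        "schema": {
            "fields": ["respondent_id_ferc714", "report_year", "some_value"],
            "primary_key": ["respondent_id_ferc714", "report_year"],
        },
        "sources": ["ferc714"],
        "etl_group": "ferc714",
        "field_namespace": "ferc714",
    },
}
```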

zaneselvans commented 1 year ago

Not sure what the issue is with utc_datetime failing its CHECK constraints... I've verified it's NOT NULL, and there are also no duplicate primary key values.

```
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) CHECK constraint failed: utc_datetime
[SQL: INSERT INTO demand_hourly_pa_ferc714 (respondent_id_ferc714, report_date, utc_datetime, timezone, demand_mwh) VALUES (?, ?, ?, ?, ?)]
[parameters: ((206, '2006-01-01', '2006-01-01 08:00:00.000000', 'America/Los_Angeles', 204.0),
(206, '2006-01-01', '2006-01-02 08:00:00.000000', 'America/Los_Angeles', 193.0),
(206, '2006-01-01', '2006-01-03 08:00:00.000000', 'America/Los_Angeles', 218.0),
(206, '2006-01-01', '2006-01-04 08:00:00.000000', 'America/Los_Angeles', 230.0),
(206, '2006-01-01', '2006-01-05 08:00:00.000000', 'America/Los_Angeles', 234.0),
(206, '2006-01-01', '2006-01-06 08:00:00.000000', 'America/Los_Angeles', 235.0),
(206, '2006-01-01', '2006-01-07 08:00:00.000000', 'America/Los_Angeles', 229.0),
(206, '2006-01-01', '2006-01-08 08:00:00.000000', 'America/Los_Angeles', 217.0)
... displaying 10 of 100000 total bound parameter sets ...
(251, '2007-01-01', '2007-12-20 05:00:00.000000', 'America/New_York', 3214.0),
(251, '2007-01-01', '2007-12-21 05:00:00.000000', 'America/New_York', 2840.0))]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
```

The table CREATE statement is below (minus a very long list of timezones...)

```sql
CREATE TABLE demand_hourly_pa_ferc714 (
    respondent_id_ferc714 INTEGER NOT NULL CHECK (TYPEOF("respondent_id_ferc714") = 'integer'),
    report_date DATE CHECK ("report_date" IS DATE("report_date")),
    utc_datetime DATETIME NOT NULL CHECK ("utc_datetime" IS DATETIME("utc_datetime")),
    timezone VARCHAR(32) CHECK ("timezone" IS NULL OR TYPEOF("timezone") = 'text') CHECK ("timezone" IN (...)),
    demand_mwh FLOAT CHECK ("demand_mwh" IS NULL OR TYPEOF("demand_mwh") = 'real'),
    PRIMARY KEY (respondent_id_ferc714, utc_datetime),
    FOREIGN KEY(respondent_id_ferc714) REFERENCES respondent_id_ferc714 (respondent_id_ferc714)
);
```

I think this might be the first time we've tried to load a DATETIME column into the DB (since EPA CEMS goes straight to Parquet) so maybe there's an issue with the check? I think this check formats the utc_datetime (which is stored as TEXT by SQLite using ISO-8601) and compares it to the contents of the utc_datetime column, requiring them to be the same? Maybe there's a rounding error / different number of sigfigs? DATETIME() formats without fractional seconds.

CHECK ("utc_datetime" IS DATETIME("utc_datetime"))

From the SQLite docs:

All other date and time functions can be expressed in terms of strftime():

| Function | Equivalent (or nearly) strftime() |
|---|---|
| date(...) | strftime('%Y-%m-%d', ...) |
| time(...) | strftime('%H:%M:%S', ...) |
| datetime(...) | strftime('%Y-%m-%d %H:%M:%S', ...) |
| julianday(...) | strftime('%J', ...) -- (numeric return) |
| unixepoch(...) | strftime('%s', ...) -- (numeric return) |

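That hypothesis is easy to check directly (a minimal sqlite3 sketch; the timestamp literal mirrors the bound parameters in the error above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
ts = "2006-01-01 08:00:00.000000"  # string form seen in the bound parameters
print(conn.execute("SELECT DATETIME(?)", (ts,)).fetchone())
# ('2006-01-01 08:00:00',) -- fractional seconds stripped
print(conn.execute("SELECT ? IS DATETIME(?)", (ts, ts)).fetchone())
# (0,) -- the strings differ, so the CHECK constraint fails
```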
zaneselvans commented 1 year ago

I commented out the bit of code that adds:

CHECK ("utc_datetime" IS DATETIME("utc_datetime"))

and it was able to load into the DB, so I think that was the issue. Looking at the error message from SQLite, it shows six trailing zeroes after the decimal in the string representation of the datetime, which wouldn't show up in the `DATETIME()` formatted string, so they aren't equal. I tried changing the dtype from `datetime64[ns]` to `datetime64[s]`, but it didn't change the behavior.

Also... it took 20 minutes to load ~15M records, which doesn't seem right. The DB grew by 1.7 GB, which also seems like kind of a lot. There are only a few columns in this table!

It looks like the extreme slowness was due to the inclusion of all 500+ recognized timezones in the ENUM constraint on the timezone column. If it's restricted to just the 6 that show up in ferc714 the load time drops to a little under 3 minutes, which is much more reasonable. But it also blows up memory usage to like 15GB because it's trying to load all the data in one go. (We should fix this for the plants_eia860 table too!)
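
As a sketch (an assumption about how the narrowed constraint could be built, not the actual PUDL change), the allowed values could be derived from the data itself rather than the full timezone registry:

```python
# Build the CHECK from the timezones actually present, instead of
# all 500+ recognized timezones. `demand_hourly_pa_ferc714` is the
# transformed dataframe discussed above.
observed_tzs = sorted(demand_hourly_pa_ferc714["timezone"].dropna().unique())
tz_list = ", ".join(f"'{tz}'" for tz in observed_tzs)
check_sql = f'CHECK ("timezone" IN ({tz_list}))'
```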

I added `chunksize=1_000_000` to the `df.to_sql()` call, and while it didn't speed the loading up, it kept memory usage down to ~3GB.
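
A minimal sketch of that chunked load, assuming a direct `to_sql()` call and a local `pudl.sqlite` engine rather than the actual call site:

```python
from sqlalchemy import create_engine

engine = create_engine("sqlite:///pudl.sqlite")  # path is an assumption
demand_hourly_pa_ferc714.to_sql(
    "demand_hourly_pa_ferc714",
    engine,
    if_exists="append",
    index=False,
    chunksize=1_000_000,  # write in 1M-row batches to cap peak memory (~3GB)
)
```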

Even with the reduced set of timezones to check, the new table still takes up ~1.7 GB of space.

Having one table that takes 2-3 minutes to load also pretty much guarantees that many other assets will encounter a locked SQLite DB in every run, so addressing #2417 seems urgent.

zaneselvans commented 1 year ago

Maybe we should add a type conversion layer to the SQLiteIOManager that can do things like:
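
As a hypothetical sketch of such a layer (not a concrete design from this thread), datetime columns could be rendered as ISO-8601 strings without fractional seconds, sidestepping the `DATETIME()` mismatch above:

```python
import pandas as pd

def coerce_for_sqlite(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce dtypes into forms that survive the SQLite round trip."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            # Match DATETIME()'s '%Y-%m-%d %H:%M:%S' output so
            # CHECK ("col" IS DATETIME("col")) can pass.
            df[col] = df[col].dt.strftime("%Y-%m-%d %H:%M:%S")
    return df
```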