catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Bring Census DP1 database into Dagster #2412

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

We use the Census DP1 geodatabase for spatial analyses (e.g. turning FIPS codes into county geometries) and as a source for demographic information (e.g. population by census tract).

It seems like it may have slipped through the cracks in the Dagster migration? The ad-hoc "ETL" is almost a one-liner that converts the published geodatabase into SQLite; it lives in pudl.convert.censusdp1tract_to_sqlite.py
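For illustration, that conversion can be sketched as a thin wrapper that shells out to GDAL's ogr2ogr CLI. The flags and paths below are assumptions for the sketch, not necessarily the module's actual invocation:

```python
import subprocess
from pathlib import Path


def build_ogr2ogr_cmd(gdb_path: Path, sqlite_path: Path) -> list[str]:
    """Build an ogr2ogr command converting the Census DP1 geodatabase
    to SQLite. Flags are illustrative, not necessarily what PUDL uses.
    """
    return [
        "ogr2ogr",
        "-f", "SQLite",
        str(sqlite_path),  # destination SQLite database
        str(gdb_path),     # source Esri geodatabase
    ]


cmd = build_ogr2ogr_cmd(Path("census_dp1.gdb"), Path("censusdp1tract.sqlite"))
# subprocess.run(cmd, check=True)  # requires GDAL's ogr2ogr on the PATH
```

Because ogr2ogr does the real work, the Python side is mostly orchestration: fetch the archive, unpack it, and run the command.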

The database only has 3-4 tables in it (state, county, and tract-level demographic data). Should it be an asset? A multi-asset? A standalone resource like the FERC DBs?

I'm a little surprised that this didn't break anything in the tests or nightly builds. We currently distribute this converted DB along with the other SQLite DBs when the nightly builds succeed.

### Scope
- [ ] censusdp1tract.sqlite db is explicitly generated by Dagster, as opposed to by the ferc714_out fixture
### Next steps
- [ ] wrap pudl.convert.censusdp1tract_to_sqlite.censusdp1tract_to_sqlite in a @multi_asset
- [ ] new IOManager for censusdp1tract DB
bendnorman commented 1 year ago

The most recent dagster-asset-etl nightly build included the censusdp1tract.sqlite db. The gcp_pudl_etl.sh script on dev and dagster-asset-etl doesn't create the db using the censusdp1tract_to_sqlite command. Instead, the database is created by the ferc714_out fixture.

We should probably dagsterify the census ETL soon, but given that the db is still being created by the nightly builds on the dagster-asset-etl branch, I don't think this is blocking #2104.

bendnorman commented 1 year ago

As for how we would dagsterify the ETL, I think we could wrap pudl.convert.censusdp1tract_to_sqlite.censusdp1tract_to_sqlite in a @multi_asset and create a new IO manager that points to the census db:

```python
@multi_asset(
    outs={
        table_name: AssetOut(io_manager_key="census_sqlite_io_manager")
        for table_name in Package.get_etl_group_tables("census")
    },
    required_resource_keys={"dataset_settings", "datastore"},
)
def censusdp1tract_to_sqlite(context):
    # existing extraction code
    ...
    # This multi asset won't return any dataframes because ogr handles the
    # actual extraction. We might have to make all of the `outs` optional.
```

census_sqlite_io_manager will probably need to subclass SQLiteIOManager, like the FERC IO managers.
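Since ogr2ogr writes the database directly, the IO manager has little to do on the write side; its main job is reading tables back out of SQLite when downstream assets ask for them. A stdlib-only sketch of that load path (the table name and columns here are made up for the demo, and the real IO manager would return dataframes rather than tuples):

```python
import sqlite3
import tempfile
from pathlib import Path


def load_table(db_path: Path, table: str) -> list[tuple]:
    """Read all rows from one table of a SQLite DB that was populated
    externally (by ogr2ogr in the census case)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(f"SELECT * FROM {table}").fetchall()


# Demo with a throwaway database standing in for censusdp1tract.sqlite:
db = Path(tempfile.mkdtemp()) / "censusdp1tract.sqlite"
with sqlite3.connect(db) as conn:
    conn.execute("CREATE TABLE county (geoid TEXT, pop INTEGER)")
    conn.execute("INSERT INTO county VALUES ('08013', 330758)")

rows = load_table(db, "county")
```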

If we add the multi_asset to the main pudl.etl DAG, we could use a similar pattern to ferc_to_sqlite and load the census tables as SourceAssets in the main portion of the DAG.
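That SourceAsset pattern amounts to declaring one entry per census table, each wired to the census IO manager so the main DAG can depend on tables produced outside it. A plain-Python sketch of generating those declarations (the table list is hypothetical; the real set would come from Package.get_etl_group_tables("census"), and the dicts stand in for dagster.SourceAsset objects):

```python
# Hypothetical table list standing in for
# Package.get_etl_group_tables("census").
CENSUS_TABLES = ["state", "county", "tract"]


def census_source_asset_specs(
    io_manager_key: str = "census_sqlite_io_manager",
) -> list[dict]:
    """Build SourceAsset-style specs so the main pudl.etl DAG can depend
    on tables produced externally, mirroring the ferc_to_sqlite pattern."""
    return [
        {"key": table, "io_manager_key": io_manager_key}
        for table in CENSUS_TABLES
    ]


specs = census_source_asset_specs()
```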

zaneselvans commented 1 year ago

Okay, sounds like we can do this as part of the next Dagster phase!

e-belfer commented 1 year ago

In #2437 I'm removing the ferc714_out fixture that currently generates the census DB, and reading in the relevant 714 tables directly from PUDL where they are getting saved. I think this likely makes this issue a blocker for integrating #2437.

zaneselvans commented 1 year ago

I'm hopeful that this is actually a simple issue -- the code that does the transformation from GeoDB to SQLite now is a one-liner, though it does call out to an external CLI tool that's part of the open geospatial stack, IIRC. So I think it'll look more like the process we're using to generate the "source assets" for the FERC Form 1 SQLite DBs.