catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Convert output and analysis tables to assets #1973

Closed · bendnorman closed this issue 8 months ago

bendnorman commented 1 year ago
- [ ] #2265
- [ ] #2266
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2339
- [ ] #2433
- [ ] #2434
- [ ] #2436
- [ ] #2412
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2437
- [ ] #2435
- [ ] #2438
- [ ] #2439
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2517

Description

This Epic continues the work started in #2246.

Currently we provide access to human-readable denormalized tables through a software abstraction (the PudlTabl output layer). This adds a layer of complexity and requires users to use Python; it can also be slow and memory intensive. Instead, for simple derived values and denormalized tables, we can provide this kind of output by defining views (stored queries) inside the databases we generate and distribute.
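For example, a denormalized output like the plant/utility merge shown below could instead live in pudl.sqlite as a stored view. A minimal sketch using SQLAlchemy (the view name and column list are illustrative, not a final schema):

import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")

# A denormalized "output" table expressed as a view (stored query) inside the
# database itself, rather than as Python that merges dataframes at read time.
create_view = """
CREATE VIEW IF NOT EXISTS plants_utils_eia860 AS
SELECT
    p.report_date,
    p.plant_id_eia,
    p.plant_name_eia,
    p.utility_id_eia,
    u.utility_name_eia
FROM plants_eia860 AS p
JOIN utilities_eia860 AS u
    ON p.utility_id_eia = u.utility_id_eia
    AND p.report_date = u.report_date;
"""
with engine.begin() as conn:
    conn.execute(sa.text(create_view))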

Dagster allows us to create dependencies between assets by specifying the asset name in the asset function parameters. For example:

import pandas as pd
from dagster import asset


@asset(io_manager_key="pudl_sqlite_io_manager")
def plants_utils_eia860(
    plants_eia860: pd.DataFrame,
    utilities_eia860: pd.DataFrame
) -> pd.DataFrame:
    """Read the EIA Plants and Utilities tables from the PUDL DB, merge them together, and write to PUDL DB."""
    pu_eia860 = pd.merge(
        plants_eia860,
        utilities_eia860,
        on=["plant_id_eia", "utility_id_eia", "report_date"]
    )
    return pu_eia860

Motivation

In Scope

Victory Conditions

Out of Scope

zaneselvans commented 1 year ago

Are you thinking of this as a prelude to or instead of defining the simple denormalized tables as views? Or replacing these functions with functions that create the views e.g. using SQLAlchemy?

bendnorman commented 1 year ago

As a first pass, each output asset can be created using the existing Python code. Then we can go through and replace the pandas code with Python that creates views using SQLAlchemy. The assets that create SQL views won't return anything, so we'll use the default IO manager and no empty tables get written to the database.
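In other words, something like the plants_utils_eia860 example from the description, re-expressed as a view-creating asset. A rough sketch only: the engine wiring here is a placeholder, and the real version would presumably get the engine from a Dagster resource.

import sqlalchemy as sa
from dagster import asset


@asset(non_argument_deps={"plants_eia860", "utilities_eia860"})
def plants_utils_eia860() -> None:
    """Create a SQL view instead of materializing a dataframe.

    Nothing is returned, so the default IO manager is used and no empty
    table gets written to the database.
    """
    # placeholder wiring; in practice the engine would come from a resource
    engine = sa.create_engine("sqlite:///pudl.sqlite")
    with engine.begin() as conn:
        conn.execute(sa.text(
            "CREATE VIEW IF NOT EXISTS plants_utils_eia860 AS "
            "SELECT * FROM plants_eia860 "
            "JOIN utilities_eia860 USING (utility_id_eia, report_date)"
        ))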

zaneselvans commented 1 year ago

Should this issue supersede #1178?

I made a comment in your output table spreadsheet (which seems like a kind of tracking document that could fit well into the new GitHub Projects architecture) about where the dividing line should be between creating a view (using SQL) and defining an asset in Python.

There seem to be a fair number of tables that are mostly simple enough to recreate using just SQL, but where we've done some backfilling or other imputation for a handful of columns where we need better coverage than the original data provides. Would it be simpler to separate out the few columns that need more work into a much smaller software-defined asset that has the same primary keys as the table they'll ultimately be joined with in a view?

E.g. we have the generators_entity_eia table, which gets joined with the generators_eia860 table plus some information from the plants and utilities tables to produce a more usable denormalized view. But if there are some columns in the generators_eia860 table that need imputing / backfilling, maybe there's an asset dedicated to producing just those columns, say filling in prime_mover_code and energy_source_code, which could be combined with generators_eia860 in a view. And another asset that generates FIPS county codes for plants based on their lat/lon or county & state names, which is combined with the plants_entity_eia table in a view. And then those generator & plant views, which are compositions of big original EIA tables + skinny imputed value tables, are what get brought together in an expansive generators view? Or alternatively the skinny imputed value tables could just be combined with all of the other tables at the same time when creating the expansive generators table.
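A rough sketch of that split, with all the names made up (the skinny asset, the filled column list, the view name), just to make the shape of the pattern concrete:

import pandas as pd
from dagster import asset


@asset(io_manager_key="pudl_sqlite_io_manager")
def generators_eia860_imputed(generators_eia860: pd.DataFrame) -> pd.DataFrame:
    """Skinny asset: primary key columns plus only the imputed/backfilled values."""
    key = ["plant_id_eia", "generator_id"]
    filled = ["prime_mover_code", "energy_source_code_1"]
    gens = generators_eia860.sort_values("report_date").copy()
    # fill the nearly-static columns forward/backward within each generator
    gens[filled] = gens.groupby(key)[filled].transform(lambda s: s.ffill().bfill())
    return gens[key + ["report_date"] + filled]


# A view can then recombine the skinny table with the big original table, so
# the filled values keep a single source of truth, e.g.:
#
#   CREATE VIEW denorm_generators_eia860 AS
#   SELECT g.*, i.prime_mover_code AS prime_mover_code_filled,
#          i.energy_source_code_1 AS energy_source_code_1_filled
#   FROM generators_eia860 AS g
#   LEFT JOIN generators_eia860_imputed AS i
#       USING (plant_id_eia, generator_id, report_date);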

It just seems a little weird to take a whole giant table and keep it in Python just because a single column within it requires more complex work. With that pattern it seems like eventually more and more tables would end up having to stay in Python, as we do more and more work to fill in missing values in various places.

I guess we'll also want to implement some kind of flagging column to annotate the imputed value columns, indicating whether the value is original or filled, and if it's filled what method was used to fill it.

TrentonBush commented 1 year ago

I'm ignorant about dagster assets, but is the gist of this that we decorate each pudl output function with something like:

import functools

import pandas as pd
import sqlalchemy as sa

# assumes a module-level `pudl_engine` (a sqlalchemy Engine) already exists

def materialize_view(table_name: str, engine=pudl_engine, schema='data_mart') -> callable:
    # no schemas in sqlite, so namespace via a table name prefix
    table_name = schema + '_' + table_name
    def decorator(func) -> callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # if the table was already materialized, read the cached copy from the DB
            if sa.inspect(engine).has_table(table_name):
                return pd.read_sql_table(table_name, engine)
            # otherwise compute it, write it to the DB, and return it
            table = func(*args, **kwargs)
            table.to_sql(table_name, engine, index=False)
            return table
        return wrapper
    return decorator

@materialize_view('our_awesome_table_of_plants')
def plants_eia860(*args, **kwargs):
    ...

Plus add all the output tables to the ETL process so they exist in the db. The decorator still returns the python table for backwards compatibility.

TrentonBush commented 1 year ago

It just seems a little weird to take a whole giant table and keep it in Python just because a single column within it requires more complex work.

This seems like one of those things that is obscenely inefficient in a relative sense ("one column out of 25?!") but in absolute terms might be only 50 MB of disk space (out of 900+) and 500ms run time (out of 2+ hours). Once we add these tables to sqlite maybe we can scope out how much storage/time this really costs and see if we want to prioritize that.

bendnorman commented 1 year ago

@zaneselvans I like the idea of splitting some output tables into two assets if a majority of the output table is just simple joins.

@TrentonBush the decorator you created is kind of what the asset decorator does! The asset decorator writes the dataframe returned by the function to the database. Check out the example output assets in the Dagster PR:

https://github.com/catalyst-cooperative/pudl/blob/0a9ed54cd8eec717941affa71e2efc08647d51c9/src/pudl/etl/output_assets.py#L35-L42

zaneselvans commented 1 year ago

@TrentonBush It's not disk space that I'm concerned about here. I'm more interested in having a single source of truth for the values people are seeing in the DB. Otherwise I think we'll have to deal with bugs where for whatever reason the values in the two tables (original and merged) diverge.

I think separating out different imputations / calculations into different functions / assets will make it clearer where each operation is happening, and let us explicitly define the dependencies between them, which should make those dependencies clearer, hopefully narrower in scope, and potentially allow more things to run in parallel.

I'm hoping it will also provide a more general and uniform pattern for how we add calculated / imputed values, rather than having some big functions where there's like 10 different things happening and maybe getting entangled with each other.

It might also be nice to have richer annotations about how a calculated / imputed column was generated in its "home" table (several columns of flags) rather than having a bunch of those data provenance columns in the denormalized table where everything is merged together. (Not sure about this though; maybe it would be better to pull all those annotations into the big table too.)

bendnorman commented 1 year ago

How do we want to handle the output tables that can be aggregated at different frequencies using the freq parameter? Should we only create the unaggregated version of the tables or create the AS and MS aggregates as well?

bendnorman commented 1 year ago

Also, I created discussion #2275 to flesh out how we want to organize all of our new tables.

TrentonBush commented 1 year ago

If I'm interpreting correctly, the @asset decorator handles the materialization but not the caching, and it's the caching that makes things backwards compatible (except for the function args like freq). It sounds like a lot of design decisions (and scope) here will depend on our commitment to backward compatibility. Did we make that decision yet?

zaneselvans commented 1 year ago

@TrentonBush I'm not sure I understand what you're saying about the caching. If the tables are stored in the DB with all of the difficult calculations already done, then the DB effectively becomes the cache, and reading tables from the DB is fast. Not as fast as a dataframe that's already in memory, but way faster than performing the calculations. So if the same PudlTabl methods are preserved, but pull tables from the DB rather than from memory, hopefully they'll mostly just work? Though the methods that have arguments that control what exactly is done to the data will no longer have that flexibility. The underlying functions will still be available in the codebase though, and could be run with other arguments if need be.

I don't think we know to what extent other folks are using the PudlTabl objects, or how much use of the customized parameters is happening. I doubt we'll be able to ensure nothing breaks. Hopefully we can get @gschivley and @grgmiller to test drive some of these changes.

@bendnorman I had been imagining that we would store monthly and yearly aggregated versions of the tables in the DB -- with the DB effectively acting as a cache for all of these derived values. In a lot of the data tables there are only one or two time varying data columns that need to be aggregated, alongside the primary key values that are used for the aggregation (report date, plant ID, generator ID, boiler ID, maybe categoricals like energy source code). To the extent possible it seems like we want to keep the attributes that are associated with the IDs separate, and let them be merged with the aggregated data as need be for other calculations, or just pull a few useful stock columns for legibility in a denormalized view of the aggregated data tables. Otherwise any timeseries data that's reported by generator-month is going to have 100+ columns for all the attributes associated with generators and plants.

The aggregations get complicated (and slow, I think) when there are many different columns being aggregated in different ways -- sums, weighted averages, homogeneous categoricals, number of records in a group. I think the MCOE functions do a bunch of this.
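For example, a sketch of the kind of mixed aggregation involved, with purely illustrative column names and an assumed generator-month input table:

import pandas as pd


def aggregate_generation_fuel(df: pd.DataFrame, freq: str = "AS") -> pd.DataFrame:
    """Aggregate a generator-month table to the requested frequency.

    Sums the quantity columns and takes a generation-weighted average of the
    rate columns; "AS" and "MS" are the annual/monthly frequency aliases.
    """
    df = df.copy()
    # weighted-average numerator computed per row, then summed per group
    df["_wtd_heat_rate"] = df["heat_rate_mmbtu_mwh"] * df["net_generation_mwh"]
    grouped = df.groupby(
        ["plant_id_eia", "generator_id", pd.Grouper(key="report_date", freq=freq)]
    )
    out = grouped.agg(
        net_generation_mwh=("net_generation_mwh", "sum"),
        fuel_consumed_mmbtu=("fuel_consumed_mmbtu", "sum"),
        _wtd_heat_rate=("_wtd_heat_rate", "sum"),
    )
    out["heat_rate_mmbtu_mwh"] = out.pop("_wtd_heat_rate") / out["net_generation_mwh"]
    return out.reset_index()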

We should probably make a list of what data columns are being aggregated in what tables now, and how.

TrentonBush commented 1 year ago

Oh I think I misinterpreted the @asset decorator as writing the tables to the db but not reading them later (the other half of caching). Which makes no sense. Nevermind!

TrentonBush commented 1 year ago

To the extent possible it seems like we want to keep the attributes that are associated with the IDs separate, and let them be merged with the aggregated data as need be for other calculations, or just pull a few useful stock columns for legibility in a denormalized view of the aggregated data tables. Otherwise any timeseries data that's reported by generator-month is going to have 100+ columns for all the attributes associated with generators and plants.

@zaneselvans this seems like the kind of refactoring that should be a separate issue/process. The output tables will contain duplicate info from the data warehouse -- that's the nature of denormalization. I agree it's inefficient and overly complex, but it's legacy code. If it works today it is good enough to cache.

The problem we are trying to solve here is data access without Python. If we include big refactors in the scope, this is going to take weeks/months instead of days and soak up grant budget that could be spent on other things. There are a zillion legacy code issues we could fix; is this number one?

I don't think we know to what extent other folks are using the PudlTabl objects

I think this is one of the compelling reasons why we shouldn't be investing heavily in refactoring here.

grgmiller commented 1 year ago

I don't think we know to what extent other folks are using the PudlTabl objects, or how much use of the customized parameters is happening. I doubt we'll be able to ensure nothing breaks. Hopefully we can get @gschivley and @grgmiller to test drive some of these changes.

We use PudlTabl objects quite extensively in OGE: https://github.com/singularity-energy/open-grid-emissions/search?l=Python&q=pudl_out

zaneselvans commented 1 year ago

@grgmiller From those search results it looks like it would be pretty doable (and ultimately faster) to replace those pudl_out method calls with a pd.read_sql() selecting the necessary data, if it were already available directly within the DB. Does that seem right to you? Though I see you have fill_tech_desc=False in your initialize_pudl_out() function. Was there something wrong with that process?

gschivley commented 1 year ago

My primary use of PudlTabl is calculating heat rates. For other reasons I've been planning to cache my own outputs. I'd love to see a table of heat rates in PUDL but it's also difficult given the number of different timeseries that it could be calculated over.

grgmiller commented 1 year ago

@grgmiller From those search results it looks like it would be pretty doable (and ultimately faster) to replace those pudl_out method calls with a pd.read_sql() selecting the necessary data, if it were already available directly within the DB. Does that seem right to you? Though I see you have fill_tech_desc=False in your initialize_pudl_out() function. Was there something wrong with that process?

Yes, I think it would be pretty easy to change how we access the data, as long as it is consistent and has the same output/structure as the pudl_out tables.

I think fill_tech_desc was just set to False by default, I actually don't remember what this parameter even does.

zaneselvans commented 1 year ago

It backfills the generator technology description, which only started getting reported in 2013, but which is almost static, and provides more detail about what kind of generator it is (e.g. supercritical pulverized coal, CCNG with heat recovery, etc.).

For now the plan is just to load these "output" tables into the database so they can be accessed directly without needing to use all the Python stuff, so there shouldn't be any structural or content changes. To get a particular year's or years' worth of data you'd want to do a SELECT based on report_date, but that's easy.
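E.g. something along these lines (the table name here is just a guess at what the denormalized generators table might end up being called):

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")

# pull a couple of years of a denormalized output table straight from the DB,
# instead of building it through PudlTabl
gens = pd.read_sql(
    "SELECT * FROM denorm_generators_eia860 "
    "WHERE report_date BETWEEN '2019-01-01' AND '2020-12-31'",
    engine,
    parse_dates=["report_date"],
)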

zaneselvans commented 1 year ago

Given that we are rapidly approaching the end of the quarter and are behind on this, I am feeling more sympathetic to @TrentonBush's position.

I'm wondering if we might want to experiment with just writing our existing dataframe outputs into the DB, so we can at least have some version of the derived values available in the most easily accessible place. I don't know for sure that this would be much easier, but if it is, maybe we could run through and do it as a rough first cut, and then gradually replace the denormalized tables with views over time?

I don't think dumping the dataframes and retaining PudlTabl is a good long-term solution, but it might let us get a rough draft out the door.

zaneselvans commented 8 months ago

Welp, that only took a year! 😬