Closed: bendnorman closed this issue 1 year ago.
Are you thinking of this as a prelude to or instead of defining the simple denormalized tables as views? Or replacing these functions with functions that create the views e.g. using SQLAlchemy?
As a first pass, each output asset can be created using the existing python code. Then we can go through and replace the pandas code with python that creates views using SQLAlchemy. The assets that create SQL views won't return anything so we'll use the default IO manager, so empty tables aren't written to the database.
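Something like this rough sketch is what I have in mind for the view-creating assets (the asset/table names and hard-coded engine URL are assumptions; in practice the engine would come from a shared resource):

```python
# Hedged sketch of a view-creating asset. Names are hypothetical.
import sqlalchemy as sa
from dagster import asset


@asset(non_argument_deps={"plants_entity_eia", "plants_eia860"})
def denorm_plants_eia() -> None:
    """Define a SQL view instead of returning a dataframe.

    Because nothing is returned, the default IO manager has nothing to write,
    so no redundant table ends up in the database.
    """
    engine = sa.create_engine("sqlite:///pudl.sqlite")  # stand-in for a shared engine resource
    with engine.begin() as conn:
        conn.execute(sa.text(
            "CREATE VIEW IF NOT EXISTS denorm_plants_eia AS "
            "SELECT p860.*, pe.plant_name_eia, pe.state "
            "FROM plants_eia860 AS p860 "
            "JOIN plants_entity_eia AS pe USING (plant_id_eia)"
        ))
```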
Should this issue supersede #1178?
I made a comment in your output table spreadsheet (which seems like a kind of tracking document that could fit well into the new GitHub Projects architecture) about where the dividing line should be between creating a view (using SQL) and defining an asset in Python.
There seem to be a fair number of tables that are mostly simple enough to recreate using just SQL, but where we've done some backfilling or other imputation for a handful of columns where we need better coverage than the original data provides. Would it be simpler to separate out the few columns that need more work into a much smaller software-defined asset that has the same primary keys as the table they'll ultimately be joined with in a view?
E.g. we have the `generators_entity_eia` table, which gets joined with the `generators_eia860` table plus some information from the plants and utilities tables to produce a more usable denormalized view. But if there are some columns in the `generators_eia860` table that need imputing / backfilling, maybe there's an asset dedicated to producing just those columns, say filling in `prime_mover_code` and `energy_source_code`, which could be combined with `generators_eia860` in a view. And another asset that generates FIPS county codes for plants based on their lat/lon or county & state names, which is combined with the `plants_entity_eia` table in a view. And then those generator & plant views, which are compositions of big original EIA tables plus skinny imputed value tables, are what get brought together in an expansive `generators` view? Or alternatively the skinny imputed value tables could just be combined with all of the other tables at the same time when creating the expansive `generators` table.
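For concreteness, a sketch of that pattern might look like the following (table and column names are just illustrative, not a proposal for the real schema):

```python
# Illustrative only: a skinny "imputed columns" table keyed the same way as
# generators_eia860, then recombined with the big table in a view.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")  # assumption: local PUDL DB

# 1. Skinny asset: same primary key as generators_eia860, only the filled columns.
gens = pd.read_sql("generators_eia860", engine)
pk = ["plant_id_eia", "generator_id", "report_date"]
filled = gens[pk + ["prime_mover_code"]].sort_values("report_date")
filled["prime_mover_code"] = (
    filled.groupby(["plant_id_eia", "generator_id"])["prime_mover_code"].bfill()
)
filled.to_sql("imputed_generators_eia860", engine, if_exists="replace", index=False)

# 2. View recombining the original table with the skinny imputed one.
with engine.begin() as conn:
    conn.execute(sa.text(
        "CREATE VIEW IF NOT EXISTS denorm_generators_eia860 AS "
        "SELECT g.*, i.prime_mover_code AS prime_mover_code_filled "
        "FROM generators_eia860 AS g "
        "LEFT JOIN imputed_generators_eia860 AS i "
        "USING (plant_id_eia, generator_id, report_date)"
    ))
```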
It just seems a little weird to take a whole giant table and keep it in Python just because a single column within it requires more complex work. With that pattern it seems like eventually more and more tables would end up having to stay in Python, as we do more and more work to fill in missing values in various places.
I guess we'll also want to implement some kind of flagging column to annotate the imputed value columns, indicating whether the value is original or filled, and if it's filled what method was used to fill it.
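A minimal sketch of what those flag columns could look like (purely illustrative column names and fill method):

```python
# Purely illustrative: one possible shape for the provenance flags.
import pandas as pd

df = pd.DataFrame({"prime_mover_code": ["CT", None, "ST"]})
df["prime_mover_code_is_imputed"] = df["prime_mover_code"].isna()
df["prime_mover_code_fill_method"] = df["prime_mover_code_is_imputed"].map(
    {True: "backfilled_within_generator", False: "original"}
)
df["prime_mover_code"] = df["prime_mover_code"].bfill()
```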
I'm ignorant about dagster assets, but is the gist of this that we decorate each pudl output function with something like:
```python
import functools
from typing import Callable

import pandas as pd
import sqlalchemy as sa


def materialize_view(table_name: str, engine=pudl_engine, schema: str = "data_mart") -> Callable:
    # pudl_engine assumed to be a module-level SQLAlchemy engine
    # no schemas in sqlite so do it via namespace
    table_name = schema + "_" + table_name

    def decorator(func) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # if the table was already materialized, read it back instead of recomputing
            if sa.inspect(engine).has_table(table_name):  # or however this works
                return pd.read_sql(table_name, engine)
            table = func(*args, **kwargs)
            table.to_sql(table_name, engine, index=False)
            return table

        return wrapper

    return decorator


@materialize_view("our_awesome_table_of_plants")
def plants_eia860(*args, **kwargs):
    ...
```
Plus add all the output tables to the ETL process so they exist in the db. The decorator still returns the python table for backwards compatibility.
> It just seems a little weird to take a whole giant table and keep it in Python just because a single column within it requires more complex work.
This seems like one of those things that is obscenely inefficient in a relative sense ("one column out of 25?!") but in absolute terms might be only 50 MB of disk space (out of 900+) and 500ms run time (out of 2+ hours). Once we add these tables to sqlite maybe we can scope out how much storage/time this really costs and see if we want to prioritize that.
@zaneselvans I like the idea of splitting some output tables into two assets if a majority of the output table is just simple joins.
@TrentonBush the decorator you created is kind of what the asset decorator does! The asset decorator writes the dataframe returned by the function to the database. Check out the example output assets in the Dagster PR:
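Roughly, an output asset ends up looking like this (a hedged sketch; the IO manager key and table name below are placeholders, not the actual PUDL code):

```python
# Rough illustration: the returned dataframe is written to the database by
# whichever IO manager is attached (key below is a placeholder).
import pandas as pd
from dagster import asset


@asset(io_manager_key="pudl_sqlite_io_manager")
def our_awesome_table_of_plants() -> pd.DataFrame:
    """Whatever this returns gets persisted by the IO manager, no caching decorator needed."""
    return pd.DataFrame({"plant_id_eia": [1, 2], "plant_name_eia": ["A", "B"]})
```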
@TrentonBush It's not disk space that I'm concerned about here. I'm more interested in having a single source of truth for the values people are seeing in the DB. Otherwise I think we'll have to deal with bugs where for whatever reason the values in the two tables (original and merged) diverge.
I think separating out different imputations / calculations into different functions / assets will make it clearer where each operation is happening, and let us explicitly define the dependencies between them which will make those dependencies clearer, hopefully narrower in scope, and potentially allow more things to run in parallel.
I'm hoping it will also provide a more general and uniform pattern for how we add calculated / imputed values, rather than having some big functions where there's like 10 different things happening and maybe getting entangled with each other.
It might also be nice to have richer annotations about how a calculated / imputed column was generated in its "home" table (several columns of flags) rather than having a bunch of that kind of data provenance column in the denormalized table where everything is merged together. (Not sure about this though; maybe it would be better to pull all those annotations into the big table too.)
How do we want to handle the output tables that can be aggregated at different frequencies using the `freq` parameter? Should we only create the unaggregated version of the tables or create the AS and MS aggregates as well?
Also, I created discussion #2275 to flesh out how we want to organize all of our new tables.
If I'm interpreting correctly, the `@asset` decorator handles the materialization but not the caching. It's the caching that makes things backwards compatible (except for the function args like `freq`). It sounds like a lot of design decisions (and scope) here will depend on our commitment to backward compatibility. Did we make that decision yet?
@TrentonBush I'm not sure I understand what you're saying about the caching. If the tables are stored in the DB with all of the difficult calculations already done, then the DB effectively becomes the cache, and reading tables from the DB is fast. Not as fast as a dataframe that's already in memory, but way faster than performing the calculations. So if the same `PudlTabl` methods are preserved, but pull tables from the DB rather than from memory, hopefully they'll mostly just work? Though the methods that have arguments that control what exactly is done to the data will no longer have that flexibility. The underlying functions will still be available in the codebase though, and could be run with other arguments if need be.
I don't think we know to what extent other folks are using the `PudlTabl` objects, or how much use of the customized parameters is happening. I doubt we'll be able to ensure nothing breaks. Hopefully we can get @gschivley and @grgmiller to test drive some of these changes.
@bendnorman I had been imagining that we would store monthly and yearly aggregated versions of the tables in the DB -- with the DB effectively acting as a cache for all of these derived values. In a lot of the data tables there are only one or two time varying data columns that need to be aggregated, alongside the primary key values that are used for the aggregation (report date, plant ID, generator ID, boiler ID, maybe categoricals like energy source code). To the extent possible it seems like we want to keep the attributes that are associated with the IDs separate, and let them be merged with the aggregated data as need be for other calculations, or just pull a few useful stock columns for legibility in a denormalized view of the aggregated data tables. Otherwise any timeseries data that's reported by generator-month is going to have 100+ columns for all the attributes associated with generators and plants.
The aggregations get complicated (and slow, I think) when there are many different columns being aggregated in different ways -- sums, weighted averages, homogeneous categoricals, number of records in a group. I think the MCOE functions do a bunch of this.
We should probably make a list of what data columns are being aggregated in what tables now, and how.
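As a point of reference, here is a sketch of the kind of roll-up involved, assuming hypothetical PUDL-style column names: sums for the quantities, with the weighted-average heat rate recomputed from the summed totals.

```python
# Hypothetical example of the freq-style aggregation; column names are assumptions.
import pandas as pd


def aggregate_gen_fuel(gen_fuel: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Aggregate to monthly ("MS") or annual ("AS") frequency."""
    return (
        gen_fuel
        .set_index("report_date")
        .groupby(["plant_id_eia", pd.Grouper(freq=freq)])
        [["fuel_consumed_mmbtu", "net_generation_mwh"]]
        .sum()
        .assign(heat_rate_mmbtu_mwh=lambda df: df.fuel_consumed_mmbtu / df.net_generation_mwh)
        .reset_index()
    )


# monthly = aggregate_gen_fuel(gen_fuel, "MS"); annual = aggregate_gen_fuel(gen_fuel, "AS")
```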
Oh I think I misinterpreted the `@asset` decorator as writing the tables to the db but not reading them later (the other half of caching). Which makes no sense. Nevermind!
> To the extent possible it seems like we want to keep the attributes that are associated with the IDs separate, and let them be merged with the aggregated data as need be for other calculations, or just pull a few useful stock columns for legibility in a denormalized view of the aggregated data tables. Otherwise any timeseries data that's reported by generator-month is going to have 100+ columns for all the attributes associated with generators and plants.
@zaneselvans this seems like the kind of refactoring that should be a separate issue/process. The output tables will contain duplicate info from the data warehouse -- that's the nature of denormalization. I agree it's inefficient and overly complex, but it's legacy code. If it works today it is good enough to cache.
The problem we are trying to solve here is data access without Python. If we include big refactors in the scope, this is going to take weeks/months instead of days and soak up grant budget that could be spent on other things. There are a zillion legacy code issues we could fix; is this number one?
> I don't think we know to what extent other folks are using the `PudlTabl` objects
I think this is one of the compelling reasons why we shouldn't be investing heavily in refactoring here.
> I don't think we know to what extent other folks are using the `PudlTabl` objects, or how much use of the customized parameters is happening. I doubt we'll be able to ensure nothing breaks. Hopefully we can get @gschivley and @grgmiller to test drive some of these changes.
We use pudltabl objects quite extensively in OGE: https://github.com/singularity-energy/open-grid-emissions/search?l=Python&q=pudl_out
@grgmiller From those search results it looks like it would be pretty doable (and ultimately faster) to replace those `pudl_out` method calls with a `pd.read_sql()` selecting the necessary data, if it were already available directly within the DB. Does that seem right to you? Though I see you have `fill_tech_desc=False` in your `initialize_pudl_out()` function. Was there something wrong with that process?
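For example, I'd imagine a swap roughly like this (the denormalized table name is an assumption):

```python
# Hypothetical before/after, assuming a denormalized generators table exists in the DB.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")

# Before: gens = pudl_out.gens_eia860()
gens = pd.read_sql("denorm_generators_eia860", engine)  # table name is an assumption
```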
My primary use of PudlTabl is calculating heat rates. For other reasons I've been planning to cache my own outputs. I'd love to see a table of heat rates in PUDL but it's also difficult given the number of different timeseries that it could be calculated over.
> @grgmiller From those search results it looks like it would be pretty doable (and ultimately faster) to replace those `pudl_out` method calls with a `pd.read_sql()` selecting the necessary data, if it were already available directly within the DB. Does that seem right to you? Though I see you have `fill_tech_desc=False` in your `initialize_pudl_out()` function. Was there something wrong with that process?
Yes, I think it would be pretty easy to change how we access the data, as long as it is consistent and has the same output/structure as the pudl_out tables.
I think `fill_tech_desc` was just set to False by default; I actually don't remember what this parameter even does.
It backfills the generator technology description, which only started getting reported in 2013, but which is almost static, and provides more detail about what kind of generator it is (e.g. supercritical pulverized coal, CCNG with heat recovery, etc.)
For now the plan is just to load these "output" tables into the database so they can be accessed directly without needing to use all the Python stuff, so there shouldn't be any structural or contents changes. To get a particular year or years' worth of data you'd want to do a `SELECT` based on `report_date`, but that's easy.
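E.g. something like this (table name is hypothetical):

```python
# Sketch: pulling a single year directly, assuming the table and column names below.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")
plants_2021 = pd.read_sql(
    "SELECT * FROM denorm_plants_eia860 "
    "WHERE report_date BETWEEN '2021-01-01' AND '2021-12-31'",
    engine,
)
```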
Given that we are rapidly approaching the end of the quarter and are behind on this, I am feeling more sympathetic to @TrentonBush's position.
I'm wondering if we might want to experiment with just writing our existing dataframe outputs into the DB, so we can at least have some version of the derived values available in the most easily accessible place. I don't know for sure that this would be much easier, but if it were much easier, maybe we could run through and do it as a rough first cut, and then go through gradually replacing the denormalized tables with views over time?
I don't think dumping the dataframes and retaining `PudlTabl` is a good long-term solution, but it might let us get a rough draft out the door.
Welp, that only took a year! 😬
Description
This Epic continues the work started in #2246.
Currently we provide access to human-readable denormalized tables using a software abstraction. This adds a layer of complexity and requires users to use Python. It can also be slow and memory intensive. Instead, for simple derived values and denormalized tables we can provide this type of output by defining views (stored queries) inside the databases we generate and distribute.
Dagster allows us to create dependencies between assets by specifying the asset name in the asset function parameters. For example:
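(A minimal illustrative sketch; the asset names below are placeholders rather than actual PUDL assets.)

```python
# Dagster sees that denorm_plants_eia's parameters match the names of the two
# upstream assets, so it loads their outputs and passes them in as dataframes.
import pandas as pd
from dagster import asset


@asset
def plants_entity_eia() -> pd.DataFrame:
    return pd.DataFrame({"plant_id_eia": [1], "plant_name_eia": ["Example Plant"]})


@asset
def plants_eia860() -> pd.DataFrame:
    return pd.DataFrame({"plant_id_eia": [1], "capacity_mw": [100.0]})


@asset
def denorm_plants_eia(plants_entity_eia: pd.DataFrame, plants_eia860: pd.DataFrame) -> pd.DataFrame:
    return plants_eia860.merge(plants_entity_eia, on="plant_id_eia", how="left")
```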
Motivation
In Scope
- Convert the existing output table functions into Dagster assets using `@asset` decorators.
- Update `PudlTabl` to pull the new output tables from the database.
- Separate out imputed columns like `technology_description` in `generators_eia860` or `balancing_authority_code_eia` in `plants_eia860` into their own assets, if and only if it is straightforward to disentangle them.
Victory Conditions
- The `PudlTabl` class is no longer performing any complex work, and simply preserves the earlier interface while pulling assets from the database for users.
Out of Scope