grgmiller closed this issue 1 year ago
We rarely use only a subset of the years, and when we do it's typically the most recent year (for integration testing). The data required for the value filling is only present in the later years, so I would not expect this to work -- but we should have a warning or a useful error message indicating what's happening.
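A minimal sketch of what such a check could look like. The function name `check_backfill_years` and the constant `FIRST_YEAR_WITH_BA_CODES` are hypothetical and not part of PUDL, and the 2013 cutoff is only an assumption based on the years discussed in this issue.

```python
# Assumption: 2013 is the first year that carries the values used for backfilling.
FIRST_YEAR_WITH_BA_CODES = 2013


def check_backfill_years(start_year: int, end_year: int) -> None:
    """Raise a descriptive error if the requested years cannot be backfilled."""
    if start_year > end_year:
        raise ValueError(f"start_year ({start_year}) is after end_year ({end_year}).")
    if end_year < FIRST_YEAR_WITH_BA_CODES:
        raise ValueError(
            f"Requested years {start_year}-{end_year} contain no data to backfill from. "
            f"Include at least one year >= {FIRST_YEAR_WITH_BA_CODES}, "
            f"e.g. {start_year}-{FIRST_YEAR_WITH_BA_CODES}."
        )


check_backfill_years(2005, 2013)  # passes silently

try:
    check_backfill_years(2012, 2012)  # a single pre-2013 year, as in this issue
except ValueError as err:
    message = str(err)
    print(message)
```

A check like this would turn the KeyError into a message that tells the user which years to add.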
However, we (primarily @bendnorman see #1973) are in the process of refactoring the ETL process to load all of the output tables directly into the database so that users don't need to bother with all of the software that's currently layered on top of the base PUDL DB. This will include all of the backfilled / imputed / calculated values, and when that's done you would be able to simply select the range of dates you want from that table of derived values.
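As a rough illustration of what selecting a date range from such a table might look like, here is a sketch against an in-memory SQLite database. The table name `example_derived_gens` and its columns are hypothetical stand-ins; the actual table and column names in the refactored database may differ.

```python
import sqlite3

# Hypothetical stand-in for a derived output table with already-filled values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_derived_gens (plant_id INTEGER, report_date TEXT, ba_code TEXT)"
)
conn.executemany(
    "INSERT INTO example_derived_gens VALUES (?, ?, ?)",
    [
        (1, "2005-01-01", "MISO"),
        (1, "2012-01-01", "MISO"),
        (1, "2013-01-01", "MISO"),
    ],
)

# Select only the dates of interest; no extra years need to be loaded by the user.
rows = conn.execute(
    "SELECT plant_id, report_date, ba_code FROM example_derived_gens "
    "WHERE report_date BETWEEN ? AND ? ORDER BY report_date",
    ("2012-01-01", "2012-12-31"),
).fetchall()
conn.close()
print(rows)
```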
In the meantime you'll need to include at least one year of data that contains the values being used to backfill. The more years with data you include, the more information the filling process will have to work with. These processes also assume a contiguous block of years, so yes, if you wanted just 2005, you'd need at least 2005-2013.
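To illustrate why a single early year can't be filled, here is a toy sketch of backfilling followed by filtering to the year of interest. This is not PUDL's actual implementation of the filling; the record layout and the helper names `backfill_ba_codes` and `select_year` are made up for the example.

```python
from collections import defaultdict

# Toy records: (plant_id, report_year, ba_code). None marks a missing code.
records = [
    (1, 2012, None),
    (1, 2013, "MISO"),
    (1, 2014, "MISO"),
    (2, 2012, None),
    (2, 2013, None),
    (2, 2014, "PJM"),
]


def backfill_ba_codes(rows):
    """Fill each missing ba_code with the next later non-missing code for the same plant."""
    by_plant = defaultdict(list)
    for plant_id, year, code in rows:
        by_plant[plant_id].append((year, code))

    filled = []
    for plant_id, entries in by_plant.items():
        next_code = None
        # Walk from the latest year to the earliest, carrying the code backward.
        for year, code in sorted(entries, key=lambda e: e[0], reverse=True):
            if code is not None:
                next_code = code
            filled.append((plant_id, year, next_code))
    return sorted(filled, key=lambda r: (r[0], r[1]))


def select_year(rows, year):
    """Keep only the rows for the year of interest."""
    return [row for row in rows if row[1] == year]


# Loading 2012 alone leaves nothing to fill from.
only_2012 = select_year(backfill_ba_codes(select_year(records, 2012)), 2012)
# Loading 2012-2014, filling, then filtering recovers the codes.
with_later_years = select_year(backfill_ba_codes(records), 2012)

print(only_2012)
print(with_later_years)
```

The same pattern applies to the workaround: load the contiguous block of years, let the filling run, then keep only the year you care about.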
Thanks - this is helpful. For now we'll try loading with multiple years. Any idea on the approximate timeline for this refactored ETL process to be complete?
The goal is to switch the base DB ETL over to using Dagster by the end of January (see #2104; @bendnorman will have the best idea of the expected delivery date), and then get all of the derived values / imputations / denormalized tables integrated by the end of Q1. We'd preserve the deprecated pudl_out interface for backward compatibility (but have it just read the new tables from the DB) for a while before going to just distributing the database without the ginormous python package that runs the ETL.
@zaneselvans's description of the expected timeline is accurate. We are planning on structuring the refactor so output tables can be converted one at a time. We can prioritize converting the gf_eia923 table so you won't run into this bug.
Thanks! We primarily work with gens_eia860, plants_eia860, gf_eia923, bf_eia923, gen_original_eia923, and bga_eia860.
Sounds good! We'll try to tackle those tables first.
Shall we close this since there's a broader fix in the works?
In our OGE pipeline, we are trying to load EIA data for years prior to 2013 by running pudl_out.gens_eia860() and pudl_out.gf_eia923(). However, when we run this, we get the following KeyErrors.
It looks like this issue originates in fill_in_missing_ba_codes, which notes that:
I suspect this happens because we are trying to load data for a single year (2012) rather than running pudl_out with a range of years that includes multiple years after 2012 to backfill. I recognize that you're often running the pudl pipeline for multiple years instead of a single year, but is there any way to make this work for a single year prior to 2013, or is the only solution to load newer data as well?
How many years of new data would we need to load? Just 2013? All years 2013-2021? Since pudl_out takes a start date and end date argument, if we just wanted to load data for 2005, does this mean that in order for this function to work, we would actually have to load all data for 2005-2013+, even though we are only interested in a single year?
Trying to load EIA-860 data for 2012: