catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Missing retirement dates from 860m data #2834

Closed arengel closed 1 year ago

arengel commented 1 year ago

Describe the bug

The planned_generator_retirement_date and generator_retirement_date columns are empty for monthly_update generator data.

Bug Severity

How badly is this bug affecting you?

To Reproduce

Read the data from the database downloaded Sept 5, 2023 at noon.

import pandas as pd
import sqlalchemy as sa

gens = pd.read_sql_table(
    table_name="generators_eia860",
    con=sa.create_engine("sqlite:////.../pudl.sqlite").connect(),
).query("data_maturity == 'monthly_update'")
gens_denorm = pd.read_sql_table(
    table_name="denorm_generators_eia",
    con=sa.create_engine("sqlite:////.../pudl.sqlite").connect(),
).query("data_maturity == 'monthly_update'")

Then looking at the specific columns, all the values are pd.NaT.

gens.planned_generator_retirement_date.unique()
Out[1]: array(['NaT'], dtype='datetime64[ns]')

gens_denorm.planned_generator_retirement_date.unique()
Out[2]: array(['NaT'], dtype='datetime64[ns]')

gens.generator_retirement_date.unique()
Out[3]: array(['NaT'], dtype='datetime64[ns]')

gens_denorm.generator_retirement_date.unique()
Out[4]: array(['NaT'], dtype='datetime64[ns]')

Expected behavior

These columns would be populated with retirement and planned retirement data from EIA 860 monthly.

Software Environment?

zaneselvans commented 1 year ago

Whatever the issue ends up being here, we should make sure we add a new data validation test that checks for entirely NULL values in these columns when selecting just monthly update records.

aesharpe commented 1 year ago

The monthly EIA data that gets subsumed into tables like the generators_eia860 is supposed to have the months delineated in the report_date column, correct? Like 2023-06-01? Unless the data_maturity is not monthly_updates in which case it would all say 2023-01-01. I assumed this was already happening, but it looks like the raw EIA860m data extraction does not preserve the month, just the year.

I'm looking at this specific part of process_raw for the EIA860m extractor:

    def process_raw(self, df, page, **partition):
        """Adds source column and report_year column if missing."""
        df = df.rename(columns=self._metadata.get_column_map(page, **partition))
        if "report_year" not in df.columns:
            df["report_year"] = datetime.strptime(
                list(partition.values())[0], "%Y-%m"
            ).year

In the db table all the monthly data has the same date: 2023-01-01 which is wrong. I feel like this issue can't have always been there...am I crazy?

zaneselvans commented 1 year ago

No, there shouldn't be any monthly resolution data in the generators_eia860 table (even though the report_date column contains a full date)

The original report dates are monthly, since it's a monthly update, but the data that they're reporting is being modeled as having annual frequency within our database, so I think that just retaining the year is appropriate in the extraction here. We're just using the most recent monthly update to indicate an annual value for the new year that hasn't yet been completely reported. And we only ever use a single monthly snapshot from the EIA-860m in the ETL -- we aren't ever extracting more than one year-month of EIA-860m data.

So I think all of the most recently updated EIA-860M data should have 2023-01-01 as its report date.

Is there any other way that the monthly frequency information might be sneaking in? I could see that messing up a merge later on down the line that causes all of these values to get lost.

aesharpe commented 1 year ago

Looks like the issue had to do with the 860m data column maps. The monthly data retirement dates were getting mapped to retirement_year and retirement_month instead of generator_retirement_year and generator_retirement_month in the mapping spreadsheets. Only the generator_retirement column is included in the metadata for the cleaned version of the generators_eia860 so all the values for retirement date from the monthly data were getting dropped.

I updated the spreadsheet to map to generator_retirement_year and generator_retirement_month which should fix the problem.

zaneselvans commented 1 year ago

I went ahead and merged #2835 so this can be fixed in the builds tonight. @arengel hopefully tomorrow morning you can grab a new DB with the updated retirement dates.

zaneselvans commented 1 year ago

Unfortunately, we now have a problem with building the Docker container that runs the nightly builds. Looking into it now. We can run the ETL locally and hand off a fresh PUDL DB if need be.