catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
488 stars 113 forks source link

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

Open cmgosnell opened 7 months ago

cmgosnell commented 7 months ago

In trying to integrate NREL ATB, I ran into an oddity that made it difficult to integrate the data from 2019 and 2020. Because of this we are not integrating these years of ATB. All of this exploration was based on this semi-cleaned asset: _core_nrelatb__transform_start.pkl.zip

There is a column called core_metric_case which is always either "Market" or "R&D". Then there is another column called core_metric_key which is a composite (semi-)primary key column containing codes representing info stored in other columns in the data. The first character of the core_metric_key is always either Ror M. We've called this first letter of the core_metric_key the mystery_code. We and other collaborators thought this corresponded to the core_metric_case. It does the ~75% of the time:

>>> _core_nrelatb__transform_start[["mystery_code", "core_metric_case"]].value_counts()
mystery_code  core_metric_case
R             R&D                 728007
M             Market              453201
R             Market              330483
M             R&D                  55677
Name: count, dtype: int64

I'll note that the core_metric_key seems to have changed structure over time - especially in 2023. Also, the mystery_code never deviated from the core_metric_case in 2023.

This would be fine if the values in the value column did not vary based on the mystery_code (we could drop fully duplicate records w/o this mystery_code or the core_metric_key). But the data does seem to value truly different values. Of the three data tables which are derived from the info in the value column, two tables have real variability in value based on the mystery_code. The records that are variable by mystery_code make up 12% of the core_nrelatb__yearly_projections_by_scenario table and 5% of the core_nrelatb__yearly_rates_projections table.

image image

cmgosnell commented 7 months ago

just to note here, when i was trying to integrate 2019 and 2020 there is data in these two columns that indicate info about the revision of the data. those columns were completely null for the more recent years, so I took this out after removing 2019 and 2020 from the integration. but if we ever go back and tackle this weird thing and integrate 2019 and 2020 we could add back in this very small normalized table

in Normalizer:

revisions: TableNormalizer = TableNormalizer(
        idx=["report_year"],
        columns=["revision_num", "update_date"],
    )

and the asset definition:

@asset
def core_nrelatb__yearly_revisions(
    _core_nrelatb__transform_start: pd.DataFrame,
) -> pd.DataFrame:
    """Transform small table including which revision the data pertains to and when it was updated."""
    return transform_normalize(_core_nrelatb__transform_start, Normalizer().revisions)