Open cmgosnell opened 7 months ago
Just to note here: when I was trying to integrate 2019 and 2020, there was data in two columns (`revision_num` and `update_date`) that indicate info about the revision of the data. Those columns were completely null for the more recent years, so I took this out after removing 2019 and 2020 from the integration. But if we ever go back and tackle this weird thing and integrate 2019 and 2020, we could add back in this very small normalized table
in `Normalizer`:

    revisions: TableNormalizer = TableNormalizer(
        idx=["report_year"],
        columns=["revision_num", "update_date"],
    )

and the asset definition:

    @asset
    def core_nrelatb__yearly_revisions(
        _core_nrelatb__transform_start: pd.DataFrame,
    ) -> pd.DataFrame:
        """Transform small table including which revision the data pertains to and when it was updated."""
        return transform_normalize(_core_nrelatb__transform_start, Normalizer().revisions)

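For anyone picking this back up, here is a rough sketch of what a normalized revisions table could look like. It does not reproduce the real `transform_normalize`, whose behavior may differ. `normalize_table` is a hypothetical stand-in, and the input frame is toy data, not real ATB records.

```python
import pandas as pd


def normalize_table(df: pd.DataFrame, idx: list[str], columns: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for a table normalizer (not the real transform_normalize)."""
    # Keep only the key and the normalized columns, one row per distinct combination.
    out = df[idx + columns].drop_duplicates()
    # Drop rows where every normalized column is null (e.g. the more recent years).
    out = out.dropna(subset=columns, how="all")
    # The key should uniquely identify each row in the normalized table.
    if out.duplicated(subset=idx).any():
        raise ValueError(f"{idx} does not uniquely identify rows")
    return out.reset_index(drop=True)


# Toy input: made-up values, only the column names come from the snippet above.
raw = pd.DataFrame(
    {
        "report_year": [2019, 2019, 2020, 2021],
        "revision_num": [1, 1, 2, None],
        "update_date": ["2019-08-01", "2019-08-01", "2020-07-15", None],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

revisions = normalize_table(raw, idx=["report_year"], columns=["revision_num", "update_date"])
print(revisions)
```

With the toy input, only 2019 and 2020 survive, because the 2021 row has nulls in both revision columns.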
In trying to integrate NREL ATB, I ran into an oddity that made it difficult to integrate the data from 2019 and 2020. Because of this we are not integrating these years of ATB. All of this exploration was based on this semi-cleaned asset: _core_nrelatb__transform_start.pkl.zip
There is a column called `core_metric_case` which is always either "Market" or "R&D". Then there is another column called `core_metric_key`, which is a composite (semi-)primary key column containing codes representing info stored in other columns in the data. The first character of the `core_metric_key` is always either `R` or `M`. We've called this first letter of the `core_metric_key` the `mystery_code`. We and other collaborators thought this corresponded to the `core_metric_case`. It does ~75% of the time.

I'll note that the `core_metric_key` seems to have changed structure over time, especially in 2023. Also, the `mystery_code` never deviated from the `core_metric_case` in 2023.

This would be fine if the values in the `value` column did not vary based on the `mystery_code` (we could drop fully duplicate records w/o this `mystery_code` or the `core_metric_key`). But the data does seem to have truly different values. Of the three data tables which are derived from the info in the `value` column, two tables have real variability in `value` based on the `mystery_code`. The records that are variable by `mystery_code` make up 12% of the `core_nrelatb__yearly_projections_by_scenario` table and 5% of the `core_nrelatb__yearly_rates_projections` table.
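
A minimal sketch of the two checks described above, in case it helps someone reproduce them. It runs on toy data, not the real pickle. The `M` → "Market" / `R` → "R&D" mapping is the hypothesis being tested, and treating the rest of the key (`key_rest`, a made-up name) as the record identifier is an assumption for illustration only, since the real key structure changes across years.

```python
import pandas as pd

# Toy data: made-up keys and values, only the column names come from the issue.
df = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2021, 2021, 2023, 2023],
        "core_metric_key": ["MAAA", "RAAA", "MBBB", "RBBB", "MCCC", "RCCC"],
        "core_metric_case": ["Market", "Market", "R&D", "R&D", "Market", "R&D"],
        "value": [1.0, 1.5, 2.0, 2.0, 3.0, 4.0],
    }
)

# The mystery_code is the first character of the composite key.
df["mystery_code"] = df["core_metric_key"].str[0]

# Check 1: how often does the mystery_code agree with core_metric_case?
expected_case = df["mystery_code"].map({"M": "Market", "R": "R&D"})
df["code_matches_case"] = expected_case == df["core_metric_case"]
match_share = df.groupby("report_year")["code_matches_case"].mean()
print(match_share)

# Check 2: does value differ between records that are identical apart from
# the mystery_code? ASSUMPTION: the remainder of the key identifies the record.
df["key_rest"] = df["core_metric_key"].str[1:]
group_cols = ["report_year", "core_metric_case", "key_rest"]
df["varies_by_code"] = df.groupby(group_cols)["value"].transform("nunique") > 1
print(f"share of records varying by mystery_code: {df['varies_by_code'].mean():.0%}")
```

If `varies_by_code` were False everywhere, dropping the `mystery_code` and de-duplicating would be safe. Any True rows are the ones that block that shortcut.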