RMI-PACTA / workflow.data.preparation

The goal of `workflow.data.preparation` is to prepare all of the necessary data inputs for the Transition Monitor web application.
Other
2 stars 0 forks source link

implement `pacta.data.validation` #185

Closed cjyetman closed 6 months ago

cjyetman commented 8 months ago

closes #18

Most validation errors originally found (below) have been resolved. Validation of financial_data and abcd_flags_equity has been removed for now and will be added in the future.

investigation issues:

relevant fixes in pacta.data.validation:

relevant fix in pacta.data.preparation

validation of financial_data and abcd_flags_equity has been removed from this PR, and future intended implementation is tracked here


currently, a few of the validation checks fail using 2023Q4 config

root_dir <- "~/data/workflow-data-preparation-outputs/2023Q4_20240303T082642Z"

pacta.data.validation::validate_currencies(
  readRDS(file.path(root_dir, "currencies.rds"))
)

pacta.data.validation::validate_financial_data(
  readRDS(file.path(root_dir, "financial_data.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

pacta.data.validation::validate_masterdata_debt_datastore(
  readRDS(file.path(root_dir, "masterdata_debt_datastore.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 4 assertions failed:
#>  * Variable 'data$technology': must contain only valid technology
#>  * names, but has additional element "ICE Hydrogen_HDV".
#>  * Variable 'data$ald_production_unit': must contain only valid
#>  * production units, but has additional elements "# vehicles", "dwt
#>  * km", "t cement", "t coal", and "t steel".
#>  * Variable 'data$ald_emissions_factor_unit': must contain only valid
#>  * emissions factor units, but has additional elements "tCO2/dwt km",
#>  * "tCO2/km", "tCO2/pkm", "tCO2/tkm", "tCO2e/GJ", "tCO2e/MWh", "tCO2e/t
#>  * cement", "tCO2e/t coal", and "tCO2e/t steel".
#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

pacta.data.validation::validate_masterdata_ownership_datastore(
  readRDS(file.path(root_dir, "masterdata_ownership_datastore.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 4 assertions failed:
#>  * Variable 'data$technology': must contain only valid technology
#>  * names, but has additional element "ICE Hydrogen_HDV".
#>  * Variable 'data$ald_production_unit': must contain only valid
#>  * production units, but has additional elements "# vehicles", "dwt
#>  * km", "t cement", "t coal", and "t steel".
#>  * Variable 'data$ald_emissions_factor_unit': must contain only valid
#>  * emissions factor units, but has additional elements "tCO2/dwt km",
#>  * "tCO2/km", "tCO2/pkm", "tCO2/tkm", "tCO2e/GJ", "tCO2e/MWh", "tCO2e/t
#>  * cement", "tCO2e/t coal", and "tCO2e/t steel".
#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

pacta.data.validation::validate_abcd_flags_bonds(
  readRDS(file.path(root_dir, "abcd_flags_bonds.rds"))
)
#> Error in assert_valid_factset_entity_id(data[[col_name]], add = coll, : Assertion on 'data$credit_parent_id' failed: must contain only valid FactSet entity IDs, but has additional element NA.

pacta.data.validation::validate_abcd_flags_equity(
  readRDS(file.path(root_dir, "abcd_flags_equity.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".
jdhoffa commented 8 months ago

💯

jdhoffa commented 8 months ago
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

AFAIK, those are valid ISINs, they are just ISINs of private securities (venture capital, and things like that). Had a look at this when I was looking into Private equity, it's sorta cool that we have them actually! I would loosen the restrictions on the expectation there personally (but maybe ask Nick for a second opinion?)

But in any case, TM doesn't publicize that it can cover private equity, so I think it's totally fine to just filter them out.

jdhoffa commented 8 months ago
#> Error in base::tryCatch(base::withCallingHandlers({: 4 assertions failed:
#>  * Variable 'data$technology': must contain only valid technology
#>  * names, but has additional element "ICE Hydrogen_HDV".
#>  * Variable 'data$ald_production_unit': must contain only valid
#>  * production units, but has additional elements "# vehicles", "dwt
#>  * km", "t cement", "t coal", and "t steel".
#>  * Variable 'data$ald_emissions_factor_unit': must contain only valid
#>  * emissions factor units, but has additional elements "tCO2/dwt km",
#>  * "tCO2/km", "tCO2/pkm", "tCO2/tkm", "tCO2e/GJ", "tCO2e/MWh", "tCO2e/t
#>  * cement", "tCO2e/t coal", and "tCO2e/t steel".
#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

This part is pretty important. Could be coming from two places, either:

I guess we have a decision to make regarding:

Of the two, I guess I would probably prefer the former, which will require a PR to: https://github.com/RMI-PACTA/pacta.scenario.preparation But happy to do either, thoughts @cjyetman @AlexAxthelm ?

Hopefully, we just need to adjust the strings themselves, and don't need to change the actual data/ assumptions at all.

jdhoffa commented 8 months ago

This seems to just be an issue with HDV, which I don't really use at all (it's still not supported in r2dii.*) so I don't have any useful feedback there.

What technology name were you using for HDV here?

#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.
cjyetman commented 6 months ago

relevant fixes in pacta.data.validation:

relevant fix in pacta.data.preparation

validation of financial_data has been removed from this PR, and future intended implementation is tracked here