RMI-PACTA / workflow.data.preparation

The goal of `workflow.data.preparation` is to prepare all of the necessary data inputs for the Transition Monitor web application.
Other
2 stars 0 forks source link

review invalid ISINs in FactSet data #196

Closed cjyetman closed 6 months ago

cjyetman commented 7 months ago
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

AFAIK, those are valid ISINs, they are just ISINs of private securities (venture capital, and things like that). Had a look at this when I was looking into Private equity, it's sorta cool that we have them actually! I would loosen the restrictions on the expectation there personally (but maybe ask Nick for a second opinion?)

But in any case, TM doesn't publicize that it can cover private equity, so I think it's totally fine to just filter them out.

Originally posted by @jdhoffa in https://github.com/RMI-PACTA/workflow.data.preparation/issues/185#issuecomment-1983585121

root_dir <- "~/data/workflow-data-preparation-outputs/2023Q4_20240303T082642Z"

pacta.data.validation::validate_financial_data(
  readRDS(file.path(root_dir, "financial_data.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

pacta.data.validation::validate_abcd_flags_equity(
  readRDS(file.path(root_dir, "abcd_flags_equity.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

AB#10378

cjyetman commented 6 months ago

Because of pacta.portfolio.audit:::overall_validity_flag(), any ISIN/holding like this will ultimately be flagged as "Invalid or missing ISIN" when the "audit" process is run, e.g. workflow.transition.monitor/web_tool_script_1.R

That being said, I exported a sample/test portfolio using all of these invalid ISINs and ran it through web_tool_script_1.R, and it is able to match some financial and entity information to some of them (all are "Bonds", "Others", or "Unclassifiable"), and some of those even have matching assets in the ABCD.

My current inclination is to not change the definition of pacta.data.validation::validate_financial_data() and/or pacta.data.validation::validate_abcd_flags_equity() at the moment, and to not implement those validations here in workflow.data.preparation for the moment. That would better allow the possibility of exploring what usefulness we could get out of matching some financial and entity info to holdings of these ISINs in the future (if we remove these from our dataset now, we're more likely to forget that they are available and never explore it).

There are 2513 of these invalid ISINs in the current 23Q4 FactSet data that we pull. These don't cause a problem with anything, so it doesn't hurt to leave them in. But equally so, I don't believe there's a significant advantage to prioritizing an exploration of what we can do with them.

@jdhoffa thoughts?

jdhoffa commented 6 months ago

It would be unfortunate to lose validation entirely just because of these handful of ISINs though... I would vastly prefer that you:

and then implement pacta.data.validation in workflow.data.preparation

That way we will at least get a record of the presence of these ISINs in the logs every time we run data prep.

cjyetman commented 6 months ago

ok... gonna move this over as a task in pacta.data.validation then

cjyetman commented 6 months ago

closing in favor of #222