RMI-PACTA / pacta.data.validation

pacta.data.validation
https://rmi-pacta.github.io/pacta.data.validation/
Other
2 stars 0 forks source link

allow technically invalid, but likely intentional ISINs with a warning #69

Closed cjyetman closed 3 months ago

cjyetman commented 4 months ago
  • These ISINs represent private equity
  • Private equity is not considered to be included in the analysis, so it is totally fine that we flag them as "Invalid ISIN", because for the purposes of CTM/ P4I they ARE invalid ISINs (even if overall they are not)
  • The likelihood that a user is inputting private equity is relatively low BUT even if they do, and they notice that we say "invalid ISIN" I highly doubt they will be surprised

It would be unfortunate to lose validation entirely just because of these handful of ISINs though... I would vastly prefer that you:

  • Update pacta.data.validation::validate_financial_data() to have the following behaviour:
  • Pass if the ISIN is formatted as you currently expect (checksum and all that)
  • Warn if an ISIN of this particular type ((A-Z){2}(numeric/ special char){10})
  • Error only if the ISIN is totally bunk (any other format)

and then implement pacta.data.validation in workflow.data.preparation

That way we will at least get a record of the presence of these ISINs in the logs every time we run data prep.

Originally posted by @jdhoffa in https://github.com/RMI-PACTA/workflow.data.preparation/issues/196#issuecomment-2063155331


applies to both:

related:

AB#10854

cjyetman commented 3 months ago

I began experimenting with adding a checksum argument to is_valid_isin() to optionally turn off the validation of the checksum while still validating the structure of the passed ISINs before realizing that the set of pseudo-ISINs that triggered this contain non-alphanumeric characters.

I could even further dilute the functionality of is_valid_isin() so that the only thing it checks is that the string is 12 characters long, but at that point I seriously question the utility of it.

@jdhoffa I'm tempted to close this... thoughts?

jdhoffa commented 3 months ago

From what I remember, this was originally opened while trying to implement the pacta.data.validation package in workflow.data.preparation: https://github.com/RMI-PACTA/workflow.data.preparation/pull/185

The issue opened here with the errenous ISINs was to track the fact that we never implemented validate_financial_data in https://github.com/RMI-PACTA/workflow.data.preparation/pull/185 (since this blocked us from doing so).

Does closing this issue effectively mean we aim to no longer implement validate_financial_data in workflow.data.preparation?

cjyetman commented 3 months ago

Does closing this issue effectively mean we aim to no longer implement validate_financial_data in workflow.data.preparation?

there are at least a few options:

  1. allow all ISINs from FactSet data and do not validate
  2. remove invalid ISINs from FactSet data and then validate
  3. allow all ISINs from FactSet and validate after watering down validation so that it only check if each ISIN has 12 characters in it (this one I think is pointless)
jdhoffa commented 3 months ago

Fair enough, then sounds like option 2 is your preferred option?

Ok to close this issue so long as there is an appropriate follow-up issue (somewhere) that tracks option 2

cjyetman commented 3 months ago

Ok to close this issue so long as there is an appropriate follow-up issue (somewhere) that tracks option 2

closing with tracking of implementing validation as-is here https://github.com/RMI-PACTA/workflow.data.preparation/issues/222