RMI-PACTA / workflow.factset


feat: filter financial data rows with insufficient/non-useful data #66

Open jdhoffa opened 7 months ago

jdhoffa commented 7 months ago

Migrated from https://github.com/RMI-PACTA/pacta.data.preparation/issues/270

cc @cjyetman @AlexAxthelm

https://github.com/RMI-PACTA/pacta.data.preparation/blob/831c9b960c8be8e27eeca53f6db489f000603268/R/prepare_financial_data.R#L23-L28

Currently, prepare_financial_data() filters out some rows that have insufficient data to be useful, but some rows that may not be useful still make it through. For instance, there are currently some Equity rows that have adj_price == 0 or adj_shares_outstanding == 0, but not both. The share ownership weight is calculated as number_of_shares / shares_outstanding_all_classes, and we need the share price to get number_of_shares from market_value, so both adj_price and adj_shares_outstanding need to be non-NA, legitimate values for a row of data to be useful. Whether adj_price == 0 or adj_shares_outstanding == 0 are "legitimate" values is not currently known. We could either verify with FactSet whether these values are legitimate, or assume they are not and remove those rows (though that may be a rabbit hole we want to avoid).
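
To make the dependency concrete, here is a minimal sketch of the stricter filter. It is written in Python for illustration only; the real code is R, and this is not the existing prepare_financial_data() logic. The helper names `is_usable` and `ownership_weight` and the sample rows are hypothetical. It also assumes that zeros are treated as unusable (which is exactly the open question above) and that adj_shares_outstanding is the value feeding shares_outstanding_all_classes.

```python
import math


def is_usable(value):
    """True if value is a real, strictly positive number (not None, NaN, or 0)."""
    return value is not None and not math.isnan(value) and value > 0


def ownership_weight(market_value, adj_price, shares_outstanding_all_classes):
    """Return number_of_shares / shares_outstanding_all_classes.

    Returns None when either input makes the weight impossible to compute.
    """
    if not (is_usable(adj_price) and is_usable(shares_outstanding_all_classes)):
        return None
    # number_of_shares is derived from market_value via the share price
    number_of_shares = market_value / adj_price
    return number_of_shares / shares_outstanding_all_classes


# Hypothetical Equity rows covering the cases described above
rows = [
    {"isin": "A", "adj_price": 10.0, "adj_shares_outstanding": 1000.0},
    {"isin": "B", "adj_price": 0.0, "adj_shares_outstanding": 1000.0},
    {"isin": "C", "adj_price": 10.0, "adj_shares_outstanding": 0.0},
    {"isin": "D", "adj_price": None, "adj_shares_outstanding": 1000.0},
]

# Keep only rows where both values are usable
kept = [
    r for r in rows
    if is_usable(r["adj_price"]) and is_usable(r["adj_shares_outstanding"])
]
```

With these sample rows only row "A" survives. A zero price would otherwise cause a division by zero when deriving number_of_shares, and zero shares outstanding would do the same for the weight itself.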

This is the current distribution of the problem...

[Image: Screenshot 2023-03-15 at 17 02 43]

jdhoffa commented 7 months ago

Context from @cjyetman: This possibly does not need any action, but it would be good to verify what is currently happening and whether an improvement can be made, e.g. by figuring out what data is available in FactSet and whether this is something we can/should watch out for. If more rows could be removed, I believe that is ideally done in the FactSet extraction code, rather than later in the data.prep process.
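
One way to "verify what is currently happening" is to tally how often each combination of NA, zero, and non-zero values occurs in the two columns. The sketch below is Python for illustration only (the real code is R). The `classify` and `tally` helpers, the category labels, and the sample rows are all hypothetical and are not taken from the FactSet extraction code.

```python
from collections import Counter


def classify(value):
    """Label a value as 'NA', 'zero', or 'ok' (any other number, including negatives)."""
    if value is None or value != value:  # value != value is True only for NaN
        return "NA"
    return "zero" if value == 0 else "ok"


def tally(rows):
    """Count rows by (adj_price status, adj_shares_outstanding status)."""
    return Counter(
        (classify(r["adj_price"]), classify(r["adj_shares_outstanding"]))
        for r in rows
    )


# Hypothetical sample data, not real FactSet output
sample = [
    {"adj_price": 10.0, "adj_shares_outstanding": 1000.0},
    {"adj_price": 0.0, "adj_shares_outstanding": 1000.0},
    {"adj_price": 0.0, "adj_shares_outstanding": 1000.0},
    {"adj_price": 10.0, "adj_shares_outstanding": 0.0},
    {"adj_price": None, "adj_shares_outstanding": float("nan")},
]

counts = tally(sample)
for pattern, n in sorted(counts.items()):
    print(pattern, n)
```

Running the same kind of tally at extraction time, rather than later in data.prep, would show whether the "zero but not both" patterns are frequent enough to be worth raising with FactSet.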