fix filtering in `prepare_financial_data()`

cjyetman commented 1 month ago

The problem was that unwanted Equity rows were getting through the filters of the financial data. Equity rows that have NA or 0 for unit_share_price or current_shares_outstanding end up with a NA or 0 value for market capitalization (unit_share_price * current_shares_outstanding), which leads to an NA or Inf value for "ownership weight" when "value invested" is divided by "market_cap". Since "market cap" is necessary information for the ownership weight calculation, we should filter out rows where market capitalization cannot be properly calculated.

library(dplyr)

financial_data <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/financial_data.rds")
entity_info <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/entity_info.rds")

financial_data_no_mkt_cap <- financial_data %>%
  group_by(factset_entity_id, asset_type) %>%
  filter(
    asset_type == "Equity",
    sum(current_shares_outstanding_all_classes) == 0 | sum(unit_share_price) == 0 
  ) %>%
  left_join(entity_info, by = "factset_entity_id") %>%
  filter(!is.na(ar_company_id))

financial_data_no_mkt_cap
#> # A tibble: 23 × 15
#> # Groups:   factset_entity_id, asset_type [22]
#>    isin     unit_share_price current_shares_outst…¹ asset_type factset_entity_id
#>    <chr>               <dbl>                  <dbl> <chr>      <chr>            
#>  1 US92554…            0                 4701690000 Equity     065FWN-E         
#>  2 US02341…           38.8                        0 Equity     05DZHZ-E         
#>  3 US03420…            0                 4919209650 Equity     006JZS-E         
#>  4 US05858…            0.383                      0 Equity     05J60L-E         
#>  5 US29562…            0                  215271000 Equity     09NNL5-E         
#>  6 US39322…            0                  499989996 Equity     05LWKW-E         
#>  7 US69338…            0                43123215171 Equity     002TYS-E         
#>  8 US91349…            0                86188033465 Equity     0045KD-E         
#>  9 US24379…            0                  229374605 Equity     001B1Q-E         
#> 10 AU00000…            1.20                       0 Equity     0K17S2-E         
#> # ℹ 13 more rows
#> # ℹ abbreviated name: ¹current_shares_outstanding_all_classes
#> # ℹ 10 more variables: company_name <chr>, country_of_domicile <chr>,
#> #   bics_sector_code <chr>, bics_sector <chr>,
#> #   security_bics_subgroup_code <chr>, security_bics_subgroup <chr>,
#> #   security_mapped_sector <chr>, ar_company_id <chr>, credit_parent_id <chr>,
#> #   credit_parent_ar_company_id <chr>

These filters were originally introduced in https://github.com/RMI-PACTA/archive.pacta.data.preparation/pull/226 (sorry @jdhoffa, not trying to point fingers, just want to be super clear about the provenance). It looks like the filters added there did not have any additional intent, they simply did not work completely as expected.

I'm propsing to modify the filter with a dplyr::case_when() which is easier to interpret and maintain. To understand the logic, see the below code. The goals are:

for Bonds, keep all rows
for Other, keep all rows
for Funds, keep only rows that do not have NA for the adj_price
for Equity, keep only rows that have a positive value (non-NA, and > 0) for both adj_price and adj_shares_outstanding

The rows of Equity in the filtered result should only have one row where both adj_price and adj_shares_outstanding have a positive value.

library(dplyr)

bonds <- tibble(
  asset_type = c("Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

equity <- tibble(
  asset_type = c("Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

funds <- tibble(
  asset_type = c("Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

others <- tibble(
  asset_type = c("Others", "Others", "Others", "Others", "Others", "Others", "Others", "Others", "Others"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

test <- bind_rows(bonds, equity, funds, others)

test %>% print(n = 40)
#> # A tibble: 36 × 3
#>    asset_type adj_price adj_shares_outstanding
#>    <chr>          <dbl>                  <dbl>
#>  1 Bonds              1                      1
#>  2 Bonds              1                      0
#>  3 Bonds              1                     NA
#>  4 Bonds              0                      1
#>  5 Bonds              0                      0
#>  6 Bonds              0                     NA
#>  7 Bonds             NA                      1
#>  8 Bonds             NA                      0
#>  9 Bonds             NA                     NA
#> 10 Equity             1                      1
#> 11 Equity             1                      0
#> 12 Equity             1                     NA
#> 13 Equity             0                      1
#> 14 Equity             0                      0
#> 15 Equity             0                     NA
#> 16 Equity            NA                      1
#> 17 Equity            NA                      0
#> 18 Equity            NA                     NA
#> 19 Funds              1                      1
#> 20 Funds              1                      0
#> 21 Funds              1                     NA
#> 22 Funds              0                      1
#> 23 Funds              0                      0
#> 24 Funds              0                     NA
#> 25 Funds             NA                      1
#> 26 Funds             NA                      0
#> 27 Funds             NA                     NA
#> 28 Others             1                      1
#> 29 Others             1                      0
#> 30 Others             1                     NA
#> 31 Others             0                      1
#> 32 Others             0                      0
#> 33 Others             0                     NA
#> 34 Others            NA                      1
#> 35 Others            NA                      0
#> 36 Others            NA                     NA

test %>% 
  filter(
    case_when(
      asset_type == "Bonds" ~ TRUE,
      asset_type == "Others" ~ TRUE,
      asset_type == "Funds" ~ !is.na(.data$adj_price),
      asset_type == "Equity" ~ .data$adj_price > 0 & .data$adj_shares_outstanding > 0
    )
  ) %>% 
  print(n = 40)
#> # A tibble: 25 × 3
#>    asset_type adj_price adj_shares_outstanding
#>    <chr>          <dbl>                  <dbl>
#>  1 Bonds              1                      1
#>  2 Bonds              1                      0
#>  3 Bonds              1                     NA
#>  4 Bonds              0                      1
#>  5 Bonds              0                      0
#>  6 Bonds              0                     NA
#>  7 Bonds             NA                      1
#>  8 Bonds             NA                      0
#>  9 Bonds             NA                     NA
#> 10 Equity             1                      1
#> 11 Funds              1                      1
#> 12 Funds              1                      0
#> 13 Funds              1                     NA
#> 14 Funds              0                      1
#> 15 Funds              0                      0
#> 16 Funds              0                     NA
#> 17 Others             1                      1
#> 18 Others             1                      0
#> 19 Others             1                     NA
#> 20 Others             0                      1
#> 21 Others             0                      0
#> 22 Others             0                     NA
#> 23 Others            NA                      1
#> 24 Others            NA                      0
#> 25 Others            NA                     NA

Note that .data$adj_price > 0 & .data$adj_shares_outstanding > 0 can return NAs, but dplyr::filter()drops rows where the result isNA(versusTRUEorFALSE`).

library(dplyr)

c(1, 0, NA) > 0
#> [1]  TRUE FALSE    NA

tibble(x = LETTERS[1:3]) %>% filter(c(1, 0, NA) > 0)
#> # A tibble: 1 × 1
#>   x    
#>   <chr>
#> 1 A

Modifying @Antoine-Lalechere's example code to use the new filter (and adjusting the column names appropriately), should yield no rows...

library(dplyr)

financial_data <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/financial_data.rds")
entity_info <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/entity_info.rds")

financial_data_no_mkt_cap <- financial_data %>%
  filter(
    case_when(
      asset_type == "Bonds" ~ TRUE,
      asset_type == "Others" ~ TRUE,
      asset_type == "Funds" ~ !is.na(.data$unit_share_price),
      asset_type == "Equity" ~ .data$unit_share_price > 0 & .data$current_shares_outstanding_all_classes > 0
    )
  ) %>% 
  group_by(factset_entity_id, asset_type) %>%
  filter(
    asset_type == "Equity",
    sum(current_shares_outstanding_all_classes) == 0 | sum(unit_share_price) == 0 
  ) %>%
  left_join(entity_info, by = "factset_entity_id") %>%
  filter(!is.na(ar_company_id))

financial_data_no_mkt_cap
#> # A tibble: 0 × 15
#> # Groups:   factset_entity_id, asset_type [0]
#> # ℹ 15 variables: isin <chr>, unit_share_price <dbl>,
#> #   current_shares_outstanding_all_classes <dbl>, asset_type <chr>,
#> #   factset_entity_id <chr>, company_name <chr>, country_of_domicile <chr>,
#> #   bics_sector_code <chr>, bics_sector <chr>,
#> #   security_bics_subgroup_code <chr>, security_bics_subgroup <chr>,
#> #   security_mapped_sector <chr>, ar_company_id <chr>, credit_parent_id <chr>,
#> #   credit_parent_ar_company_id <chr>

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 18.82%. Comparing base (fa6e801) to head (d3d185a).

Files	Patch %	Lines
R/prepare_financial_data.R	0.00%	5 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #32 +/- ## ======================================= Coverage 18.82% 18.82% ======================================= Files 35 35 Lines 1126 1126 ======================================= Hits 212 212 Misses 914 914 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jdhoffa commented 1 month ago

@cjyetman thanks as always for the careful investigation here, and the comprehensive explanation in this PR. Really solid example of what a good PR should look like, I appreciate it.

Reviewing now.

RMI-PACTA / pacta.data.preparation

fix filtering in `prepare_financial_data()` #32

Codecov Report