feat: `match_name` should aggregate across all similar loans prior to outputting results

jdhoffa commented 3 years ago

The reprex below has two almost identical loans that differ only in `id_loan`. The output of `match_name` repeats the same match once for each distinct `id_loan`.

I'm not sure if there is an internal reason that we decided to do this, but if possible, it would be easier for the user to manually validate this output only once.

library(r2dii.match)

lbk <- tibble::tribble(
  ~sector_classification_system, ~id_ultimate_parent,             ~name_ultimate_parent, ~id_direct_loantaker,                ~name_direct_loantaker, ~sector_classification_direct_loantaker, ~id_loan,
  "NACE",              "UP15", "Alpine Knits India Pvt. Limited",               "C294", "Yuamen Xinneng Thermal Power Co Ltd",                                    3511,     "L1",
  "NACE",              "UP15", "Alpine Knits India Pvt. Limited",               "C294", "Yuamen Xinneng Thermal Power Co Ltd",                                    3511,     "L2"
)

ald <- tibble::tribble(
  ~name_company, ~sector,                ~alias_ald,
  "alpine knits india pvt. limited", "power", "alpineknitsindiapvt ltd"
)

match_name(lbk, ald) %>% 
  dplyr::select(id_loan, name, sector, name_ald, sector_ald, score, level) %>% 
  prioritize()
#> # A tibble: 2 x 7
#>   id_loan name              sector name_ald           sector_ald score level    
#>   <chr>   <chr>             <chr>  <chr>              <chr>      <dbl> <chr>    
#> 1 L1      Alpine Knits Ind… power  alpine knits indi… power          1 ultimate…
#> 2 L2      Alpine Knits Ind… power  alpine knits indi… power          1 ultimate…

^{Created on 2020-12-01 by the reprex package (v0.3.0)}

AB#10177

jdhoffa commented 3 years ago

Thanks @georgeharris2deg

maurolepore commented 3 years ago

> I'm not sure if there is an internal reason that we decided to do this, but if possible, it would be easier for the user to manually validate this output only once.

This output is explained by our picking rows that are distinct only in `id_loan`. We could probably detect the similarity in the other columns. The decision depends on how much of a problem this is and whether it is worth adding the complexity to the code.
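
For illustration only, here is a minimal sketch of what "detecting the similarity in other columns" could look like on the loanbook side. It is not how `match_name` works internally. It assumes dplyr >= 1.0.0 is installed, uses a trimmed-down version of the loanbook from the reprex, and the name `collapsed` is hypothetical.

```r
library(dplyr)

# Trimmed-down loanbook from the reprex: two loans that differ only in `id_loan`.
lbk <- tibble::tribble(
  ~id_ultimate_parent, ~name_ultimate_parent,             ~id_direct_loantaker, ~name_direct_loantaker,                ~id_loan,
  "UP15",              "Alpine Knits India Pvt. Limited", "C294",               "Yuamen Xinneng Thermal Power Co Ltd", "L1",
  "UP15",              "Alpine Knits India Pvt. Limited", "C294",               "Yuamen Xinneng Thermal Power Co Ltd", "L2"
)

# Group by every column except `id_loan`, then keep the loan ids of each
# group in a list-column so that no id is lost when the rows are collapsed.
collapsed <- lbk %>%
  group_by(across(-id_loan)) %>%
  summarise(id_loan = list(id_loan), .groups = "drop")

nrow(collapsed)
collapsed$id_loan[[1]]
```

With this shape the user would validate one row per group of similar loans, and the list-column could later be expanded back to one row per loan, for example with `tidyr::unnest()`.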

jdhoffa commented 5 months ago

Update: a recent check shows that this is still the case:

library(r2dii.match)

lbk <- tibble::tribble(
  ~sector_classification_system, ~id_ultimate_parent,             ~name_ultimate_parent, ~id_direct_loantaker,                ~name_direct_loantaker, ~sector_classification_direct_loantaker, ~id_loan,
  "NACE",              "UP15", "Alpine Knits India Pvt. Limited",               "C294", "Yuamen Xinneng Thermal Power Co Ltd",                                    "D35.1",     "L1",
  "NACE",              "UP15", "Alpine Knits India Pvt. Limited",               "C294", "Yuamen Xinneng Thermal Power Co Ltd",                                    "D35.1",     "L2"
)

ald <- tibble::tribble(
  ~name_company, ~sector,                ~alias_ald,
  "alpine knits india pvt. limited", "power", "alpineknitsindiapvt ltd"
)

match_name(lbk, ald) %>% 
  dplyr::select(id_loan, name, sector, name_abcd, sector_abcd, score, level) %>% 
  prioritize()
#> # A tibble: 2 × 7
#>   id_loan name                          sector name_abcd sector_abcd score level
#>   <chr>   <chr>                         <chr>  <chr>     <chr>       <dbl> <chr>
#> 1 L1      Alpine Knits India Pvt. Limi… power  alpine k… power           1 ulti…
#> 2 L2      Alpine Knits India Pvt. Limi… power  alpine k… power           1 ulti…

^{Created on 2024-03-26 with reprex v2.1.0}

RMI-PACTA / r2dii.match

feat: `match_name` should aggregate across all similar loans prior to outputting results #335