feat: `match_name` gains `join_id` col, allowing for an initial matching override based on some unique ID column

jdhoffa commented 8 months ago

This ID column could be, for example, lei or isin.

Some open questions I have before marking this as ready for review:

[x] Currently the ID column must be identical between each dataset, but I think perhaps it might be better to allow for a named argument, with one name indicating the loanbook column name (e.g. lei_direct_loantaker), and another column indicating the abcd column name (e.g. lei)
[x] Currently the functionality only allows for a SINGLE join column. e.g. there is no way to give a handful of columns, and join by priority (ISIN -> LEI -> etc.)
[x] Definitely needs way more tests. Especially, there should be some input checks that assess that there aren't multiple joins per company. I can imagine a loanbook with multiple identical LEIs, and an abcd with several name_company values that have the same LEI. Need to consider what to do in that case

See reprex:

library(tibble)
library(r2dii.match)

loanbook <- tibble(
  sector_classification_system = "NACE",
  sector_classification_direct_loantaker = "100", # this generally shouldn't match to anything
  id_ultimate_parent = c("UP15", "UP16"),
  name_ultimate_parent = c("Foo", "Bar"),
  id_direct_loantaker = c("C294", "C295"),
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd",
  lei = c("LEI123", NA_character_)
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)

# no match found, as expected
# outputs "message" that no match was found by fuzzy matching, and warning that no match was found overall
match_name(loanbook, abcd, join_id = NULL)
#> Found no match via fuzzy matching.
#> Warning: Found no match.
#> # A tibble: 0 × 16
#> # ℹ 16 variables: sector_classification_system <chr>,
#> #   sector_classification_direct_loantaker <chr>, id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, id_direct_loantaker <chr>,
#> #   name_direct_loantaker <chr>, lei <chr>, id_2dii <lgl>, level <lgl>,
#> #   sector <lgl>, sector_abcd <lgl>, name <lgl>, name_abcd <lgl>, score <lgl>,
#> #   source <lgl>, borderline <lgl>

# no match found via fuzzy matching, as expected
# outputs "message" that no match was found by fuzzy matching, but successfully outputs the joined match
match_name(loanbook, abcd, join_id = "lei")
#> Found no match via fuzzy matching.
#> # A tibble: 1 × 10
#>   sector_classification_system sector_classification_direct…¹ id_ultimate_parent
#>   <chr>                        <chr>                          <chr>             
#> 1 NACE                         100                            UP15              
#> # ℹ abbreviated name: ¹sector_classification_direct_loantaker
#> # ℹ 7 more variables: name_ultimate_parent <chr>, id_direct_loantaker <chr>,
#> #   name_direct_loantaker <chr>, name <chr>, sector <chr>, alias <chr>,
#> #   score <dbl>

^{Created on 2024-03-07 with reprex v2.1.0}

Closes #135

jdhoffa commented 8 months ago

@jacobvjk requesting your review here for some input already

jdhoffa commented 8 months ago

and cc @cjyetman and @AlexAxthelm for visibility

cjyetman commented 8 months ago

Allowing a named character vector, e.g. join_id = c(lei_direct_loantaker = "lei"), might be an easy way to facilitate naming different columns in the loanbook and abcd datasets.

jdhoffa commented 8 months ago

Yup, that's what I was thinking too!

jdhoffa commented 8 months ago

I think may make sense to only allow a single join column for now. Adding multiple requires some join priority logic, and I'd rather not introduce that complexity yet. Could be a future feature once this one is in the wild, and proven to not be buggy

jdhoffa commented 8 months ago

So there's a bunch of failing checks that I will look into later (I have a feeling it's because of changes in r2dii.data messing up our vignettes etc.)

But the core functionality seems to be there :-)

jdhoffa commented 7 months ago

[x] Make sure this works well with prioritize()

jdhoffa commented 7 months ago

This seems to function now as expected, with dplyr::*_join-like syntax. See reprex below:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.match)

# 441 points to the company Russo, Russo e Russo Group
abcd_demo <- dplyr::mutate(
  abcd_demo, 
  lei = dplyr::case_when(
    company_id == "441" ~ "LEI123",
    TRUE ~ lei
  )
)

# C267 points to the company Russo s.r.l.
loanbook_demo <- dplyr::mutate(
  loanbook_demo, 
  lei_direct_loantaker = dplyr::case_when(
    id_direct_loantaker == "C267" ~ "LEI123",
    TRUE ~ lei_direct_loantaker
  )
)

out <- match_name(
  loanbook_demo, 
  abcd_demo, 
  join_id = c(lei_direct_loantaker = "lei")
)

prioritized <- prioritize(out)

prioritized |> 
  filter(id_direct_loantaker == "C267") |> 
  select(id_direct_loantaker, lei_direct_loantaker, level, source, score, name, name_abcd)
#> # A tibble: 1 × 7
#>   id_direct_loantaker lei_direct_loantaker level    source score name  name_abcd
#>   <chr>               <chr>                <chr>    <chr>  <dbl> <chr> <chr>    
#> 1 C267                LEI123               lei_dir… id jo…     1 Russ… Russo, R…

^{Created on 2024-03-12 with reprex v2.1.0}

jdhoffa commented 7 months ago

@cjyetman I've tagged you as a reviewer, understanding you may not have enough context to get exactly what is going on here (that's what @jacobvjk is there for), but I would appreciate a second pair of eyes on this 😊

RMI-PACTA / r2dii.match

feat: `match_name` gains `join_id` col, allowing for an initial matching override based on some unique ID column #460