RMI-PACTA / r2dii.match

Tools to Match Financial Portfolios with Climate Data
https://rmi-pacta.github.io/r2dii.match

match_name() outputs tibble of NAs #85

Closed jdhoffa closed 4 years ago

jdhoffa commented 4 years ago

@maurolepore just trying to get back into the flow of things after the break. Noticed that the high-level match_name() wrapper returns all NAs for me on a first (naive) run-through.

I know you're still rewriting the lower-level functions, so maybe this function won't be ready until that is finished, but I just wanted to bring this to your attention in case you expected it to be working currently.

``` r
library(r2dii.utils)
library(r2dii.dataraw)
library(r2dii.match)

your_loanbook <- r2dii.dataraw::loanbook_demo
your_loanbook
#> # A tibble: 320 x 19
#>    id_loan id_direct_loant… name_direct_loa… id_intermediate… name_intermedia…
#>    <chr>   <chr>            <chr>            <chr>            <chr>           
#>  1 L1      C294             Yuamen Xinneng … <NA>             <NA>            
#>  2 L2      C293             Yuamen Changyua… <NA>             <NA>            
#>  3 L3      C292             Yuama Ethanol L… IP5              Yuama Inc.      
#>  4 L4      C299             Yudaksel Holdin… <NA>             <NA>            
#>  5 L5      C305             Yukon Energy Co… <NA>             <NA>            
#>  6 L6      C304             Yukon Developme… <NA>             <NA>            
#>  7 L7      C227             Yaugoa-Zapadnay… <NA>             <NA>            
#>  8 L8      C303             Yueyang City Co… <NA>             <NA>            
#>  9 L9      C301             Yuedxiu Corp One IP10             Yuedxiu Group   
#> 10 L10     C302             Yuexi County AA… <NA>             <NA>            
#> # … with 310 more rows, and 14 more variables: id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>

your_ald <- r2dii.dataraw::ald_demo
your_ald
#> # A tibble: 17,368 x 13
#>    name_company sector technology production_unit  year production
#>    <chr>        <chr>  <chr>      <chr>           <dbl>      <dbl>
#>  1 aba hydropo… power  hydrocap   MW               2013    133340.
#>  2 aba hydropo… power  hydrocap   MW               2014    131582.
#>  3 aba hydropo… power  hydrocap   MW               2015    129824.
#>  4 aba hydropo… power  hydrocap   MW               2016    128065.
#>  5 aba hydropo… power  hydrocap   MW               2017    126307.
#>  6 aba hydropo… power  hydrocap   MW               2018    124549.
#>  7 aba hydropo… power  hydrocap   MW               2019    122790.
#>  8 aba hydropo… power  hydrocap   MW               2020    121032.
#>  9 aba hydropo… power  hydrocap   MW               2021    119274.
#> 10 aba hydropo… power  hydrocap   MW               2022    117515.
#> # … with 17,358 more rows, and 7 more variables: emission_factor <dbl>,
#> #   country_of_domicile <chr>, plant_location <chr>, number_of_assets <dbl>,
#> #   is_ultimate_owner <lgl>, is_ultimate_listed_owner <lgl>,
#> #   ald_timestamp <chr>

match_name(your_loanbook, your_ald)
#> # A tibble: 1,350 x 26
#>    id_loan id_direct_loant… id_intermediate… id_ultimate_par… loan_size_outst…
#>    <chr>   <chr>            <chr>            <chr>                       <dbl>
#>  1 <NA>    <NA>             <NA>             <NA>                           NA
#>  2 <NA>    <NA>             <NA>             <NA>                           NA
#>  3 <NA>    <NA>             <NA>             <NA>                           NA
#>  4 <NA>    <NA>             <NA>             <NA>                           NA
#>  5 <NA>    <NA>             <NA>             <NA>                           NA
#>  6 <NA>    <NA>             <NA>             <NA>                           NA
#>  7 <NA>    <NA>             <NA>             <NA>                           NA
#>  8 <NA>    <NA>             <NA>             <NA>                           NA
#>  9 <NA>    <NA>             <NA>             <NA>                           NA
#> 10 <NA>    <NA>             <NA>             <NA>                           NA
#> # … with 1,340 more rows, and 21 more variables:
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id <chr>,
#> #   level <chr>, sector <chr>, sector_ald <chr>, name <chr>, name_ald <chr>,
#> #   alias <chr>, alias_ald <chr>, score <dbl>, source <chr>
```

Created on 2020-01-08 by the reprex package (v0.3.0)

maurolepore commented 4 years ago

Good point @jdhoffa,

The output has non-NA values towards the right of the tibble, but those values belong to new columns and all old columns coming from the input loanbook are full of NA.

I'll identify where exactly the loanbook columns are joined back, but I suspect the ultimate fix will need your help. I wrote code around your work, mostly without stopping to reflect on whether the overall process is the best way to achieve our goal -- and I suspect we are moving around and renaming columns more than strictly necessary.

I think one way to go about cleaning up our mess is to meet live, fire up the debugger and step into each function together, trying to explain to each other what we are doing and why. The goal would be not to fix stuff on the fly but to create lots of tiny actionable issues, assign them to one of us, then work on them independently.

What do you think?

suppressPackageStartupMessages(
  library(dplyr)
)
library(r2dii.dataraw)
#> Loading required package: r2dii.utils
library(r2dii.match)

match_name(loanbook_demo, ald_demo) %>% 
  select_if(.predicate = ~ !all(is.na(.x)))
#> # A tibble: 1,350 x 10
#>    id    level   sector  sector_ald name   name_ald alias alias_ald score source
#>    <chr> <chr>   <chr>   <chr>      <chr>  <chr>    <chr> <chr>     <dbl> <chr> 
#>  1 UP23  ultima… automo… automotive Aston… aston m… asto… astonmar…     1 loanb…
#>  2 UP23  direct… automo… automotive <NA>   aston m… asto… astonmar…     1 loanb…
#>  3 UP23  interm… automo… automotive <NA>   aston m… asto… astonmar…     1 loanb…
#>  4 UP25  ultima… automo… automotive Avtoz… avtozaz  avto… avtozaz       1 loanb…
#>  5 UP25  direct… automo… automotive <NA>   avtozaz  avto… avtozaz       1 loanb…
#>  6 UP25  interm… automo… automotive <NA>   avtozaz  avto… avtozaz       1 loanb…
#>  7 UP36  ultima… automo… automotive Bogdan bogdan   bogd… bogdan        1 loanb…
#>  8 UP36  direct… automo… automotive <NA>   bogdan   bogd… bogdan        1 loanb…
#>  9 UP36  interm… automo… automotive <NA>   bogdan   bogd… bogdan        1 loanb…
#> 10 UP52  ultima… automo… automotive Ch Au… ch auto  chau… chauto        1 loanb…
#> # … with 1,340 more rows

# Same call without "!": keep only the columns that are entirely NA
match_name(loanbook_demo, ald_demo) %>% 
  select_if(.predicate = ~ all(is.na(.x)))
#> # A tibble: 1,350 x 16
#>    id_loan id_direct_loant… id_intermediate… id_ultimate_par… loan_size_outst…
#>    <chr>   <chr>            <chr>            <chr>                       <dbl>
#>  1 <NA>    <NA>             <NA>             <NA>                           NA
#>  2 <NA>    <NA>             <NA>             <NA>                           NA
#>  3 <NA>    <NA>             <NA>             <NA>                           NA
#>  4 <NA>    <NA>             <NA>             <NA>                           NA
#>  5 <NA>    <NA>             <NA>             <NA>                           NA
#>  6 <NA>    <NA>             <NA>             <NA>                           NA
#>  7 <NA>    <NA>             <NA>             <NA>                           NA
#>  8 <NA>    <NA>             <NA>             <NA>                           NA
#>  9 <NA>    <NA>             <NA>             <NA>                           NA
#> 10 <NA>    <NA>             <NA>             <NA>                           NA
#> # … with 1,340 more rows, and 11 more variables:
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>

Created on 2020-01-08 by the reprex package (v0.3.0.9001)
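For readers following along, here is a minimal base R sketch of one way a join-back step can leave every joined column full of NA. It is an illustration only, not a diagnosis of the actual bug: the toy tables, the lowercased keys and the column choices are all hypothetical and are not taken from r2dii.match.

```r
# Hypothetical data: neither table nor the key mismatch comes from r2dii.match.
loanbook <- data.frame(
  id_ultimate_parent = c("UP23", "UP25"),
  id_loan = c("L1", "L2"),
  stringsAsFactors = FALSE
)
matched <- data.frame(
  id = c("up23", "up25"), # assumed: keys were altered (lowercased) along the way
  score = c(1, 1),
  stringsAsFactors = FALSE
)

# Keep every matched row and pull the loanbook columns back in by key.
# No key lines up, so the loanbook columns come back as NA.
joined <- merge(
  matched, loanbook,
  by.x = "id", by.y = "id_ultimate_parent",
  all.x = TRUE
)

# Normalising the key on both sides restores the loanbook values.
loanbook$key <- tolower(loanbook$id_ultimate_parent)
fixed <- merge(matched, loanbook, by.x = "id", by.y = "key", all.x = TRUE)
```

In this sketch `joined$id_loan` is entirely NA while `fixed$id_loan` holds the original loan ids, which mirrors the pattern above: new columns filled, loanbook columns empty.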

jdhoffa commented 4 years ago

Yup, I'm happy to do this. I'll be working a little late today, so we could even do this today if you had time?

maurolepore commented 4 years ago

Adding to the weirdness ... this is one slice that exposes the bug:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(r2dii.dataraw)
#> Loading required package: r2dii.utils
library(r2dii.match)

slice(loanbook_demo, 4:5) %>% 
  match_name(ald_demo) %>% 
  select_if(~ all(is.na(.x)))
#> # A tibble: 6 x 16
#>   id_loan id_direct_loant… id_intermediate… id_ultimate_par… loan_size_outst…
#>   <chr>   <chr>            <chr>            <chr>                       <dbl>
#> 1 <NA>    <NA>             <NA>             <NA>                           NA
#> 2 <NA>    <NA>             <NA>             <NA>                           NA
#> 3 <NA>    <NA>             <NA>             <NA>                           NA
#> 4 <NA>    <NA>             <NA>             <NA>                           NA
#> 5 <NA>    <NA>             <NA>             <NA>                           NA
#> 6 <NA>    <NA>             <NA>             <NA>                           NA
#> # … with 11 more variables: loan_size_outstanding_currency <chr>,
#> #   loan_size_credit_limit <dbl>, loan_size_credit_limit_currency <chr>,
#> #   sector_classification_system <chr>, sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>

Created on 2020-01-08 by the reprex package (v0.3.0.9001)
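A compact way to see which columns are affected, without printing the whole tibble, is to compute the share of NA per column. The sketch below uses base R only and a made-up stand-in for the `match_name()` output, so the values are assumptions for illustration.

```r
# Toy stand-in for the match_name() output; the values are made up.
out <- data.frame(
  id_loan = c(NA_character_, NA_character_, NA_character_),
  loan_size_outstanding = c(NA_real_, NA_real_, NA_real_),
  id = c("UP23", "UP23", "UP25"),
  score = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Share of missing values in each column (1 means the column is entirely NA).
na_share <- colMeans(is.na(out))

# Names of the columns that are entirely NA.
all_na_cols <- names(na_share)[na_share == 1]
all_na_cols
```

Applied to the real output, the same two lines should list the loanbook columns, assuming the result behaves like an ordinary data frame.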

maurolepore commented 4 years ago

Closed along with #89