2DegreesInvesting / coding-helpdesk

Live support by appointment
0 stars 0 forks source link

Make a reprex for r2dii.analysis/issues/255 #24

Closed maurolepore closed 2 years ago

maurolepore commented 3 years ago

“Finding your bug is a process of confirming the many things that you believe are true — until you find one which is not true.” —Norm Matlof (cited in Advanced R, Debugging techniques).

There are two parts to creating a reprex: First, you need to make your code reproducible. (...) Second, you need to make it minimal. -- https://www.tidyverse.org/help/#reprex

--

What is the issue you need help with?

Make a reprex for https://github.com/2DegreesInvesting/r2dii.analysis/issues/255.

Can you provide a reproducible example? Why not?

We plan to develop the reprex during the meeting.

Problem:

"unweighted results are coming out wrong when there are multiple loans to the same company"

Questions:

Resources

cc' @georgeharris2deg

maurolepore commented 3 years ago

This is adapted from the original reprex. The result is as expected, suggesting the issue that Jackson noticed is indeed fixed.

@georgeharris2deg,

  1. Can you please run this reprex and confirm you get the same result?
  2. Can you share a different input to expose the issue you see?
library(dplyr, warn.conflicts = FALSE)
library(r2dii.data)
packageVersion("r2dii.data")
#> [1] '0.1.6'
library(r2dii.analysis)
#  The original bug was fixed in 0.1.2
# https://2degreesinvesting.github.io/r2dii.analysis/news/index.html#r2dii-analysis-0-1-2-2020-12-05
packageVersion("r2dii.analysis")
#> [1] '0.1.3'

# Toy data

one_loan <- tibble(
  id_loan = 1,
  name_ald = "shaanxi auto",
  loan_size_outstanding = 1,
  loan_size_outstanding_currency = "EUR",
  loan_size_credit_limit = 2,
  loan_size_credit_limit_currency = "EUR",
  id_2dii = "UP1",
  level = "ultimate_parent",
  score = 1,
  sector = "automotive",
  sector_ald = "automotive"
)

# The only difference is the first column `id_loan`
two_loan <- one_loan %>%
  bind_rows(one_loan) %>%
  mutate(
    id_loan = 1:2,
    # I suspect the problem might be here
    #  (some cases may make no sense, but it seems like a good place to start)
    #       What happens if more columns are different?  # x: expected; v: not
    # name_ald = "shaanxi auto",                         # must be the same
    # loan_size_outstanding = 1:2,                       # x
    # loan_size_outstanding_currency = c("EUR", "USD"),  # x
    # loan_size_credit_limit = 2:3,                      # x
    # loan_size_credit_limit_currency = c("EUR", "USD"), # x
    # id_2dii = c("UP1", "UP2"),                         # x
    # level = c("ultimate_parent", "direct_loantaker"),  # v
    # score = 1:0.9,                                     # v
    # sector = c("automotive", "other"),                 # x
    # sector_ald = c("automotive", "other")              # v
  )

ald <- tibble(
  name_company = "shaanxi auto",
  sector = "automotive",
  technology = "ice",
  year = 2025,
  production = 1,
  emission_factor = 1,
  plant_location = "BF",
  is_ultimate_owner = TRUE
)

scenario <- tibble(
  scenario = "sds",
  sector = "automotive",
  technology = "ice",
  region = "global",
  year = 2025,
  tmsr = 0.5,
  smsp = -0.08,
  scenario_source = "demo_2020"
)

one_loan
#> # A tibble: 1 x 11
#>   id_loan name_ald loan_size_outst… loan_size_outst… loan_size_credi…
#>     <dbl> <chr>               <dbl> <chr>                       <dbl>
#> 1       1 shaanxi…                1 EUR                             2
#> # … with 6 more variables: loan_size_credit_limit_currency <chr>,
#> #   id_2dii <chr>, level <chr>, score <dbl>, sector <chr>, sector_ald <chr>

two_loan
#> # A tibble: 2 x 11
#>   id_loan name_ald loan_size_outst… loan_size_outst… loan_size_credi…
#>     <int> <chr>               <dbl> <chr>                       <dbl>
#> 1       1 shaanxi…                1 EUR                             2
#> 2       2 shaanxi…                1 EUR                             2
#> # … with 6 more variables: loan_size_credit_limit_currency <chr>,
#> #   id_2dii <chr>, level <chr>, score <dbl>, sector <chr>, sector_ald <chr>

one_result <- one_loan %>%
  target_market_share(
    ald,
    scenario,
    region_isos_demo,
    by_company = TRUE,
    weight_production = FALSE
  )

two_result <- two_loan %>%
  target_market_share(
    ald,
    scenario,
    region_isos_demo,
    by_company = TRUE,
    weight_production = FALSE
  )

# > I would expect the unweighted company-level production output to be
# identical -- @jdhoffa
# https://github.com/2DegreesInvesting/r2dii.analysis/issues/239#issuecomment-737278398

identical(one_result, two_result)
#> [1] TRUE
testthat::expect_equal(one_result, two_result)

Created on 2020-12-23 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.3 (2020-10-10) #> os Ubuntu 18.04.5 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_US:en #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/Argentina/Buenos_Aires #> date 2020-12-23 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.3) #> callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.3) #> cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.3) #> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0) #> devtools 2.3.2 2020-09-18 [1] RSPM (R 4.0.2) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3) #> dplyr * 1.0.2 2020-08-18 [1] RSPM (R 4.0.2) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.3) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0) #> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.3) #> fs 1.5.0 2020-07-31 [1] RSPM (R 4.0.2) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0) #> htmltools 0.5.0.9003 2020-12-14 [1] Github (rstudio/htmltools@d18bd8e) #> knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2) #> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0) #> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3) #> pkgbuild 1.2.0 2020-12-15 [1] RSPM (R 4.0.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3) #> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.0) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0) #> processx 3.4.5 2020-11-30 [1] RSPM (R 4.0.3) #> ps 1.5.0 2020-12-05 [1] RSPM (R 4.0.3) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0) #> r2dii.analysis * 0.1.3 2020-12-15 [1] CRAN (R 4.0.3) #> r2dii.data * 0.1.6 2020-12-05 [1] RSPM (R 4.0.3) #> R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3) #> remotes 2.2.0 2020-07-21 [1] RSPM (R 4.0.3) #> rlang 0.4.9.9000 2020-12-17 [1] Github (r-lib/rlang@bfb1d41) #> rmarkdown 2.6 2020-12-14 [1] RSPM (R 4.0.3) #> rprojroot 2.0.2 2020-11-15 [1] RSPM (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0) #> stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) #> testthat 3.0.1 2020-12-17 [1] RSPM (R 4.0.3) #> tibble 3.0.4 2020-10-12 [1] CRAN (R 4.0.3) #> tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0) #> usethis 2.0.0.9000 2020-12-11 [1] Github (r-lib/usethis@f96bf2e) #> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.3) #> vctrs 0.3.6 2020-12-17 [1] RSPM (R 4.0.3) #> withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) #> xfun 0.19 2020-10-30 [1] CRAN (R 4.0.3) #> yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0) #> #> [1] /home/mauro/R/x86_64-pc-linux-gnu-library/4.0 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
georgeharris2deg commented 3 years ago

@maurolepore as discussed earlier - here is an anonymised and striped down reprex of the error. Thanks !

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
library(r2dii.data)
#> Warning: package 'r2dii.data' was built under R version 4.0.3
library(r2dii.match)
#> Warning: package 'r2dii.match' was built under R version 4.0.3
library(r2dii.analysis)
#> Warning: package 'r2dii.analysis' was built under R version 4.0.3
packageVersion("r2dii.data")
#> [1] '0.1.6'
packageVersion("r2dii.match")
#> [1] '0.0.7'
packageVersion("r2dii.analysis")
#> [1] '0.1.3'

#loanbook with multiple loans to one company 

lbk <- tibble::tribble(
     ~id_loan, ~id_direct_loantaker, ~name_direct_loantaker, ~id_intermediate_parent_1, ~name_intermediate_parent_1, ~id_ultimate_parent, ~name_ultimate_parent, ~loan_size_outstanding, ~loan_size_outstanding_currency, ~loan_size_credit_limit, ~loan_size_credit_limit_currency, ~sector_classification_system, ~sector_classification_input_type, ~sector_classification_direct_loantaker, ~fi_type, ~flag_project_finance_loan, ~name_project, ~lei_direct_loantaker, ~isin_direct_loantaker,
         "L1",               "DL1",              "company A",                        NA,                          NA,              "UP16",             "company A",                 225626,                           "EUR",                18968806,                            "EUR",                        "NACE",                            "Code",                                    3511,   "Loan",                       "No",            NA,                    NA,                     NA,
         "L2",               "DL1",              "company A",                        NA,                          NA,              "UP16",             "company A",                   4321,                           "EUR",                   44333,                            "EUR",                        "NACE",                            "Code",                                    3511,   "Loan",                       "No",            NA,                    NA,                     NA,
     )

# ald of one company 

ald <- tibble::tribble(
    ~name_company, ~sector, ~technology, ~year, ~production, ~production_unit, ~emission_factor, ~ald_emission_factor_unit, ~plant_location, ~is_ultimate_owner, ~ald_timestamp,
        "company A", "power",   "coalcap",  2020,         50,             "MW",               NA,                        NA,            "IT",               TRUE,       "2019Q4"
    )

# load scenerio file

scenario <- r2dii.data::scenario_demo_2020
scenario <- scenario %>% mutate(scenario_source = "weo_2019")

# load region file

region <- r2dii.data::region_isos

# match ALD to lbk

match_file <- match_name(lbk,ald)

# prooritize matches 

lbk_ready <- prioritize(match_file)

# run company level results 

company_results <- target_market_share(lbk_ready, ald, scenario, region, by_company = TRUE, weight_production = FALSE)

# expected value for company level results should be the total production present in the ad for a given company 

expected <- ald$production

# the actual value is outputed in the company level results output and recorded as the metric = projected with all the respective filters  

actual <- company_results %>% filter(
  region == "global", 
  scenario_source == "weo_2019",
  technology == "coalcap",
  metric == "projected"
)

# logical test to see is expected = actual 

identical(actual$production, expected)
#> [1] FALSE

Created on 2020-12-23 by the reprex package (v0.3.0)

maurolepore commented 3 years ago

Awesome reprex @georgeharris2deg! I see the problem is exposed only when loanbook has the column loan_size_credit_limit, or loan_size_outstanding, or both.

Does this give any clue about what the solution might be?

library(dplyr, warn.conflicts = FALSE)

library(r2dii.data)
packageVersion("r2dii.data")
#> [1] '0.1.6'
library(r2dii.match)
packageVersion("r2dii.match")
#> [1] '0.0.8'
library(r2dii.analysis)
packageVersion("r2dii.analysis")
#> [1] '0.1.3'

# Helpers to run all code with different loanbooks -- used at the  --------

target_market_share_by_company_unweighted <- function(lbk, ald) {
  target_market_share(
    lbk, ald,
    scenario = mutate(scenario_demo_2020, scenario_source = "weo_2019"),
    region_isos = region_isos,
    by_company = TRUE,
    weight_production = FALSE
  )
}

pick_global_projections_for_coalcap <- function(data) {
  data %>%
    filter(
      region == "global",
      scenario_source == "weo_2019",
      technology == "coalcap",
      metric == "projected"
    )
}

testit <- function(lbk, ald) {
  lbk %>%
   match_name(ald) %>%
    prioritize() %>%
    target_market_share_by_company_unweighted(ald) %>%
    pick_global_projections_for_coalcap()
}

# Minial data -------------------------------------------------------------

lbk <- tribble(
            # bad                     # bad
  ~id_loan, ~loan_size_credit_limit, ~loan_size_outstanding, ~id_ultimate_parent, ~name_ultimate_parent, ~id_direct_loantaker, ~name_direct_loantaker, ~sector_classification_system, ~sector_classification_direct_loantaker,
      "L1",                18968806,                 225626,              "UP16",              "comp a",                "DL1",               "comp a",                        "NACE",                                    3511,
      "L2",                   44333,                   4321,              "UP16",              "comp a",                "DL1",               "comp a",                        "NACE",                                    3511
)

ald <- tribble(
  ~name_company, ~sector, ~is_ultimate_owner, ~plant_location, ~technology, ~year, ~production,
       "comp a", "power",               TRUE,            "IT",   "coalcap",  2020,          50
)

# Expose the issue --------------------------------------------------------

# The issue is exposed only when loanbook has one or both of these columns:
# * loan_size_credit_limit
# * loan_size_outstanding

good <- lbk %>% select(-loan_size_credit_limit, -loan_size_outstanding) %>% testit(ald)
identical(good$production, ald$production)
#> [1] TRUE

bad_both <- lbk %>% testit(ald)
identical(bad_both$production, ald$production)
#> [1] FALSE

bad_credit_limit <- lbk %>% select(-loan_size_credit_limit) %>% testit(ald)
identical(bad_credit_limit$production, ald$production)
#> [1] FALSE

bad_outstanding <- lbk %>% select(-loan_size_outstanding) %>% testit(ald)
identical(bad_outstanding$production, ald$production)
#> [1] FALSE

# Does this give any clue about what the solution might be?

Created on 2020-12-23 by the reprex package (v0.3.0)

georgeharris2deg commented 3 years ago

Hmmm not really I am afraid - not that I can think of anyway. To add context in most loan books they will have both loan_size_credit_limit and loan_size_outstanding. In the r2dii.analysis::targte_market_share function the default will be to calculate the weighting for the portfolio level results i.e. when the argument by_company = FALSE (also the default) using the loan_size_outstanding. However, if a user would like to use the loan_size_credit_limit they can set the argument "credit_limit = TRUE"

Despite this, the company level results i.e By_company =TRUE should not use either loan_size_credit_limit or the loan_size_outstanding as it should be outputting an unweighted value - i.e the expected in the reprex above.

I hope that makes sense and please let me know if not.

Thanks