2DegreesInvesting / tiltIndicator

Indicators for the TILT project
https://2degreesinvesting.github.io/tiltIndicator/
GNU General Public License v3.0

The `sector*()` functions now preserve unmatched products #738

Closed maurolepore closed 4 months ago

maurolepore commented 4 months ago

Closes #733 Extends #639

The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all, because in that case, we won't be able to find any matching scenario and hence will only have NAs for that product. -- Tilman here

--

@Tilmon and @AnneSchoenauer please see the reprexes and let me know if this is what you expect or what needs to change.

sector_profile*() now:

reprex: sector_profile()

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

options(tibble.print_max = Inf, width = 500)

companies <- tribble(
    ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector, ~type,     ~sector, ~subsector,
              "a",        "a",                         "a",          "a",             "a", "ipr",     "total",   "energy",
              "a",        "a",                         "b",          "a",             "a", "ipr",     "total",   "energy",
              "a",        "a",                 "unmatched",          "a",             "a", "ipr", "unmatched",   "energy"
)

scenarios <- tribble(
    ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
    "total",   "energy", "2050",         "1", "ipr",       "a"
)

result <- sector_profile(companies, scenarios)

result |> unnest_product()
#> # A tibble: 3 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          a           a        2050  ipr   a             
#> 2 a            ipr_a_2050 high          1               a         b                          a           a        2050  ipr   a             
#> 3 a            <NA>       <NA>          <NA>            a         unmatched                  a           <NA>     <NA>  ipr   a

result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.667
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0    
#> 4 a            ipr_a_2050 <NA>          0.333
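
The company-level values above are consistent with each risk category's share of the company's products, with the unmatched product counted under `NA`. The following is a minimal base-R sketch of that arithmetic. It assumes this is how the shares are derived; the actual implementation in tiltIndicator may differ.

```r
# Product-level risk categories for company "a": two matched products
# rated "high" and one unmatched product (NA).
risk_category <- c("high", "high", NA)

# Count each category, keeping NA as its own bucket.
categories <- c("high", "medium", "low")
counts <- table(factor(risk_category, levels = categories), useNA = "always")

# Share of the company's products that falls in each category:
# 2/3 high, 0 medium, 0 low, 1/3 NA.
shares <- as.vector(counts) / length(risk_category)
names(shares) <- c(categories, NA)
round(shares, 3)
```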

reprex: sector_profile_upstream()

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

options(tibble.print_max = Inf, width = 500)

companies <- tibble::tribble(
    ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector,
    "a",        "a",                         "a",          "a",
    "a",        "a",                         "b",          "a",
    "a",        "a",                 "unmatched",          "a"
)

scenarios <- tribble(
    ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
    "total",   "energy", "2050",         "1", "ipr",       "a"
)

inputs <- tribble(
    ~sector, ~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint, ~type, ~subsector,
    "total",                         "a",                               "a",                "a",                   "a",         "a",           "'1234'",                    1, "ipr",   "energy",
    "total",                         "b",                               "a",                "a",                   "a",         "a",           "'1234'",                    1, "ipr",   "energy"
)

result <- sector_profile_upstream(companies, scenarios, inputs)

result |> unnest_product()
#> # A tibble: 3 × 13
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  input_activity_uuid_product_uuid input_tilt_sector input_tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>                            <chr>             <chr>               
#> 1 a            ipr_a_2050 high          1               a         a                          a           a        2050  ipr   a                                a                 a                   
#> 2 a            ipr_a_2050 high          1               a         b                          a           a        2050  ipr   a                                a                 a                   
#> 3 a            <NA>       <NA>          <NA>            a         unmatched                  a           <NA>     <NA>  <NA>  <NA>                             <NA>              <NA>

result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.667
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0    
#> 4 a            ipr_a_2050 <NA>          0.333


TODO

EXCEPTIONS

AnneSchoenauer commented 4 months ago

Thanks, Mauro.

as01. @Tilmon quick question: if we have a product that can only be matched to IEA but not to IPR, do we preserve it as well? If not, shouldn't it be preserved here too?

maurolepore commented 4 months ago

as01. @AnneSchoenauer here I adapted the reprex to show the case where the unmatched product results from a mismatch in the type of scenario. I hope this helps make the conversation with Tilman concrete.

reprex

Note the companies dataset has a product with activity_uuid_product_uuid = "a" both for type = "ipr" and also type = "iea" but the sector and subsector in that company are such that this specific product matches the scenario dataset only where type = "iea" (it lacks type = "ipr" for that combination of sector, subsector, and year).

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

options(tibble.print_max = Inf, width = 500)

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector, ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",   "energy",
            "a",        "a",                         "a",          "a",             "a",       "iea",     "total",   "energy",
)

scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "iea",       "a"
)

result <- sector_profile(companies, scenarios)

result |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            <NA>       <NA>          <NA>            a         a                          a           <NA>     <NA>  ipr   a             
#> 2 a            iea_a_2050 high          1               a         a                          a           a        2050  iea   a

result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            iea_a_2050 high            0.5
#> 2 a            iea_a_2050 medium          0  
#> 3 a            iea_a_2050 low             0  
#> 4 a            iea_a_2050 <NA>            0.5

ml01. Note the output at company level is similar to the new output of emissions*() but it seems incorrect. @AnneSchoenauer and @Tilmon could you "draw" the ideal output for this particular case?

# ml01.1. Bad?
result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            iea_a_2050 high            0.5
#> 2 a            iea_a_2050 medium          0  
#> 3 a            iea_a_2050 low             0  
#> 4 a            iea_a_2050 <NA>            0.5  # <- this seems wrong because the `NA` comes not from "iea" but from "ipr".

# ml01.2. Better?
result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            iea_a_2050 high            0.5
#> 2 a            iea_a_2050 medium          0  
#> 3 a            iea_a_2050 low             0  
#> 4 a            <NA>       <NA>            0.5  # <- this seems better (similar to Tilman's idea of the new `NA` (or `no_match`) benchmark)

Tilmon commented 4 months ago

Dear @maurolepore ,

thanks for providing these insightful reprexes.

ml01. Note the output at company level is similar to the new output of emissions*() but it seems incorrect. @AnneSchoenauer and @Tilmon could you "draw" the ideal output for this particular case?

Actually, I think it's fine. Or rather, it is what we need and want. It may seem a bit odd because your reprex uses only one scenario instead of both. When using both scenarios, one will see the 0.5 NAs in both grouped_by (or, in the real data, in all 4 grouped_by: IPR 2030, IPR 2050, WEO 2030, WEO 2050), in the same way as in the emission_profile with its 6 grouped_by.
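
A minimal base-R sketch of that expectation follows. The data frame and the column name `product` are hypothetical, and the rule that an unmatched product counts towards the `NA` share of every benchmark is an assumption drawn from the description above, not from the package source.

```r
# Hypothetical product-level results: product "a" matches both benchmarks,
# product "unmatched" matches none, so its grouped_by is NA.
products <- data.frame(
  product = c("a", "a", "unmatched"),
  grouped_by = c("ipr_a_2050", "weo_a_2050", NA),
  risk_category = c("high", "medium", NA)
)

n_products <- length(unique(products$product))
n_unmatched <- sum(is.na(products$grouped_by))
benchmarks <- unique(products$grouped_by[!is.na(products$grouped_by)])

# Assumption: each unmatched product contributes to the NA share of every
# benchmark, so the shares within each benchmark still sum to 1.
na_share <- setNames(
  rep(n_unmatched / n_products, length(benchmarks)),
  benchmarks
)
na_share
```

With one matched and one unmatched product, each benchmark gets an `NA` share of 0.5.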

I created a slightly extended and more realistic sample dataset in this Google Sheet (my reprex skills have not changed since last week, hence this is the only way for me to share tables with that level of detail with you, but willing to learn reprexes as discussed today!) which contains three clustered where

You'll see in the results that, similar to the emission_profile:

Please note:

cc @AnneSchoenauer

maurolepore commented 4 months ago

Thanks @Tilmon for your example (here) and for this explanation:

The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all.

This PR focuses exclusively on the "unmatched products" case. I took the data from your spreadsheet and picked only the relevant rows (note tibble::tribble() helps create and share data in a spreadsheet-like format).

The output is a little different, but it seems to make sense considering the input data excluded the rows that belong to the case with a "missing benchmark" (#739). Just in case, please confirm.

reprex

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9210'

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched"
)

scenarios <- tribble(
  ~sector, ~subsector, ~year, ~reductions, ~type, ~scenario,
  "total",   "energy",  2050,           1, "ipr",       "a",
  "total",   "energy",  2050,         0.6, "weo",       "a"
)

result <- sector_profile(companies, scenarios)

result |> unnest_product()
#> # A tibble: 3 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 high                      1   a        
#> 2 a            weo_a_2050 medium                    0.6 a        
#> 3 a            <NA>       <NA>                     NA   b        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

result |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high            0.5
#> 2 a            ipr_a_2050 medium          0  
#> 3 a            ipr_a_2050 low             0  
#> 4 a            ipr_a_2050 <NA>            0.5
#> 5 a            weo_a_2050 high            0  
#> 6 a            weo_a_2050 medium          0.5
#> 7 a            weo_a_2050 low             0  
#> 8 a            weo_a_2050 <NA>            0.5

Since you consider this "is what we need and want", I'll polish this PR and then extend it in #739 to include the case with a "missing benchmark". When that case is done I'll be able to use your full example and should get the same result.

maurolepore commented 4 months ago

The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all, because in that case, we won't be able to find any matching scenario and hence will only have NAs for that product. --@Tilmon

@Tilmon, FYI I just noticed that this statement does not always hold.

The reprex below shows that "unmatched" values in tilt_sector and tilt_subsector alone do not yield NAs. Instead, what drives the NAs are "unmatched" values in any of sector, subsector, or type.
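
One way to picture why is a left join of companies onto scenarios. The sketch below uses base R only and assumes the match is keyed on `sector`, `subsector`, and `type`. That keying is an assumption inferred from the behaviour described here, not taken from the tiltIndicator source, and the column `product` is hypothetical.

```r
companies <- data.frame(
  product = c("a", "b", "c"),
  tilt_sector = c("total", "unmatched", "total"),
  sector = c("total", "total", "unmatched"),
  subsector = c("energy", "energy", "energy"),
  type = c("ipr", "ipr", "ipr")
)
scenarios <- data.frame(
  sector = "total", subsector = "energy", type = "ipr",
  year = 2050, scenario = "a"
)

# Left join: keep every company row and fill the scenario columns with NA
# when no scenario shares the same sector, subsector, and type.
joined <- merge(
  companies, scenarios,
  by = c("sector", "subsector", "type"),
  all.x = TRUE
)
joined[order(joined$product), c("product", "tilt_sector", "sector", "scenario", "year")]
```

Product "b" still finds a scenario despite its "unmatched" `tilt_sector`, because `tilt_sector` is not a join key in this sketch. Product "c" gets `NA` because its `sector` matches no scenario.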

reprex

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9210'
withr::local_options(list(tibble.print_max = Inf, width = 500))

# An "unmatched" value in `tilt_sector` or `tilt_subsector` does NOT yield `NA`
companies <- tribble(
  ~companies_id, ~activity_uuid_product_uuid, ~clustered, ~tilt_sector, ~sector, ~subsector, ~tilt_subsector, ~type,
  "a",                         "a",        "a",      "total", "total",   "energy",             "a", "ipr",
  "b",                         "b",        "b",  "unmatched", "total",   "energy",             "a", "ipr"
)
scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          total       a        2050  ipr   a             
#> 2 b            ipr_a_2050 high          1               b         b                          unmatched   a        2050  ipr   a

# What does yield `NA` is an "unmatched" value in `sector` or `subsector`.
companies <- tribble(
  ~companies_id, ~activity_uuid_product_uuid, ~clustered, ~tilt_sector,     ~sector, ~subsector, ~tilt_subsector, ~type,
  "a",                         "a",        "a",      "total",     "total",   "energy",             "a", "ipr",
  "b",                         "b",        "b",      "total", "unmatched",   "energy",             "a", "ipr"
)
scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          total       a        2050  ipr   a             
#> 2 b            <NA>       <NA>          <NA>            b         b                          total       <NA>     <NA>  ipr   a

# Or in `type`
companies <- tribble(
  ~companies_id, ~activity_uuid_product_uuid, ~clustered, ~tilt_sector,     ~sector, ~subsector, ~tilt_subsector,       ~type,
  "a",                         "a",        "a",      "total",     "total",   "energy",             "a",       "ipr",
  "b",                         "b",        "b",      "total",     "total",   "energy",             "a", "unmatched"
)
scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type      tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr>     <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          total       a        2050  ipr       a             
#> 2 b            <NA>       <NA>          <NA>            b         b                          total       <NA>     <NA>  unmatched a

I suspect your statement is true in real practice, likely because in the real data tilt_sector or tilt_subsector should always be "unmatched" when sector, subsector or type are unmatched. But currently the code does not know about this relationship. If it is an important one, please confirm and I'll open an issue to encode it in a warning or error.

If instead this suggests a bug let me know so we fix it.
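
For illustration, such a check could look roughly like the sketch below. The function `warn_if_inconsistent_unmatched()` is hypothetical and not part of tiltIndicator. It assumes the literal string "unmatched" marks unmatched values, as in the reprexes above.

```r
# Hypothetical helper, not part of tiltIndicator: warn when a row has an
# "unmatched" `sector`, `subsector`, or `type` while neither `tilt_sector`
# nor `tilt_subsector` is "unmatched".
warn_if_inconsistent_unmatched <- function(companies) {
  key_unmatched <- companies$sector == "unmatched" |
    companies$subsector == "unmatched" |
    companies$type == "unmatched"
  tilt_unmatched <- companies$tilt_sector == "unmatched" |
    companies$tilt_subsector == "unmatched"
  inconsistent <- key_unmatched & !tilt_unmatched

  if (any(inconsistent, na.rm = TRUE)) {
    warning(
      sum(inconsistent, na.rm = TRUE),
      " row(s) have an unmatched `sector`, `subsector`, or `type` ",
      "but a matched `tilt_sector` and `tilt_subsector`.",
      call. = FALSE
    )
  }
  invisible(companies)
}

companies <- data.frame(
  tilt_sector = c("total", "total"),
  tilt_subsector = c("a", "a"),
  sector = c("total", "unmatched"),
  subsector = c("energy", "energy"),
  type = c("ipr", "ipr")
)

# Capture the warning message instead of printing it.
msg <- tryCatch(
  warn_if_inconsistent_unmatched(companies),
  warning = function(w) conditionMessage(w)
)
msg
```

Whether this should be a warning or an error depends on how strictly the data preparation guarantees the relationship.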

Tilmon commented 4 months ago

@maurolepore thanks, you are right!

I suspect your statement is true in real practice, likely because in the real data tilt_sector or tilt_subsector should always be "unmatched" when sector, subsector or type are unmatched. But currently the code does not know about this relationship. If it is an important one, please confirm and I'll open an issue to encode it in a warning or error.

That's also correct. We always start with a tilt_sector for each product in the data prep. Every tilt_sector leads to at least one sector (either ipr or weo or both). If we don't have a tilt_sector, we don't have a sector. But I agree that in this case the unmatched value in sector is the important relationship.

Thanks for double checking.