The `sector*()` functions now preserve unmatched products

Closes #733. Extends #639.

The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all, because in that case, we won't be able to find any matching scenario and hence will only have NAs for that product. -- Tilman here

@Tilmon and @AnneSchoenauer please see the reprexes and let me know if this is what you expect or what needs to change.

sector_profile*() now:

preserves unmatched products at product level.
adds the unmatched products to the value at company level.

reprex: sector_profile()

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

options(tibble.print_max = Inf, width = 500)

companies <- tribble(
    ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector, ~type,     ~sector, ~subsector,
              "a",        "a",                         "a",          "a",             "a", "ipr",     "total",   "energy",
              "a",        "a",                         "b",          "a",             "a", "ipr",     "total",   "energy",
              "a",        "a",                 "unmatched",          "a",             "a", "ipr", "unmatched",   "energy"
)

scenarios <- tribble(
    ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
    "total",   "energy", "2050",         "1", "ipr",       "a"
)

result <- sector_profile(companies, scenarios)

result |> unnest_product()
#> # A tibble: 3 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          a           a        2050  ipr   a             
#> 2 a            ipr_a_2050 high          1               a         b                          a           a        2050  ipr   a             
#> 3 a            <NA>       <NA>          <NA>            a         unmatched                  a           <NA>     <NA>  ipr   a

result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.667
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0    
#> 4 a            ipr_a_2050 <NA>          0.333
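
The company-level values above can be checked by hand. This is a minimal base-R sketch, not the package's implementation: it assumes each `value` is simply the share of the company's products that fall in a risk category, with `NA` counted as its own category. The vector `risk_category` is copied from the product-level output above.

```r
# Product-level risk categories for company "a": two matched products
# ("high") and one unmatched product (NA).
risk_category <- c("high", "high", NA)

# Categories to report, including NA for products without a result.
categories <- c("high", "medium", "low", NA)

# Share of products in each category. `%in%` treats NA as equal to NA,
# so the unmatched product is counted under the NA category.
value <- vapply(categories, function(x) mean(risk_category %in% x), numeric(1))

shares <- data.frame(risk_category = categories, value = unname(value))
round(shares$value, 3)
#> [1] 0.667 0.000 0.000 0.333
```

Under this assumption the shares always sum to 1, so the `NA` row is what keeps the unmatched product visible at company level.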

reprex: sector_profile_upstream()

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

options(tibble.print_max = Inf, width = 500)

companies <- tibble::tribble(
    ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector,
    "a",        "a",                         "a",          "a",
    "a",        "a",                         "b",          "a",
    "a",        "a",                 "unmatched",          "a"
)

scenarios <- tribble(
    ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
    "total",   "energy", "2050",         "1", "ipr",       "a"
)

inputs <- tribble(
    ~sector, ~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint, ~type, ~subsector,
    "total",                         "a",                               "a",                "a",                   "a",         "a",           "'1234'",                    1, "ipr",   "energy",
    "total",                         "b",                               "a",                "a",                   "a",         "a",           "'1234'",                    1, "ipr",   "energy"
)

result <- sector_profile_upstream(companies, scenarios, inputs)

result |> unnest_product()
#> # A tibble: 3 × 13
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  input_activity_uuid_product_uuid input_tilt_sector input_tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>                            <chr>             <chr>               
#> 1 a            ipr_a_2050 high          1               a         a                          a           a        2050  ipr   a                                a                 a                   
#> 2 a            ipr_a_2050 high          1               a         b                          a           a        2050  ipr   a                                a                 a                   
#> 3 a            <NA>       <NA>          <NA>            a         unmatched                  a           <NA>     <NA>  <NA>  <NA>                             <NA>              <NA>

result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.667
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0    
#> 4 a            ipr_a_2050 <NA>          0.333

TODO

[x] Link related issue/PR.
[x] Describe the goal of the PR. Avoid details that are clear in the diff.
[x] Mark the PR as draft.
[x] Include a unit test.
[x] Review your own PR in "Files changed".
[x] Ensure the PR branch is updated.
[ ] Ensure the checks pass.
[ ] Change the status from draft to ready.
[x] Polish the PR title and description.
[x] Assign a reviewer.

EXCEPTIONS

[ ] Slide here any item that you intentionally choose to not do.

Thanks, Mauro.

as01. @Tilmon, quick question: if a product can only be matched to IEA but not to IPR, do we preserve it as well? If not, shouldn't it be handled here too?

as01. @AnneSchoenauer, here I adapted the reprex to show the case where the unmatched product results from a mismatch in the scenario type. I hope this helps make the conversation with Tilman concrete.

reprex

Note that the companies dataset has a product with activity_uuid_product_uuid = "a" for both type = "ipr" and type = "iea". However, the scenarios dataset covers that combination of sector and subsector only for type = "iea", so the product matches only there and the row with type = "ipr" has no match.

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

options(tibble.print_max = Inf, width = 500)

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector, ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",   "energy",
            "a",        "a",                         "a",          "a",             "a",       "iea",     "total",   "energy",
)

scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "iea",       "a"
)

result <- sector_profile(companies, scenarios)

result |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            <NA>       <NA>          <NA>            a         a                          a           <NA>     <NA>  ipr   a             
#> 2 a            iea_a_2050 high          1               a         a                          a           a        2050  iea   a

result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            iea_a_2050 high            0.5
#> 2 a            iea_a_2050 medium          0  
#> 3 a            iea_a_2050 low             0  
#> 4 a            iea_a_2050 <NA>            0.5

ml01. Note the output at company level is similar to the new output of emissions*() but it seems incorrect. @AnneSchoenauer and @Tilmon could you "draw" the ideal output for this particular case?

# ml01.1. Bad?
result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            iea_a_2050 high            0.5
#> 2 a            iea_a_2050 medium          0  
#> 3 a            iea_a_2050 low             0  
#> 4 a            iea_a_2050 <NA>            0.5  # <- this seems wrong because the `NA` comes not from "iea" but from "ipr".

# ml01.2. Better?
result |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            iea_a_2050 high            0.5
#> 2 a            iea_a_2050 medium          0  
#> 3 a            iea_a_2050 low             0  
#> 4 a            <NA>       <NA>            0.5  # <- this seems better (similar to Tilman's idea of the new `NA` (or `no_match`) benchmark)

Dear @maurolepore ,

thanks for providing these insightful reprexes.

ml01. Note the output at company level is similar to the new output of emissions*() but it seems incorrect. @AnneSchoenauer and @Tilmon could you "draw" the ideal output for this particular case?

Actually, I think it's fine. Or rather, it is what we need and want. It may seem a bit odd because your reprex uses only one scenario instead of both. When using both scenarios, one will see the 0.5 NAs in both grouped_by (or, in the real data, in all 4 grouped_by: IPR 2030, IPR 2050, WEO 2030, WEO 2050), in the same way as in the emission_profile with its 6 grouped_by.

I created a slightly extended and more realistic sample dataset in this Google Sheet (my reprex skills have not changed since last week, so this is the only way for me to share tables with that level of detail, but I'm willing to learn reprexes as discussed today!). It contains three clustered values, where:

a is matched + contains a sector covered in IPR & WEO
b is unmatched to ecoinvent and sectors and hence has no results
c is matched to a sector only covered by IPR, i.e. only has IPR results

You'll see in the results that, similar to the emission_profile:

at product level, I suggest using the grouped_by NA for the product without any results
at company level, I suggest sticking to the grouped_by we already have and simply adding the risk_category NA for the products without results (either because the sector has no match in the scenario or because the product has no sector data at all).
in the end, the overlap between the NAs of all grouped_by indicates the share of products without any results because no sector data is available at all
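
The "overlap" idea can be sketched with a small made-up table. Everything here is hypothetical and only mirrors the a/b/c cases described above: the column names `product` and `benchmark`, the benchmark labels, and the risk categories are illustrative assumptions, not values from the Google Sheet or from tiltIndicator.

```r
# Hypothetical product-level results for two benchmarks:
# a: results for IPR and WEO; b: no results at all; c: only IPR results.
product_level <- data.frame(
  product = c("a", "a", "b", "b", "c", "c"),
  benchmark = c("ipr_2050", "weo_2050", "ipr_2050", "weo_2050", "ipr_2050", "weo_2050"),
  risk_category = c("high", "medium", NA, NA, "high", NA)
)
has_no_result <- is.na(product_level$risk_category)

# Share of products with NA within each benchmark.
na_share <- tapply(has_no_result, product_level$benchmark, mean)

# Products that are NA in every benchmark: the "overlap".
na_everywhere <- tapply(has_no_result, product_level$product, all)

# Share of products without any result at all.
overlap_share <- mean(na_everywhere)
```

In this sketch the NA share differs per benchmark (product c only lacks WEO results), while the overlap isolates product b, the one with no sector data at all.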

Please note:

the "missing benchmark" from the emission_profile is the equivalent to now having a clustered with a tilt_sector and tilt_subsector but without a corresponding sector or subsector for either scenario type
The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all, because in that case, we won't be able to find any matching scenario and hence will only have NAs for that product.

cc' @AnneSchoenauer

Thanks @Tilmon for your example (here) and for this explanation:

The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all.

This PR focuses on the "unmatched products" case exclusively. I took the data from your spreadsheet and picked only the relevant rows (note that tibble::tribble() helps create and share data in a spreadsheet-like format).

The output is a little different, but it seems to make sense considering that the input data excludes the rows that belong to the "missing benchmark" case (#739). Just in case, please confirm.

reprex

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9210'

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched"
)

scenarios <- tribble(
  ~sector, ~subsector, ~year, ~reductions, ~type, ~scenario,
  "total",   "energy",  2050,           1, "ipr",       "a",
  "total",   "energy",  2050,         0.6, "weo",       "a"
)

result <- sector_profile(companies, scenarios)

result |> unnest_product()
#> # A tibble: 3 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 high                      1   a        
#> 2 a            weo_a_2050 medium                    0.6 a        
#> 3 a            <NA>       <NA>                     NA   b        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

result |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high            0.5
#> 2 a            ipr_a_2050 medium          0  
#> 3 a            ipr_a_2050 low             0  
#> 4 a            ipr_a_2050 <NA>            0.5
#> 5 a            weo_a_2050 high            0  
#> 6 a            weo_a_2050 medium          0.5
#> 7 a            weo_a_2050 low             0  
#> 8 a            weo_a_2050 <NA>            0.5

Since you consider this "what we need and want", I'll polish this PR and then extend it in #739 to include the "missing benchmark" case. Once that case is done, I'll be able to use your full example and should get the same result.

The "unmatched product" from the emission_profile is the equivalent to NOT having a tilt_sector and tilt_subsector at all, because in that case, we won't be able to find any matching scenario and hence will only have NAs for that product. --@Tilmon

@Tilmon, FYI I just noticed that this conceptual truth does not always hold.

The reprex below shows that "unmatched" values in tilt_sector and tilt_subsector alone do not yield NAs. Instead, what drives the NAs is an "unmatched" value in any of sector, subsector, or type.

reprex

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9210'
withr::local_options(list(tibble.print_max = Inf, width = 500))

# An "unmatched" value in `tilt_sector` or `tilt_subsector` does NOT yield `NA`
companies <- tribble(
  ~companies_id, ~activity_uuid_product_uuid, ~clustered, ~tilt_sector, ~sector, ~subsector, ~tilt_subsector, ~type,
  "a",                         "a",        "a",      "total", "total",   "energy",             "a", "ipr",
  "b",                         "b",        "b",  "unmatched", "total",   "energy",             "a", "ipr"
)
scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          total       a        2050  ipr   a             
#> 2 b            ipr_a_2050 high          1               b         b                          unmatched   a        2050  ipr   a

# What does yield `NA` is an "unmatched" value in `sector` or `subsector`.
companies <- tribble(
  ~companies_id, ~activity_uuid_product_uuid, ~clustered, ~tilt_sector,     ~sector, ~subsector, ~tilt_subsector, ~type,
  "a",                         "a",        "a",      "total",     "total",   "energy",             "a", "ipr",
  "b",                         "b",        "b",      "total", "unmatched",   "energy",             "a", "ipr"
)
scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          total       a        2050  ipr   a             
#> 2 b            <NA>       <NA>          <NA>            b         b                          total       <NA>     <NA>  ipr   a

# Or in `type`
companies <- tribble(
  ~companies_id, ~activity_uuid_product_uuid, ~clustered, ~tilt_sector,     ~sector, ~subsector, ~tilt_subsector,       ~type,
  "a",                         "a",        "a",      "total",     "total",   "energy",             "a",       "ipr",
  "b",                         "b",        "b",      "total",     "total",   "energy",             "a", "unmatched"
)
scenarios <- tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type      tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr>     <chr>         
#> 1 a            ipr_a_2050 high          1               a         a                          total       a        2050  ipr       a             
#> 2 b            <NA>       <NA>          <NA>            b         b                          total       <NA>     <NA>  unmatched a
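
One way to read the three results above is as a left join. The sketch below is an assumption about the behaviour, not a description of the package internals: it supposes that companies are matched to scenarios on `sector`, `subsector`, and `type` only, so `tilt_sector` never takes part in the match. The `product` column and the data are made up for illustration.

```r
# Hypothetical inputs: product "b" has an "unmatched" tilt_sector,
# product "c" has an "unmatched" sector.
companies <- data.frame(
  product = c("a", "b", "c"),
  tilt_sector = c("total", "unmatched", "total"),
  sector = c("total", "total", "unmatched"),
  subsector = "energy",
  type = "ipr"
)
scenarios <- data.frame(
  sector = "total", subsector = "energy", type = "ipr",
  year = 2050, reductions = 1
)

# Left join on the assumed matching columns; unmatched rows keep NA.
joined <- merge(companies, scenarios, by = c("sector", "subsector", "type"), all.x = TRUE)
joined <- joined[order(joined$product), ]

is.na(joined$reductions)
#> [1] FALSE FALSE  TRUE
```

Only product "c" ends up with `NA`, which matches the pattern in the reprex: the join key decides, not `tilt_sector`.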

I suspect your statement is true in real practice, likely because in the real data tilt_sector or tilt_subsector should always be "unmatched" when sector, subsector or type are unmatched. But currently the code does not know about this relationship. If it is an important one, please confirm and I'll open an issue to encode it in a warning or error.

If instead this suggests a bug let me know so we fix it.
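
If the relationship turns out to matter, the check could look roughly like this. It is a sketch only: `check_unmatched_consistency()` is a hypothetical helper, not a tiltIndicator function, and it assumes the literal string "unmatched" marks missing matches in all five columns.

```r
# Hypothetical helper: warn when `sector`, `subsector` or `type` is
# "unmatched" but neither `tilt_sector` nor `tilt_subsector` is.
check_unmatched_consistency <- function(companies) {
  is_unmatched <- function(x) x %in% "unmatched"

  scenario_side <- is_unmatched(companies$sector) |
    is_unmatched(companies$subsector) |
    is_unmatched(companies$type)
  tilt_side <- is_unmatched(companies$tilt_sector) |
    is_unmatched(companies$tilt_subsector)

  inconsistent <- scenario_side & !tilt_side
  if (any(inconsistent)) {
    warning(
      sum(inconsistent), " row(s) have an unmatched `sector`, `subsector` or ",
      "`type` but a matched `tilt_sector` and `tilt_subsector`.",
      call. = FALSE
    )
  }
  invisible(companies)
}
```

Returning the input invisibly lets the check sit at the top of a pipeline without changing the data; swapping `warning()` for `stop()` would turn it into an error.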

@maurolepore thanks, you are right!

I suspect your statement is true in real practice, likely because in the real data tilt_sector or tilt_subsector should always be "unmatched" when sector, subsector or type are unmatched. But currently the code does not know about this relationship. If it is an important one, please confirm and I'll open an issue to encode it in a warning or error.

That's also correct. We always start with a tilt_sector for each product in the data prep. Every tilt_sector leads to at least one sector (either ipr or weo or both). If we don't have a tilt_sector, we don't have a sector. But I agree that in this case the unmatched value in sector is the important relationship.

Thanks for double checking.

2DegreesInvesting / tiltIndicator

The `sector*()` functions now preserve unmatched products #738