`sector_profile()` now accounts for unmatched `type`, `sector` or `subsector`

Closes #638
Relates to #639
Note to self: https://github.com/2DegreesInvesting/tiltIndicator/discussions/782

Given a clustered matching one but not a second type of scenario, when the scenarios dataset has the two types, then the second type and its corresponding scenario are still present in grouped_by, and the mismatch is reflected correctly in the value.

The expected behaviour is captured in this GoogleSheet and explained in https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-1977426095 (thanks @Tilmon).

TODO

[x] Link related issue/PR.
[ ] Describe the goal of the PR. Avoid details that are clear in the diff.
[x] Mark the PR as draft.
[x] Include a unit test.
[ ] Review your own PR in "Files changed".
[ ] Ensure the PR branch is updated.
[ ] Ensure the checks pass.
[ ] Change the status from draft to ready.
[ ] Polish the PR title and description.
[ ] Assign a reviewer.

EXCEPTIONS

[ ] Slide here any item that you intentionally choose to not do.

@Tilmon (cc' @AnneSchoenauer)

I'm struggling to see the difference between the case "unmatched products" and the case "missing benchmarks".

For emissions*() I could clearly understand the difference and create different tests for the cases with "unmatched product" versus "missing benchmark". For example, to test the case "unmatched product" I could add a product in the companies dataset that did not exist in the products dataset. And to test the case "missing benchmark" I could add an NA in the isic_4digit column of the products dataset.

However, for sector*() I can't understand the difference as clearly, and I realize that my tests for both cases essentially use the same type of toy input data: An NA or "unmatched" in either the columns sector, subsector, or type of the companies dataset.

Your https://github.com/2DegreesInvesting/tiltIndicator/pull/738#issuecomment-1975997275 seems to support the idea that these two cases are no different:

I agree that in this case the unmatched value in sector is the important relationship.

The best way to clarify things is through a reproducible example that you are familiar with because it's based on the one you created in this GoogleSheet. At this point I'm so confused that I don't know if you think it shows one or both cases, but I hope it's a good start.

reprex

library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

packageVersion("tiltIndicator")
#> [1] '0.0.0.9211'

withr::local_options(list(tibble.print_max = Inf, width = 500))

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched",
            "a",        "c",                 "unmatched",          "c",             "c",       "ipr", "land use",   "land use",
            "a",        "c",                 "unmatched",          "c",             "c",       "weo",         NA,           NA
  )

scenarios <- tribble(
     ~sector,   ~subsector,  ~year, ~reductions, ~type, ~scenario,
     "total",     "energy",   2050,         1.0, "ipr",       "a",
     "total",     "energy",   2050,         0.6, "weo",       "a",
  "land use",   "land use",   2050,         0.3, "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 5 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario  year type      tilt_subsector
#>   <chr>        <chr>      <chr>                   <dbl> <chr>     <chr>                      <chr>       <chr>    <dbl> <chr>     <chr>         
#> 1 a            ipr_a_2050 high                      1   a         a                          a           a         2050 ipr       a             
#> 2 a            weo_a_2050 medium                    0.6 a         a                          a           a         2050 weo       a             
#> 3 a            <NA>       <NA>                     NA   b         unmatched                  unmatched   <NA>        NA unmatched unmatched     
#> 4 a            ipr_a_2050 low                       0.3 c         unmatched                  c           a         2050 ipr       c             
#> 5 a            <NA>       <NA>                     NA   c         unmatched                  c           <NA>        NA weo       c

sector_profile(companies, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.25 
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0.25 
#> 4 a            ipr_a_2050 <NA>          0.5  
#> 5 a            weo_a_2050 high          0    
#> 6 a            weo_a_2050 medium        0.333
#> 7 a            weo_a_2050 low           0    
#> 8 a            weo_a_2050 <NA>          0.667

Is it possible that this is already all you want? If yes, we're done and I can close this PR. If not, what do you expect?

Hi @maurolepore ,

let me respond to the points you raised below. I first start with a more detailed description of how to think about the two different cases and then explain why I think the reprex is unfortunately wrong.

Description of the "two cases"

I'm struggling to see the difference between the case "unmatched products" and the case "missing benchmarks".

I must admit that I also find the difference harder to see in the case of sector*(), but it's still there. Think about it this way: for sector*() it doesn't matter whether a clustered is matched to Ecoinvent or not. What matters is:

1. Whether the clustered has a tilt_sector. If it doesn't have a tilt_sector, the clustered will have neither a sector nor a subsector for any of the types. Hence, you can think of the tilt_sector as the activity_uuid_product_uuid of sector*(): If the tilt_sector (activity_uuid_product_uuid) is missing, you won't have any sector data (co2_footprint) that can be used to calculate the profile. Because ultimately, the scenario sectors give us the info on the reduction targets. I.e., no scenario sector = no result at all. Hence, tilt_sector == unmatched should behave as activity_uuid_product_uuid == unmatched does in emission*().
2. Whether the tilt_sector & tilt_subsector lead to a sector & subsector for either of the types ipr or weo, for both, or for none. You can think of the sector x subsector x type x year combination as the 6 benchmarks in emission*(). For each clustered with a tilt_sector, we want to show all benchmarks, i.e. all corresponding combinations of sector x subsector x type x year for the specific tilt_sector x tilt_subsector, even if some are NAs. I.e., for every clustered with a tilt_sector, we should show the benchmarks weo_2030, weo_2050, ipr_2030, ipr_2050, even if some are NA (as in your reprex, where clustered c has no corresponding weo sector), similar to an NA in isic_4digit in emission_profile().

Reprex

Your reprex shows BOTH cases.

clustered b has no tilt_sector (equivalent of no activity_uuid_product_uuid) and will hence lead to no results for the indicator (as there is no sector, equivalent to no co2_footprint). This is what I describe under 1.
clustered c has a tilt_sector (equivalent to matched product with activity_uuid_product_uuid) and hence will lead to results for the sector x subsector x type x year combination, even if there will be some NAs (equivalent to an activity_uuid_product_uuid leading to results for at least some of the 6 benchmarks).

So the reprex example is great to discuss the issue. I see two problems with the reprex you shared, one on product-level and one on company-level:

product-level: clustered c should have the grouped_by value "weo_a_2050" instead of "NA". We have a tilt_sector for that clustered and hence should show all benchmarks, even if they are NA.
company-level: The results show that for ipr_a_2050, 25% of products are in the high and 25% in the low risk category. That can't be right, because we only have 3 products. Instead, we have 1 clustered with high, 1 with low, and 1 with an NA benchmark, i.e. for the benchmark ipr_a_2050 we should have the following values: high 1/3, medium 0, low 1/3, NA 1/3.
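
To make the company-level arithmetic concrete, here is a minimal base R sketch. It assumes each clustered counts once per benchmark and is not the actual tiltIndicator implementation. The data frame `product` and the names `levels_all`, `counts` and `shares` are made up for illustration.

```r
# Product-level risk categories for the benchmark ipr_a_2050 in the toy example:
# clustered "a" is high, "b" has no sector at all (NA), "c" is low.
product <- data.frame(
  clustered = c("a", "b", "c"),
  risk_category = c("high", NA, "low")
)

# Count every risk category, including the empty "medium" level and NA.
levels_all <- c("high", "medium", "low")
counts <- table(
  factor(product$risk_category, levels = levels_all),
  useNA = "always"
)

# Divide by the number of products to get the share per risk category.
shares <- as.vector(counts) / nrow(product)
names(shares) <- c(levels_all, "NA")
print(shares)
```

With three products, each occupied category gets 1/3, which is where the expected values high 1/3, medium 0, low 1/3, NA 1/3 come from.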

I hope this helps. Let me know if it does!

P.S. I realize my comment here was not very helpful, as it's not wrong but leads to more confusion about the two different cases.

Your https://github.com/2DegreesInvesting/tiltIndicator/pull/738#issuecomment-1975997275 seems to support the idea that these two cases are no different: "But I agree that in this case the unmatched value in sector is the important relationship."

I agree that in this case the unmatched value in sector is the important relationship.

@Tilmon

RE:

product-level: clustered c should have the grouped_by value "weo_a_2050" instead of "NA".

Can you please confirm you expect 1, 2, 3, or something else?

"weo_a_2050"
"weo_NA_2050"
"weo_NA_NA"

To me "1." seems incorrect. The values of grouped_by have the format <type>_<scenario>_<year>. And a scenario and year that exist for a given type may not make sense for another type.

For example, if a clustered "d" matches the sector and subsector for the type "ipr", and the corresponding scenario is called "iprScenario2050", then I expect grouped_by == "ipr_iprScenario2050_2050", as you say. But if that same sector and subsector is unmatched for the type "weo", then expecting grouped_by == "weo_iprScenario2050_2050" seems odd.
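
As a rough illustration of the <type>_<scenario>_<year> format, the sketch below builds grouped_by with `paste()`. The helper `make_grouped_by()` is hypothetical and only stands in for whatever tiltIndicator does internally; it shows why an unmatched scenario and year would naturally surface as "NA" inside the string.

```r
# Hypothetical helper: compose grouped_by as <type>_<scenario>_<year>.
make_grouped_by <- function(type, scenario, year) {
  paste(type, scenario, year, sep = "_")
}

# A matched row carries its own scenario and year.
matched <- make_grouped_by("ipr", "iprScenario2050", 2050)

# An unmatched row has no scenario or year; paste() renders NA as "NA".
unmatched <- make_grouped_by("weo", NA, NA)

print(matched)
print(unmatched)
```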

BTW, you already know this but showing it here for the record: The new expectation conflicts with previous expectations, and changes the structure of the output. This reprex focuses on that structural change (not on the specific values, which are still work in progress):

At product level, before in grouped_by we got NA; now we get a non-NA value.
At company level, before we got 1 row; now we get 4.

reprex

# styler: off
companies <- tibble::tribble(
      ~sector, ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~subsector, ~tilt_sector, ~tilt_subsector, ~type,
  "unmatched",           "a",        "a",                         "a",   "energy",          "a",             "a", "ipr"
)
scenarios <- tibble::tribble(
  ~sector, ~subsector,  ~year, ~reductions, ~type, ~scenario,
  "total",   "energy", "2050",         "1", "ipr",       "a"
)

# styler: on

if (!interactive()) withr::local_options(width = 500)

# BEFORE
# Load code in the main branch
library(tiltIndicator)
packageVersion("tiltIndicator")
#> [1] '0.0.0.9221'

result_main <- sector_profile(companies, scenarios)

result_main |> unnest_product()
#> # A tibble: 1 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            <NA>       <NA>          <NA>            a         a                          a           <NA>     <NA>  ipr   a

result_main |> unnest_company()
#> # A tibble: 1 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            <NA>       <NA>             NA

# Compare ----------------------------------------------------------------

# NOW
# Load code in this PR
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9222'

result_pr <- sector_profile(companies, scenarios)

result_pr |> unnest_product()
#> # A tibble: 1 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario year  type  tilt_subsector
#>   <chr>        <chr>      <chr>         <chr>           <chr>     <chr>                      <chr>       <chr>    <chr> <chr> <chr>         
#> 1 a            ipr_NA_NA  <NA>          <NA>            a         a                          a           <NA>     <NA>  ipr   a

result_pr |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_NA_NA  high              0
#> 2 a            ipr_NA_NA  medium            0
#> 3 a            ipr_NA_NA  low               0
#> 4 a            ipr_NA_NA  <NA>              1

@Tilmon,

Here I reproduce the example from your GoogleSheet, first case by case then all at once. Please review and try to explain how we can change this output to meet your expectations.

Case by case the outputs seem to make sense but when taken together the output clearly has more values of grouped_by than the ones you expect. The solution may be to consolidate the rows where grouped_by is unmatched_NA_NA and weo_NA_NA. But now I need a sleeping-break so I pass the thinking ball to you :-)

reprex

devtools::load_all()
#> ℹ Loading tiltIndicator

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched",
            "a",        "c",                 "unmatched",          "c",             "c",       "ipr", "land use",   "land use",
            "a",        "c",                 "unmatched",          "c",             "c",       "weo",         NA,           NA
)

scenarios <- tribble(
     ~sector,   ~subsector,  ~year, ~reductions, ~type, ~scenario,
     "total",     "energy",   2050,         1.0, "ipr",       "a",
     "total",     "energy",   2050,         0.6, "weo",       "a",
  "land use",   "land use",   2050,         0.3, "ipr",       "a"
)

# CASE BY CASE ---------------------------------------------------------------

case_a <- filter(companies, clustered == "a")
case_a
#> # A tibble: 2 × 8
#>   companies_id clustered activity_uuid_produc…¹ tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                  <chr>       <chr>          <chr>
#> 1 a            a         a                      a           a              ipr  
#> 2 a            a         a                      a           a              weo  
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: sector <chr>, subsector <chr>

sector_profile(case_a, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 high                      1   a        
#> 2 a            weo_a_2050 medium                    0.6 a        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(case_a, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high              1
#> 2 a            ipr_a_2050 medium            0
#> 3 a            ipr_a_2050 low               0
#> 4 a            ipr_a_2050 <NA>              0
#> 5 a            weo_a_2050 high              0
#> 6 a            weo_a_2050 medium            1
#> 7 a            weo_a_2050 low               0
#> 8 a            weo_a_2050 <NA>              0

case_b <- filter(companies, clustered == "b")
case_b
#> # A tibble: 1 × 8
#>   companies_id clustered activity_uuid_produc…¹ tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                  <chr>       <chr>          <chr>
#> 1 a            b         unmatched              unmatched   unmatched      unma…
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: sector <chr>, subsector <chr>

sector_profile(case_b, scenarios) |> unnest_product()
#> # A tibble: 1 × 11
#>   companies_id grouped_by      risk_category profile_ranking clustered
#>   <chr>        <chr>           <chr>                   <dbl> <chr>    
#> 1 a            unmatched_NA_NA <NA>                       NA b        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(case_b, scenarios) |> unnest_company()
#> # A tibble: 4 × 4
#>   companies_id grouped_by      risk_category value
#>   <chr>        <chr>           <chr>         <dbl>
#> 1 a            unmatched_NA_NA high              0
#> 2 a            unmatched_NA_NA medium            0
#> 3 a            unmatched_NA_NA low               0
#> 4 a            unmatched_NA_NA <NA>              1

case_c <- filter(companies, clustered == "c")
case_c
#> # A tibble: 2 × 8
#>   companies_id clustered activity_uuid_produc…¹ tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                  <chr>       <chr>          <chr>
#> 1 a            c         unmatched              c           c              ipr  
#> 2 a            c         unmatched              c           c              weo  
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: sector <chr>, subsector <chr>

sector_profile(case_c, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 low                       0.3 c        
#> 2 a            weo_NA_NA  <NA>                     NA   c        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(case_c, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high              0
#> 2 a            ipr_a_2050 medium            0
#> 3 a            ipr_a_2050 low               1
#> 4 a            ipr_a_2050 <NA>              0
#> 5 a            weo_NA_NA  high              0
#> 6 a            weo_NA_NA  medium            0
#> 7 a            weo_NA_NA  low               0
#> 8 a            weo_NA_NA  <NA>              1

# ALL AT ONCE ----------------------------------------------------------------

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 5 × 11
#>   companies_id grouped_by      risk_category profile_ranking clustered
#>   <chr>        <chr>           <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050      high                      1   a        
#> 2 a            weo_a_2050      medium                    0.6 a        
#> 3 a            unmatched_NA_NA <NA>                     NA   b        
#> 4 a            ipr_a_2050      low                       0.3 c        
#> 5 a            weo_NA_NA       <NA>                     NA   c        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(companies, scenarios) |> unnest_company()
#> # A tibble: 16 × 4
#>    companies_id grouped_by      risk_category value
#>    <chr>        <chr>           <chr>         <dbl>
#>  1 a            ipr_a_2050      high            0.5
#>  2 a            ipr_a_2050      medium          0  
#>  3 a            ipr_a_2050      low             0.5
#>  4 a            ipr_a_2050      <NA>            0  
#>  5 a            unmatched_NA_NA high            0  
#>  6 a            unmatched_NA_NA medium          0  
#>  7 a            unmatched_NA_NA low             0  
#>  8 a            unmatched_NA_NA <NA>            1  
#>  9 a            weo_NA_NA       high            0  
#> 10 a            weo_NA_NA       medium          0  
#> 11 a            weo_NA_NA       low             0  
#> 12 a            weo_NA_NA       <NA>            1  
#> 13 a            weo_a_2050      high            0  
#> 14 a            weo_a_2050      medium          1  
#> 15 a            weo_a_2050      low             0  
#> 16 a            weo_a_2050      <NA>            0

Hi @maurolepore, thanks for your thorough explanations and for taking the time to reproduce the example from the Google Sheet case-by-case and all-at-once. That really helps me to understand the struggles in replicating it. And it also shows how incredibly complicated it is. Therefore, I'd like to ask @AnneSchoenauer to also read through these comments carefully to validate my thinking. Maybe I'm also making things too complicated and we need to find a shortcut.

Before I get into the details, one more general question. I see in the outputs the column profile_ranking. Does this column show the reductions? Then can we also call it that way? Or does it show something different I'm unaware of?

Now to your first question in https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-2110919379

Can you please confirm you expect 1, 2, 3, or something else?

"weo_a_2050" "weo_NA_2050" "weo_NA_NA"

I would still say the grouped_by should be "weo_a_2050" for clustered c. If we know the type for a clustered, then we can also assign a scenario and year, no? The scenarios table you are using in https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-2111332672 contains the type-scenario-year combination weo_a_2050. Hence, I'd say every clustered which is assigned to the type weo should be analysed with the grouped_by value that we get from the scenarios dataset for weo, in this case weo_a_2050.

Regarding https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-2111332672

Here I reproduce the example from your GoogleSheet, first case by case then all at once. Please review and try to explain how we can change this output to meet your expectations.

Case by case the outputs seem to make sense but when taken together the output clearly has more values of grouped_by than the ones you expect. The solution may be to consolidate the rows where grouped_by is unmatched_NA_NA and weo_NA_NA. But now I need a sleeping-break so I pass the thinking ball to you :-)

I indeed think that we somehow need to consolidate the grouped_by unmatched_NA_NA and weo_NA_NA. While the NA values behind grouped_by unmatched_NA_NA should be consolidated into both ipr_a_2050 and weo_a_2050, the NA values behind grouped_by weo_NA_NA should only be consolidated into weo_a_2050. The reason for that is:

the clustered b behind grouped_by unmatched_NA_NA has no results for either weo or ipr, i.e. on product level, it should have the risk_category NA for both weo_a_2050 and ipr_a_2050 instead.
the clustered c behind grouped_by weo_NA_NA has results for IPR but not for WEO. Hence, on product level it should only show NA for the grouped_by weo_a_2050 but for ipr_a_2050, it should show the actual risk_category value (as it does right now).

In essence, I believe we should in the end only have the two grouped_by values ipr_a_2050 and weo_a_2050. And every clustered where we have the type "weo" should be assigned to grouped_by weo_a_2050, while every clustered where we have the type "ipr" should be assigned to grouped_by ipr_a_2050, irrespective of whether it has a sector or subsector. If a clustered has no type at all (or, as denoted in the example, a type "unmatched"), it should be counted as NA for all type_scenario_year combinations in the scenarios dataset.
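
A minimal base R sketch of this consolidation, assuming one scenario and year per type as in the toy data. It is only an illustration of the rule, not the code in this PR; `lookup`, `known`, `unknown` and `consolidated` are made-up names.

```r
# Toy scenarios from the reprex: one scenario and year per type.
scenarios <- data.frame(
  type = c("ipr", "weo", "ipr"),
  scenario = c("a", "a", "a"),
  year = c(2050, 2050, 2050)
)

# Map each type to the grouped_by value that the scenarios dataset defines.
lookup <- unique(data.frame(
  type = scenarios$type,
  grouped_by = paste(scenarios$type, scenarios$scenario, scenarios$year, sep = "_")
))

# Product-level rows, one per clustered and type, as in the reprex.
product <- data.frame(
  clustered = c("a", "a", "b", "c", "c"),
  type = c("ipr", "weo", "unmatched", "ipr", "weo"),
  risk_category = c("high", "medium", NA, "low", NA)
)

# Rows with a known type get the grouped_by of that type, even if risk is NA.
has_type <- product$type %in% lookup$type
known <- merge(product[has_type, ], lookup, by = "type")

# Rows without a known type count as NA for every grouped_by (cross join).
unknown <- merge(
  product[!has_type, c("clustered", "risk_category")],
  lookup["grouped_by"],
  by = NULL
)

keep <- c("clustered", "grouped_by", "risk_category")
consolidated <- rbind(known[keep], unknown[keep])
consolidated <- consolidated[order(consolidated$clustered, consolidated$grouped_by), ]
print(consolidated)
```

Under these assumptions unmatched_NA_NA and weo_NA_NA disappear: clustered b contributes an NA row to both ipr_a_2050 and weo_a_2050, and clustered c contributes an NA row only to weo_a_2050.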

Does that somewhat help?

@maurolepore additional thoughts:

I was wondering how we can put my thoughts from https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-2112020721 in a clear business logic. How about the explanation below? Does that help to make things clearer? It's the same thing as above, just expressed in a different order.

The dataset scenarios determines the list of values for grouped_by that we have. And then every clustered has to be assigned to all grouped_by values. Means in the following example...

... that the only two possible grouped_by values are ipr_a_2050 and weo_a_2050. And now all clustered need to be attributed to both values. In this example...

... clustered "a" leads to risk_category low/medium/high in both grouped_by, because it has sectors for ipr & weo.
... clustered "b" leads to risk_category NA in both grouped_by, because it has no sector for either ipr or weo.
... clustered "c" leads to risk_category low/medium/high for ipr_a_2050 and to NA for weo_a_2050, because it has a sector only for ipr, not for weo.
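
The same logic, read in this "scenarios first" order, can be sketched in base R as follows. This is a toy illustration under the assumptions of the example (a single company, three clustered); `grid`, `matched`, `result` and `na_share` are hypothetical names.

```r
scenarios <- data.frame(
  type = c("ipr", "weo", "ipr"),
  scenario = c("a", "a", "a"),
  year = c(2050, 2050, 2050)
)

# Step 1: the scenarios dataset alone determines the possible grouped_by values.
grouped_by <- unique(
  paste(scenarios$type, scenarios$scenario, scenarios$year, sep = "_")
)

# Step 2: every clustered is assigned to every grouped_by value.
grid <- expand.grid(
  clustered = c("a", "b", "c"),
  grouped_by = grouped_by,
  stringsAsFactors = FALSE
)

# Step 3: attach the risk categories that could be computed; the rest stay NA.
matched <- data.frame(
  clustered = c("a", "a", "c"),
  grouped_by = c("ipr_a_2050", "weo_a_2050", "ipr_a_2050"),
  risk_category = c("high", "medium", "low")
)
result <- merge(grid, matched, by = c("clustered", "grouped_by"), all.x = TRUE)

# Share of clustered with an NA risk_category per grouped_by.
na_share <- tapply(is.na(result$risk_category), result$grouped_by, mean)
print(result)
print(na_share)
```

In this sketch the NA share is 1/3 for ipr_a_2050 (only clustered b) and 2/3 for weo_a_2050 (clustered b and c).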

Not sure if this really helps. But worth a try I hope :)

cc' @AnneSchoenauer

Dear @maurolepore I discussed the suggestions I made in the comments above with @AnneSchoenauer and she agrees that that would be the ideal way. If it's not possible, we need to explore alternatives. To put my thoughts from https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-2112327196 into concrete examples, I created a new tab in the Google Sheet (_v2) which should now reflect the "business logic". Please note that I colored all cells green that have changed - to make it easier for you to see the difference.

Thanks and please reach out if it's unclear!

Thanks @Tilmon, your comments help.

Here I'll answer this question:

I see in the outputs the column profile_ranking. Does this column show the reductions?

Yes, for the sector*() functions the column profile_ranking maps to reductions.

Then can we also call it that way?

Maybe in tiltIndicatorAfter, but tiltIndicator would not be a good place for that kind of indicator-specific change. tiltIndicator sits at the core of the system and manipulates business logic at a level that is mostly abstract and general.

The name profile_ranking may not be perfect for the sector*() functions specifically, but it seemed good in that it uses a general concept from tilt's domain-specific language that applies to all indicators. Such a general name can then be used to programmatically refer to the columns of all the outputs of all the indicators. This standardization makes the code dramatically more maintainable.

For a concrete example, note how simple the change in the related PR was. This change propagates through the function cols_at_product_level() to many parts of tiltIndicator.

Similarly we use cols_at_all_levels(), which also makes the code easier to maintain, for example by automatically updating the names of the columns in the Value section of each indicator's helpfile. Take a moment to note that those column names are NOT mentioned explicitly in the function that generates that documentation: document_default_value().
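
To illustrate the idea (not the real code), here is a small base R sketch. `shared_product_cols()` is a hypothetical stand-in for a helper like cols_at_product_level(); the actual signature and return value of that function are not shown in this thread. The two toy data frames and their extra columns are also made up.

```r
# Hypothetical stand-in for a helper such as cols_at_product_level().
# The column names are the ones visible in the product-level output above.
shared_product_cols <- function() {
  c("companies_id", "grouped_by", "risk_category", "profile_ranking")
}

# Two toy outputs from different indicators, each with one extra column.
emissions_like <- data.frame(
  companies_id = "a", grouped_by = "all", risk_category = "high",
  profile_ranking = 0.9, co2_footprint = 1
)
sector_like <- data.frame(
  companies_id = "a", grouped_by = "ipr_a_2050", risk_category = "high",
  profile_ranking = 1, scenario = "a"
)

# Code written against the shared names works for every indicator.
select_shared <- function(data) data[shared_product_cols()]
combined <- rbind(select_shared(emissions_like), select_shared(sector_like))
print(combined)
```

If the shared name ever changes, only the helper needs to change; code like `select_shared()` keeps working without edits.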

If you still think profile_ranking is not a good name, we can open a new issue and refer to this comment. But before we go through that trouble, consider that tiltIndicator does not face users. The output of tiltIndicator is consumed by tiltIndicatorAfter, and there is where you need to worry about the user-facing names.

You can visualize the aspirational architecture of our system using The Clean Architecture model. tiltIndicator aims to host the enterprise (yellow) and application (red) business rules, and tiltIndicatorAfter aims to host the "interface adaptors" (green).

The introduction of `profile_ranking`

@Tilmon

Here's today's update.

Good news

The code now yields exactly what you expect in your example googlesheet v1.

reprex

devtools::load_all()
#> ℹ Loading tiltIndicator

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched",
            "a",        "c",                 "unmatched",          "c",             "c",       "ipr", "land use",   "land use",
            "a",        "c",                 "unmatched",          "c",             "c",       "weo",         NA,           NA
)

scenarios <- tribble(
     ~sector,   ~subsector,  ~year, ~reductions, ~type, ~scenario,
     "total",     "energy",   2050,         1.0, "ipr",       "a",
     "total",     "energy",   2050,         0.6, "weo",       "a",
  "land use",   "land use",   2050,         0.3, "ipr",       "a"
)

# ALL AT ONCE ----------------------------------------------------------------

sector_profile(companies, scenarios) |> unnest_product() |> arrange(clustered)
#> # A tibble: 5 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 high                      1   a        
#> 2 a            weo_a_2050 medium                    0.6 a        
#> 3 a            <NA>       <NA>                     NA   b        
#> 4 a            ipr_a_2050 low                       0.3 c        
#> 5 a            weo_a_2050 <NA>                     NA   c        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(companies, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.333
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0.333
#> 4 a            ipr_a_2050 <NA>          0.333
#> 5 a            weo_a_2050 high          0    
#> 6 a            weo_a_2050 medium        0.333
#> 7 a            weo_a_2050 low           0    
#> 8 a            weo_a_2050 <NA>          0.667

Bad news

Although we're closer, this is not yet the end.

The same example doesn't work smoothly when each case is considered separately.

reprex

devtools::load_all()
#> ℹ Loading tiltIndicator

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched",
            "a",        "c",                 "unmatched",          "c",             "c",       "ipr", "land use",   "land use",
            "a",        "c",                 "unmatched",          "c",             "c",       "weo",         NA,           NA
)

scenarios <- tribble(
     ~sector,   ~subsector,  ~year, ~reductions, ~type, ~scenario,
     "total",     "energy",   2050,         1.0, "ipr",       "a",
     "total",     "energy",   2050,         0.6, "weo",       "a",
  "land use",   "land use",   2050,         0.3, "ipr",       "a"
)

# CASE BY CASE ---------------------------------------------------------------

case_a <- filter(companies, clustered == "a")
case_a
#> # A tibble: 2 × 8
#>   companies_id clustered activity_uuid_produc…¹ tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                  <chr>       <chr>          <chr>
#> 1 a            a         a                      a           a              ipr  
#> 2 a            a         a                      a           a              weo  
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: sector <chr>, subsector <chr>

sector_profile(case_a, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 high                      1   a        
#> 2 a            weo_a_2050 medium                    0.6 a        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(case_a, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high              1
#> 2 a            ipr_a_2050 medium            0
#> 3 a            ipr_a_2050 low               0
#> 4 a            ipr_a_2050 <NA>              0
#> 5 a            weo_a_2050 high              0
#> 6 a            weo_a_2050 medium            1
#> 7 a            weo_a_2050 low               0
#> 8 a            weo_a_2050 <NA>              0

case_b <- filter(companies, clustered == "b")
case_b
#> # A tibble: 1 × 8
#>   companies_id clustered activity_uuid_produc…¹ tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                  <chr>       <chr>          <chr>
#> 1 a            b         unmatched              unmatched   unmatched      unma…
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: sector <chr>, subsector <chr>

sector_profile(case_b, scenarios) |> unnest_product()
#> Error in `dplyr_col_modify()`:
#> ! Can't recycle `grouped_by` (size 2) to size 0.

sector_profile(case_b, scenarios) |> unnest_company()
#> Error in `dplyr_col_modify()`:
#> ! Can't recycle `grouped_by` (size 2) to size 0.

case_c <- filter(companies, clustered == "c")
case_c
#> # A tibble: 2 × 8
#>   companies_id clustered activity_uuid_produc…¹ tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                  <chr>       <chr>          <chr>
#> 1 a            c         unmatched              c           c              ipr  
#> 2 a            c         unmatched              c           c              weo  
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: sector <chr>, subsector <chr>

sector_profile(case_c, scenarios) |> unnest_product()
#> # A tibble: 2 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 low                       0.3 c        
#> 2 a            weo_a_2050 <NA>                     NA   c        
#> # ℹ 6 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>, tilt_subsector <chr>

sector_profile(case_c, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high              0
#> 2 a            ipr_a_2050 medium            0
#> 3 a            ipr_a_2050 low               1
#> 4 a            ipr_a_2050 <NA>              0
#> 5 a            weo_a_2050 high              0
#> 6 a            weo_a_2050 medium            0
#> 7 a            weo_a_2050 low               0
#> 8 a            weo_a_2050 <NA>              1

Also, multiple previous tests fail.

reprex

devtools::test_active_file("R/sector_profile.R")
#> 
#> [ FAIL 0 | WARN 0 | SKIP 0 | PASS 0 ]
#> [ FAIL 0 | WARN 0 | SKIP 0 | PASS 1 ]
#> [ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ]
#> [ FAIL 0 | WARN 0 | SKIP 0 | PASS 3 ]
#> [ FAIL 0 | WARN 0 | SKIP 0 | PASS 4 ]
#> [ FAIL 1 | WARN 0 | SKIP 0 | PASS 4 ]
#> [ FAIL 2 | WARN 0 | SKIP 0 | PASS 4 ]
#> [ FAIL 3 | WARN 0 | SKIP 0 | PASS 4 ]
#> [ FAIL 4 | WARN 0 | SKIP 0 | PASS 4 ]
#> [ FAIL 4 | WARN 0 | SKIP 0 | PASS 5 ]
#> [ FAIL 4 | WARN 0 | SKIP 0 | PASS 6 ]
#> [ FAIL 4 | WARN 0 | SKIP 0 | PASS 7 ]
#> [ FAIL 5 | WARN 0 | SKIP 0 | PASS 7 ]
#> [ FAIL 6 | WARN 0 | SKIP 0 | PASS 7 ]
#> [ FAIL 7 | WARN 0 | SKIP 0 | PASS 7 ]
#> [ FAIL 8 | WARN 0 | SKIP 0 | PASS 7 ]
#> [ FAIL 8 | WARN 0 | SKIP 0 | PASS 8 ]
#> [ FAIL 8 | WARN 0 | SKIP 0 | PASS 9 ]
#> [ FAIL 8 | WARN 0 | SKIP 0 | PASS 10 ]
#> [ FAIL 8 | WARN 0 | SKIP 0 | PASS 11 ]
#> 
#> ── Failure ('test-sector_profile.R:44:3'): at product level, preserves unmatched products ──
#> "unmatched" %in% out[[aka("uid")]] is not TRUE
#> 
#> `actual`:   FALSE
#> `expected`: TRUE 
#> 
#> ── Failure ('test-sector_profile.R:58:3'): at product level, unmatched product yield `NA` in the expected columns ──
#> is.na(out$grouped_by) is not TRUE
#> 
#> `actual`:       
#> `expected`: TRUE
#> 
#> ── Failure ('test-sector_profile.R:59:3'): at product level, unmatched product yield `NA` in the expected columns ──
#> is.na(out$risk_category) is not TRUE
#> 
#> `actual`:       
#> `expected`: TRUE
#> 
#> ── Failure ('test-sector_profile.R:60:3'): at product level, unmatched product yield `NA` in the expected columns ──
#> is.na(out$profile_ranking) is not TRUE
#> 
#> `actual`:       
#> `expected`: TRUE
#> 
#> ── Failure ('test-sector_profile.R:108:3'): at company level, one matched and one unmatched products yield `value = 1/2` where `risk_category = NA` and in one other `risk_category` (#657) ──
#> `na` (`actual`) not equal to 1/2 (`expected`).
#> 
#>   `actual`: 0.0
#> `expected`: 0.5
#> 
#> ── Failure ('test-sector_profile.R:110:3'): at company level, one matched and one unmatched products yield `value = 1/2` where `risk_category = NA` and in one other `risk_category` (#657) ──
#> sort(other) (`actual`) not equal to c(0, 0, 1/2) (`expected`).
#> 
#>   `actual`: 0.0 0.0 1.0
#> `expected`: 0.0 0.0 0.5
#> 
#> ── Failure ('test-sector_profile.R:126:3'): at company level, two matched and one unmatched products yield `value = 1/3` where `risk_category = NA` and `value = 2/3` in one other `risk_category` (#657) ──
#> `na` (`actual`) not equal to 1/3 (`expected`).
#> 
#>   `actual`: 0.0
#> `expected`: 0.3
#> 
#> ── Failure ('test-sector_profile.R:128:3'): at company level, two matched and one unmatched products yield `value = 1/3` where `risk_category = NA` and `value = 2/3` in one other `risk_category` (#657) ──
#> sort(other) (`actual`) not equal to c(0, 0, 2/3) (`expected`).
#> 
#>   `actual`: 0.0 0.0 1.0
#> `expected`: 0.0 0.0 0.7
#> 
#> [ FAIL 8 | WARN 0 | SKIP 0 | PASS 11 ]

So I need to investigate each problematic result. In some cases the solution may be obvious, and the new requirement will lead me to adapt or remove a previous test. In other cases the conflict between the old and new requirements may not be obvious and we'll need to discuss further to decide what the code should actually do.
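To make the expectation behind the failing company-level tests concrete, here is a minimal base-R sketch. It is not how the package computes `value`; `share_by_category()` is a hypothetical helper that only illustrates the rule those tests describe: each product counts once, and an unmatched product (`NA`) is its own category.

```r
# Hypothetical helper, not part of tiltIndicator: share of a company's
# products in each risk category, counting `NA` (unmatched) as its own category.
share_by_category <- function(risk_category,
                              levels = c("high", "medium", "low")) {
  all_levels <- c(levels, NA)
  value <- vapply(
    all_levels,
    function(level) {
      if (is.na(level)) {
        # Share of unmatched products
        mean(is.na(risk_category))
      } else {
        # Share of products matched to this risk category
        mean(risk_category %in% level)
      }
    },
    numeric(1)
  )
  data.frame(
    risk_category = all_levels,
    value = unname(value),
    stringsAsFactors = FALSE
  )
}

# One matched and one unmatched product
one_matched <- share_by_category(c("high", NA))

# Two matched and one unmatched product
two_matched <- share_by_category(c("high", "high", NA))
```

Under this rule the first case gives 1/2 for `high` and 1/2 for `NA`, and the second gives 2/3 for `high` and 1/3 for `NA`, which is what the tests tagged #657 expect.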

I'll come back tomorrow with a fresh brain.

@maurolepore that's indeed good news! Awesome.

I hope the bad news won't be too much of a headache to solve. Good luck!

@Tilmon,

sector_profile() now works as you expect. I fixed all tests and added a few more.

sector_profile_upstream() hasn't been updated yet. Can you please help me by modifying this draft example so that I have something to test against?

It's based on your Google Sheet example for sector_profile(), but it's a bit more complex since we need an additional inputs dataset.

reprex

devtools::load_all()
#> ℹ Loading tiltIndicator

# styler: off
companies <- tribble(
~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,
          "a",        "a",                         "a",          "a",             "a",
          "a",        "a",                         "a",          "a",             "a",
          "a",        "b",                 "unmatched",  "unmatched",     "unmatched",
          "a",        "c",                 "unmatched",          "c",             "c",
          "a",        "c",                 "unmatched",          "c",             "c"
)

inputs <- tribble(
  ~activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector,       ~type,     ~sector,  ~subsector, ~input_activity_uuid_product_uuid,
                          "a",                "a",                   "a",       "ipr",     "total",    "energy",                               "a",
                          "a",                "a",                   "a",       "weo",     "total",    "energy",                               "a",
                  "unmatched",        "unmatched",           "unmatched", "unmatched", "unmatched", "unmatched",                       "unmatched",
                  "unmatched",                "c",                   "c",       "ipr",  "land use",  "land use",                       "unmatched",
                  "unmatched",                "c",                   "c",       "weo",          NA,          NA,                       "unmatched"
)

scenarios <- tribble(
     ~sector,   ~subsector,  ~year, ~reductions, ~type, ~scenario,
     "total",     "energy",   2050,         1.0, "ipr",       "a",
     "total",     "energy",   2050,         0.6, "weo",       "a",
  "land use",   "land use",   2050,         0.3, "ipr",       "a"
)
# styler: on

sector_profile_upstream(companies, scenarios, inputs) |> unnest_product()
#> # A tibble: 8 × 13
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            ipr_a_2050 high                      1   a        
#> 2 a            weo_a_2050 medium                    0.6 a        
#> 3 a            <NA>       <NA>                     NA   b        
#> 4 a            ipr_a_2050 low                       0.3 b        
#> 5 a            <NA>       <NA>                     NA   b        
#> 6 a            <NA>       <NA>                     NA   c        
#> 7 a            ipr_a_2050 low                       0.3 c        
#> 8 a            <NA>       <NA>                     NA   c        
#> # ℹ 8 more variables: activity_uuid_product_uuid <chr>, tilt_sector <chr>,
#> #   scenario <chr>, year <dbl>, type <chr>,
#> #   input_activity_uuid_product_uuid <chr>, input_tilt_sector <chr>,
#> #   input_tilt_subsector <chr>

Hi @maurolepore, awesome that sector_profile() now shows the results as expected!! Congrats!

Re sector_profile_upstream(): @AnneSchoenauer and I discussed priorities and decided that, given our ambitious target to launch the webtool in June, it's best to pause the sector_profile_upstream() work for now. It's a bit tricky to publish the licensed input data in the webtool anyway. Therefore, we suggest launching the webtool, at least in its first version, without the upstream results.

It would therefore be great if you could soon start working on the webtool instead!

it's best to pause the sector_profile_upstream() work for now -- @Tilmon in https://github.com/2DegreesInvesting/tiltIndicator/pull/739#issuecomment-2117556054

Noted (https://github.com/2DegreesInvesting/tiltIndicator/issues/784).

Thanks!
