At product level, `emissions_profile*()` should output `profile_ranking`

AnneSchoenauer commented 8 months ago

Relates to https://github.com/2DegreesInvesting/tiltIndicator/issues/566
Relates to https://github.com/2DegreesInvesting/tiltIndicator/issues/549

--

We need to calculate the transition risk score. Two things are needed for this: the SERT for the sector profiles (already planned for the enhancement of the tilt indicator) and the rank of the company, that is, the percentile in which the company is located.

This ticket should be solved when this ticket here is completed.

Issue Description:

Background: We have a dataset that lists products and their associated carbon footprints. We've previously categorized these products based on their carbon footprints as "low", "medium", or "high". These categories correspond to the bottom 33%, the middle 33-66%, and the top 66+%, respectively.

Task: We need to further refine this data by adding an exact percentile rank for each product's carbon footprint. This will allow us to know if a product is, for example, in the 20th percentile or the 80th percentile, etc.

Acceptance Criteria:

Add a new column to the dataset named "Percentile Rank".
For each product, compute and populate its exact percentile rank based on its carbon footprint. This value should be between 0 and 100.
Preserve the original "low", "medium", "high" categorization, but add the exact percentile next to each product.

Steps:

Using the quantile function in R, compute the exact percentile ranks for all product carbon footprints.
Add these ranks to the dataset as a new column.
Ensure no values are missing or improperly calculated.

Notes:

Be mindful of potential ties in the dataset. If two products have the same carbon footprint, they should have the same rank. Ensure that the dataset maintains its original order after the ranking is done. Please also note that the rank differs depending on which benchmark we use, so we need these ranks for each of the six benchmarks (e.g. tilt_sec, tilt_sec_unit, tilt_isic...).
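
The steps and notes above could be sketched roughly as follows. This is only an illustration under assumptions: `percentile_rank()` is a hypothetical helper (not a tiltIndicator function), the data are made up, and `rank()` is used instead of `quantile()` because `quantile()` returns cut points rather than the rank of each product. With `ties.method = "max"`, tied footprints share one rank, and `ave()` ranks within each benchmark group while keeping the original row order.

```r
# Hypothetical helper (not part of tiltIndicator): percentile rank on a 0-100 scale.
percentile_rank <- function(x) {
  # ties.method = "max" gives tied values the same rank, so a value's
  # percentile is the share of values less than or equal to it.
  100 * rank(x, ties.method = "max") / length(x)
}

# Made-up footprints with one tie, plus a made-up benchmark grouping.
products <- data.frame(
  product = c("a", "b", "c", "d", "e", "f"),
  tilt_sec = c("x", "x", "x", "x", "y", "y"),
  co2_footprint = c(10, 20, 20, 40, 5, 1)
)

# Rank across all products; rows keep their original order.
products$percentile_all <- percentile_rank(products$co2_footprint)

# Rank within each benchmark group; ave() also preserves row order.
products$percentile_tilt_sec <- ave(
  products$co2_footprint, products$tilt_sec,
  FUN = percentile_rank
)

print(products)
```

Repeating the `ave()` call with a different grouping column would give one rank column per benchmark.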

maurolepore commented 8 months ago

@AnneSchoenauer, so this quantile-based score is different than the rank-based score behind "low", "medium", and "high"?

FYI, currently we first take the co2_footprint and apply rank_proportion() to get a score:

rank_proportion <- function(x) {
  rank(x) / length(x)
}

Then we take that score and categorize it with categorize_risk() to get the risk categories "low", "medium", and "high".

categorize_risk <- function(x, low_threshold, high_threshold, ...) {
  case_when(
    x > high_threshold ~ "high",
    x > low_threshold & x <= high_threshold ~ "medium",
    x <= low_threshold ~ "low",
    ...
  )
}

AnneSchoenauer commented 8 months ago

@maurolepore, thanks a lot for this.

FYI, currently we first take the co2_footprint and apply rank_proportion() to get a score:

Could you tell me what this score looks like for some data?

Then we take that score and categorize it with categorize_risk() to get the risk categories "low", "medium", and "high".

Can you tell me how the thresholds high_threshold and low_threshold are defined, please?

Thanks a lot!

maurolepore commented 8 months ago

Here is an internal intermediate dataset that may help. Note the column values_to_categorize is the rank-based score that we use to create risk_category.

# A tibble: 10 × 12
   grouped_by co2_footprint values_to_categorize low_threshold high_threshold risk_category tilt_sec       tilt_subsector unit  isic_sec activity_uuid_product_uuid                                                ei_activity_name                      
   <chr>              <dbl>                <dbl>         <dbl>          <dbl> <chr>         <chr>          <chr>          <chr> <chr>    <chr>                                                                     <chr>                                 
 1 all               176.                    1           0.333          0.667 high          Industry       Other          unit  2560     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa cookstove production or electric      
 2 all                58.1                   0.8         0.333          0.667 high          Industry       Other          unit  2560     be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc microwave oven production             
 3 all                 4.95                  0.4         0.333          0.667 medium        Steel & Metals Steel          kg    2870     977d997e-c257-5033-ba39-d0edeeef4ba2_0ace02fa-eca5-482d-a829-c18e46a52db4 market for steel, chromium steel      
 4 all                12.5                   0.6         0.333          0.667 medium        Agriculture    Agriculture    kg    1780     ebb8475e-ff57-5e4e-937b-b5788186a5ca_ccee034c-8b6c-40d6-ac36-4c70c4623efa cheese production, soft, from cow milk
 5 all                 2.07                  0.2         0.333          0.667 low           Industry       Other          kg    2679     2f7b77a7-1556-5c1b-b0aa-c4534ddc8885_38d493e9-6feb-4c66-86eb-2253ef8ee54d market for seal, natural rubber based 
 6 isic_sec          176.                    1           0.333          0.667 high          Industry       Other          unit  2560     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa cookstove production or electric      
 7 isic_sec           58.1                   0.5         0.333          0.667 medium        Industry       Other          unit  2560     be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc microwave oven production             
 8 isic_sec            4.95                  1           0.333          0.667 high          Steel & Metals Steel          kg    2870     977d997e-c257-5033-ba39-d0edeeef4ba2_0ace02fa-eca5-482d-a829-c18e46a52db4 market for steel, chromium steel      
 9 isic_sec           12.5                   1           0.333          0.667 high          Agriculture    Agriculture    kg    1780     ebb8475e-ff57-5e4e-937b-b5788186a5ca_ccee034c-8b6c-40d6-ac36-4c70c4623efa cheese production, soft, from cow milk
10 isic_sec            2.07                  1           0.333          0.667 high          Industry       Other          kg    2679     2f7b77a7-1556-5c1b-b0aa-c4534ddc8885_38d493e9-6feb-4c66-86eb-2253ef8ee54d market for seal, natural rubber based
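
As a cross-check, the values_to_categorize and risk_category columns in the table above can be reproduced with a short base-R sketch. The five footprints and ISIC codes are copied from the table; `rank_proportion()` is the function quoted earlier in this thread; `categorize()` is a hypothetical base-R stand-in for `categorize_risk()` (used here only to avoid depending on dplyr), and the column names `score_*` and `category_*` are made up for illustration.

```r
# rank_proportion() as quoted earlier in this thread.
rank_proportion <- function(x) {
  rank(x) / length(x)
}

# Hypothetical base-R stand-in for categorize_risk(), same thresholds logic.
categorize <- function(x, low_threshold = 1 / 3, high_threshold = 2 / 3) {
  ifelse(x > high_threshold, "high",
    ifelse(x > low_threshold, "medium", "low")
  )
}

# Footprints and ISIC codes taken from the tibble above.
products <- data.frame(
  co2_footprint = c(176, 58.1, 4.95, 12.5, 2.07),
  isic_sec = c("2560", "2560", "2870", "1780", "2679")
)

# Benchmark "all": every row belongs to one group.
products$score_all <- rank_proportion(products$co2_footprint)

# Benchmark "isic_sec": rank within each ISIC group.
products$score_isic_sec <- ave(
  products$co2_footprint, products$isic_sec,
  FUN = rank_proportion
)

products$category_all <- categorize(products$score_all)
products$category_isic_sec <- categorize(products$score_isic_sec)

print(products)
```

The scores come out as 1, 0.8, 0.4, 0.6, 0.2 for "all" and 1, 0.5, 1, 1, 1 for "isic_sec", matching the table. A product that is alone in its group always gets a score of 1 and therefore lands in "high".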

For the record, I got it by running the example of emissions_profile() and stopping execution at line 11 of the internal function emissions_profile_any_at_product_level() https://github.com/2DegreesInvesting/tiltIndicator/blob/2192074cb2264e905510f351c0c4092f07def7ae/R/emissions_profile_any_at_product_level.R#L11

emissions_profile_any_at_product_level <- function(companies,
                                                   co2,
                                                   low_threshold = 1 / 3,
                                                   high_threshold = 2 / 3) {
  co2 <- sanitize_co2(co2)
  x <- list(companies = companies, co2 = co2)
  epa_check(x)

  .companies <- prepare_companies(companies)
  .co2 <- prepare_co2(co2, low_threshold, high_threshold)

  .co2 |>
    epa_add_values_to_categorize() |>
    add_risk_category(low_threshold, high_threshold) |>
    join_companies(.companies) |>
    epa_select_cols_at_product_level() |>
    polish_output(cols_at_product_level())
}

AnneSchoenauer commented 8 months ago

Dear @maurolepore, thanks a lot for following up on this! I think this is exactly what we need. A final question, to be 100% sure: length(x) is the number of all products that we use for the benchmarking, right? And rank(x) is the position of one product's carbon footprint compared to all other products, right? That means that if we have, for example, 120 products with a carbon footprint, then length(x) = 120, and if one product's carbon footprint is the 4th lowest, the value to categorise would be 4/120, right? So the value to categorise would be about 0.03. This is lower than the low threshold, which is 0.333, and therefore the product is categorised as low. Is this correct?
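
The arithmetic in this question can be checked with a small sketch. It assumes 120 made-up, distinct footprints that all fall in a single benchmark group; only their order matters, not the values.

```r
# rank_proportion() as quoted earlier in this thread.
rank_proportion <- function(x) {
  rank(x) / length(x)
}

# 120 made-up, distinct footprints in increasing order.
co2_footprint <- seq(from = 0.5, by = 0.5, length.out = 120)

score <- rank_proportion(co2_footprint)

# The 4th lowest footprint gets 4 / 120, roughly 0.033.
print(score[4])

# It is at or below the low threshold of 1/3, so it would be "low".
print(score[4] <= 1 / 3)
```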

maurolepore commented 8 months ago

length(x) is the number of all products that we use for the benchmarking, right? And rank(x) is the position of one product's carbon footprint compared to all other products, right?

I just explored the code and I see that we do every calculation within the groups defined by each benchmark. The most comprehensive one is "all", which considers all rows in the dataset (but not all the products you wish: #566).

I'll Slack you a link to a video where I show this interactively in RStudio.

--

At the conceptual level I think the best person to ask is Tilman. I recall he could articulate these calculations clearly and off the top of his head. So if there is a mismatch between what you think should happen and what actually happens, you may want to discuss it with him. If the change is time-consuming it would be a waste to do it one way now and then undo it later.

AnneSchoenauer commented 8 months ago

Thanks @maurolepore, I will talk to Tilman today then, but I am actually quite sure that @tilman unfortunately made a mistake here. But thanks for letting me know. I will double-check with him and let you know how to continue.

AnneSchoenauer commented 8 months ago

Dear @maurolepore, I talked to Tilman. However, I will follow up in this ticket: https://github.com/2DegreesInvesting/tiltIndicator/issues/566, as the ticket here is a slightly different problem. For me it is now clear that we can have an "exact percentile rank", which is the variable "values_to_categorize". So this is great. Let's leave this issue aside and first fix the benchmark issue #566.

AnneSchoenauer commented 8 months ago

I think this issue refers to the output files. As we now calculate "values_to_categorize" (now called "profile_ranking" in the tiltIndicatorBefore package), we just need to ensure that this information is not lost and is part of the output files in the end. @kalashsinghal and @maurolepore, who would be responsible for it?

maurolepore commented 8 months ago

I think this belongs to tiltIndicator. Once tiltIndicatorBefore computes profile_ranking using the entire ecoinvent dataset (#566), that column still needs to be exposed in tiltIndicator. So I'll leave it here, assigned to me.

AnneSchoenauer commented 8 months ago

Okay, agreed!

2DegreesInvesting / tiltIndicator

At product level, `emissions_profile*()` should output `profile_ranking` #581