In `emissions_profile*()` inform the co2 range for each `risk_category`

2DegreesInvesting / tiltIndicator

Implement the core business logic of the tilt indicators

https://2degreesinvesting.github.io/tiltIndicator/

GNU General Public License v3.0

1 stars 1 forks source link

In `emissions_profile*()` inform the co2 range for each `risk_category` #554

Closed AnneSchoenauer closed 10 months ago

AnneSchoenauer commented 11 months ago

The banks would like to see the Co2e range in which a product is located if it is categorised to a certain category like low, medium, high. So for example if the benchmark is all then all products that are in the category high should be associated with the same Co2e range. It would be great to see those ranges in the data as the column Co2_lowerand Co2_higher.

This would be nice to have implemented on the product and company level for the emission (upstream) profiles.

Please let me know if this is unclear.

maurolepore commented 10 months ago

@AnneSchoenauer, I need your help to figure out the methodology to compute those CO2 ranges.

What we know is this:

When we prepare the "co2" data (products or inputs) we rank all products in the ecoinvent dataset with profile_ranking = rank(co2_footprint) / lengh(co2_footprint).
When the user calls emissions_profile*() we compare each value of profile_ranking against the thresholds given by the user, and assign risk_category "high", "medium" or "low" as follows:
- If profile_ranking > high_threshold, then risk_category = "high".
- If profile_ranking > low_threshold & x <= high_threshold, then risk_category = "medium".
- If profile_ranking <= low_threshold, then risk_category = "low".

One challenge I see is that profile_ranking is calculated before we know which thresholds the user will choose.

AnneSchoenauer commented 10 months ago

Dear @maurolepore,

I think the co2 range is something different. I try to make it more explicit.

So we have this dataset toy_emission_profile_products(). In this dataset we have a co2_footprint. When we do the benchmarking what happens is that we group different products according to their sector/isic/unit. We then derive the profile_ranking and associate as you said correctly different risk_categories.

Now, what we would need is that per benchmark and per risk_category, we understand the range of the co2_footprint of all products that are in there. We then pick the highest and the lowest co2_footprint to create two new columns which show the ranges. I tried to create a reprex with some toy data. Please note that I made up the data completely. If you feel that there is a better convention to picture a range in the csv file please come up with a different solution. But I think with a dding the two columns I described it becomes clear what I want to do.

# Load required library
library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Your original dataset
toy_dataset <- data.frame(
  ep_main_act = c(
    "manufacturer/ producer",
    "distributor",
    "manufacturer/ producer", 
    "manufacturer/ producer",
    "distributor",
    "manufacturer/ producer"
  ),
  ep_clustered = c(
    "oil",
    "oil",
    "animal fats",
    "animal fats",
    "olive oil", 
    "olive oil"
  ),
  activity_name = c(
    "oil production",
    "market for oil",
    "fatty acid production, from soybean oil", 
    "market for fatty acid, from soybean oil", 
    "olive oil production", 
    "market for oil"
  ),
  isic_digits = c(
    "3830",
    "3830",
    "3830", 
    "3830", 
    "3830", 
    "3830"
  ),
  co2_footprint = c(
    10,
    13,
    6,
    11, 
    3,
    5
  ),
  benchmark = c(
    "isic_sec", 
    "isic_sec",
    "isic_sec", 
    "isic_sec", 
    "isic_sec", 
    "isic_sec"
  ),
  risk_categories = c(
    "medium", 
    "high", 
    "medium", 
    "high", 
    "low", 
    "low"
  )
)

# Function to add upper and lower range columns
add_co2_footprint_range <- function(dataset) {
  dataset %>% 
    group_by(benchmark, risk_categories) %>%
    mutate(co2_footprint_upper = max(co2_footprint),
           co2_footprint_lower = min(co2_footprint)) %>%
    ungroup()
}

# Applying the function to your dataset
toy_dataset_with_range <- add_co2_footprint_range(toy_dataset)

# Viewing the modified dataset
print(toy_dataset_with_range)
#> # A tibble: 6 × 9
#>   ep_main_act     ep_clustered activity_name isic_digits co2_footprint benchmark
#>   <chr>           <chr>        <chr>         <chr>               <dbl> <chr>    
#> 1 manufacturer/ … oil          oil producti… 3830                   10 isic_sec 
#> 2 distributor     oil          market for o… 3830                   13 isic_sec 
#> 3 manufacturer/ … animal fats  fatty acid p… 3830                    6 isic_sec 
#> 4 manufacturer/ … animal fats  market for f… 3830                   11 isic_sec 
#> 5 distributor     olive oil    olive oil pr… 3830                    3 isic_sec 
#> 6 manufacturer/ … olive oil    market for o… 3830                    5 isic_sec 
#> # ℹ 3 more variables: risk_categories <chr>, co2_footprint_upper <dbl>,
#> #   co2_footprint_lower <dbl>

^{Created on 2023-11-15 with reprex v2.0.2}

maurolepore commented 10 months ago

@AnneSchoenauer thanks a lot for your reprex. It helps a lot.

You asked for feedback about the reprex itself. In "Reprex do's and don'ts" you'll see:

Use the smallest, simplest, most built-in data possible.

So here I reduce your reprex quite a bit while preserving your point.

I removed all comments are they simply restate what the code already says.
I use column names that reflect the implementation in tiltIndicator.
I use fewer columns -- just the ones we need to make your point.
I use simpler numbers like 1, 2 instead of 10, 13.
For grouped_by I use a stable value: "all". The value "isic_sec" is controvertial as it's a confusing crop of isic_4digit.
I styled the code according to the tidyverse style guide, and use # styler: off and # styler: on to stop the automatic styling of reprex to break the allignment of my tribble().
I use tribble() because it shows the code in tabular form, so we can understand it without needing to print it.
I don't use print() as it's unnecessary.
I removed the function. It's really nice but for a minimal reprex is not crucial.
I use the .by argument to mutate() because it does more simply what group_by() and ungroup() do. In a single call it's simpler.
If I'm using dplyr, then I use tibble() or tribble() -- not data.frame()` -- because tibbles print more nicely (including colum types).

# styler: off
library(dplyr, warn.conflicts = FALSE)

toy <- tribble(
  ~grouped_by, ~risk_category, ~co2_footprint,
        "all",       "medium",              1,
        "all",       "medium",              2,
        "all",         "high",              3,
        "all",         "high",              4,
)

toy |> 
  mutate(
    co2_footprint_lower = min(co2_footprint),
    co2_footprint_upper = max(co2_footprint),
    .by = c("grouped_by", "risk_category")
)
#> # A tibble: 4 × 5
#>   grouped_by risk_category co2_footprint co2_footprint_lower co2_footprint_upper
#>   <chr>      <chr>                 <dbl>               <dbl>               <dbl>
#> 1 all        medium                    1                   1                   2
#> 2 all        medium                    2                   1                   2
#> 3 all        high                      3                   3                   4
#> 4 all        high                      4                   3                   4
# styler: on

^{Created on 2023-11-15 with reprex v2.0.2}

maurolepore commented 10 months ago

And here is a more realistic reprex showing the entire workflow.

I draft two post-processing helpers -- following the open-close principle of good software design:

software entities (classes, modules, functions, and so on) should be open for extension, but closed for modification.

This avoids making the output emissions_profile*() different to the output of sector_profile*(), which would soon result in expensive maintenance.

BTW, is there a meaningful way to do something similar with sector_profile()? Would users be interested in the rage of reductions?

# New post-processing helpers
add_co2_range <- function(data) {
  data |> 
    mutate(
      lower = min(co2_footprint),
      upper = max(co2_footprint),
      .by = c("grouped_by", "risk_category")
    )
}

show_co2_range <- function(data) {
  data |> 
    add_co2_range() |> 
    distinct(grouped_by, risk_category, lower, upper) |> 
    filter(!is.na(grouped_by))
}

library(dplyr, warn.conflicts = FALSE)
library(readr, warn.conflicts = FALSE)
library(tiltToyData)
library(tiltIndicator)

options(readr.show_col_types = FALSE)
# Show wider results, i.e. don't crop the final tibble
options(width = 200)

companies <- read_csv(toy_emissions_profile_any_companies())
products <- read_csv(toy_emissions_profile_products())

product_level <- emissions_profile(companies, products) |> unnest_product()

# Maybe too much?
product_level |> add_co2_range()
#> # A tibble: 49 × 8
#>    companies_id                             grouped_by    risk_category clustered activity_uuid_product_uuid                                                co2_footprint lower upper
#>    <chr>                                    <chr>         <chr>         <chr>     <chr>                                                                             <dbl> <dbl> <dbl>
#>  1 fleischerei-stiefsohn_00000005219477-001 all           high          stove     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa         176.  58.1  176. 
#>  2 fleischerei-stiefsohn_00000005219477-001 isic_sec      high          stove     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa         176.   2.07 176. 
#>  3 fleischerei-stiefsohn_00000005219477-001 tilt_sec      high          stove     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa         176.   4.95 176. 
#>  4 fleischerei-stiefsohn_00000005219477-001 unit          high          stove     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa         176.  12.5  176. 
#>  5 fleischerei-stiefsohn_00000005219477-001 unit_isic_sec high          stove     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa         176.   2.07 176. 
#>  6 fleischerei-stiefsohn_00000005219477-001 unit_tilt_sec high          stove     0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa         176.   2.07 176. 
#>  7 fleischerei-stiefsohn_00000005219477-001 all           high          oven      be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc          58.1 58.1  176. 
#>  8 fleischerei-stiefsohn_00000005219477-001 isic_sec      medium        oven      be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc          58.1 58.1   58.1
#>  9 fleischerei-stiefsohn_00000005219477-001 tilt_sec      medium        oven      be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc          58.1 58.1   58.1
#> 10 fleischerei-stiefsohn_00000005219477-001 unit          medium        oven      be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc          58.1  4.95  58.1
#> # ℹ 39 more rows

# Just right?
product_level |> show_co2_range()
#> # A tibble: 15 × 4
#>    grouped_by    risk_category lower  upper
#>    <chr>         <chr>         <dbl>  <dbl>
#>  1 all           high          58.1  176.  
#>  2 isic_sec      high           2.07 176.  
#>  3 tilt_sec      high           4.95 176.  
#>  4 unit          high          12.5  176.  
#>  5 unit_isic_sec high           2.07 176.  
#>  6 unit_tilt_sec high           2.07 176.  
#>  7 isic_sec      medium        58.1   58.1 
#>  8 tilt_sec      medium        58.1   58.1 
#>  9 unit          medium         4.95  58.1 
#> 10 unit_isic_sec medium        58.1   58.1 
#> 11 unit_tilt_sec medium        58.1   58.1 
#> 12 all           medium         4.95  12.5 
#> 13 all           low            2.07   2.07
#> 14 tilt_sec      low            2.07   2.07
#> 15 unit          low            2.07   2.07

^{Created on 2023-11-15 with reprex v2.0.2}

AnneSchoenauer commented 10 months ago

Thanks @maurolepore! Also for your feedback on the reprex this is super valuable. And I will make sure to follow this next time! There are the following comments for now:

In general your reprex shows exactly what we need! Thanks!
The problem that I noticed is the following. The co2_footrpint data is data that we cannot share with the banks. What I noticed is in the product_level dataset that in column 13-15 the lower and upper range is 2.07. I think this can only be the case if we only have one product in this category right? The problem is if the lower and upper range is 2.07 that we run into problems as you could derive the co2_footprint for this specific product. Now I am thinking to include a rule that the ranges can only be shown if there are more than two products in the category. OR if round the co2_footprints and make it therefore less exact. In any case I will have to talk to @Tilmon about this! As we need to solve this. Do you have other ideas? We could also add like a SD maybe even - or a kind of uncertainty range... Let me think about this.
I also noticed taht we might run in a similiar problem with the above if we also show the profile_ranking column in parallel to the lower and upper bound. So if a bank nows that a product is on the profile_ranking 1 and the upper benchmark shows a co2 footprint it directly now the co2 footrpint for this specific product. 2+3 become clear to me when seeing your reprex. We just need to make sure that there is no reverse calculation that the banks can do to derive at the co2_footprint for specific products.
For the sector profile, the data is not licenced. Therefore, we can show the pure data and don't need to calculate any ranges. This is why I added this ticket here.

Sooo... maybe if you already have good ideas about 2 + 3 that would be helpful. Otherwise I would think about it with Tilman this afternoon ( he starts working at 12) and let you know of how we decide. But in general the reprex showed that you implemented the idea exactly as we wanted to have it.

maurolepore commented 10 months ago

@AnneSchoenauer

Great to know we arrived to a satisfactory implementation.

Regarding the conceptual limitations, How about adding some noise? Here I create a lil helper that defaults to adding 10% noise but we can change that to anything you think is acceptable. In this case I would also print a warning message alerting that the values are intentionally noisy, but I leave that out of the reprex for now.

library(dplyr, warn.conflicts = FALSE)

ranges <- tribble(
  ~lower, ~upper,
       1,      7,
       2,      8,
       3,      9,
)

add_noise <- function(x, noise = 0.1) {
  x * (1 + noise * rnorm(length(x)))
}

ranges |> 
  mutate(
    noisy_lower = add_noise(lower),
    noisy_upper = add_noise(upper)
  )
#> # A tibble: 3 × 4
#>   lower upper noisy_lower noisy_upper
#>   <dbl> <dbl>       <dbl>       <dbl>
#> 1     1     7       0.910        6.26
#> 2     2     8       2.09         8.48
#> 3     3     9       3.10        10.3

^{Created on 2023-11-17 with reprex v2.0.2}

AnneSchoenauer commented 10 months ago

@maurolepore This sounds really good!! Let me discuss with Tilman at 2 an thereafter we can take the decision! But I think this would be a very nice solution!

AnneSchoenauer commented 10 months ago

Dear @maurolepore, We noticed a problem with the 10% noise. The problem is that we want to be very transparent with banks in our methodology, so we would like to tell them that we added a 10% noise. However, if they again know that the noise is added and the specific number they can reverse the calculation. Therefore, we came up with the following rules. Let us know if this is possible technically:

For low risk category and high risk category combined with the profile ranking of 1 and the lowest rank please add the following (Case 3): For the lower bound for lowest risk category, please round always downwards:

if number is between 0-10 then please round to the full decimal 0,43 --> 0,4
if number is between 10-100 then please round on the full 1th digit, e.g. 15,7 --> 15
if number is between 100 - bigger please round on the the full 10th digit e.g. 153 --> 150

For the upper bound of the higher please round always upwards:

if number is between 0-10 then please round to the full decimal 0,43 --> 0,5
if number is between 10-100 then please round on the full 1th digit, e.g. 15,7 --> 16
if number is between 100 - bigger please round on the the full 10th digit e.g. 153 --> 160

With the case that there is only one product in one benchmark risk category (case 2), we would like to understand if this actually happens with real data - could you therefore implement some code that warns us if the case is happening? If the case indeed happens we need to find another solution but as we said, we don't think that this will happen.

Let me know if there are any questions!!!

maurolepore commented 10 months ago

The noise is random. How can that be reversed? Notice add_noice() uses rnorm().

For example, I created x but I don't share it with you. Can you tell me what's x from this?

# add_noise <- function(x, noise = 0.1) {
#     x * (1 + noise * rnorm(length(x)))
#   }
#  
# add_noise(x)
# [1] 0.8824415 2.1202853 3.0343012

maurolepore commented 10 months ago

OK, the solution is x <- c(1.1, 2.2, 3.3). Did you guess it?

AnneSchoenauer commented 10 months ago

Okay that makes more sense - we thought you just want to add 10% of the co2_footprint! I think that makes a lot of sense in this cases we would then need for the higher risk category and the upper bound this forumla:

# add_noise <- function(x, noise = 0.1) {
#     x * (1 + noise * rnorm(length(x)))
#   }

And for the low risk category lower bound this forumla:

# add_noise <- function(x, noise = 0.1) {
#     x * (1 - noise * rnorm(length(x)))
#   }

@Tilmon do you agree to this? I also chucked the formula into GPT to understand more about it. Have a look:

I think we can very transparently tell this to the banks. Shall we just start with 0.1 and have a look how this goes?

@Tilmon if you approve then @maurolepore you can implement it as such! Thanks a lot for this solution!

Tilmon commented 10 months ago

Using the randomized noise sounds good to me. Thanks @AnneSchoenauer and @maurolepore !

maurolepore commented 10 months ago

Great.

I believe we can create noise in the same way for both limits of the range? Note add_noise() adds a difference in both directions -- sometimes positive and sometimes negative.

library(dplyr, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)

ranges <- tibble(lower = 1:20, upper = lower + 50)

add_noise <- function(x, noise = 0.1) {
  x * (1 + noise * rnorm(length(x)))
}

noisy <- ranges |> 
  mutate(
    noisy_lower = add_noise(lower),
    noisy_upper = add_noise(upper),
    lower_difference = lower - noisy_lower,
    upper_difference = upper - noisy_upper
  )

noisy
#> # A tibble: 20 × 6
#>    lower upper noisy_lower noisy_upper lower_difference upper_difference
#>    <int> <dbl>       <dbl>       <dbl>            <dbl>            <dbl>
#>  1     1    51       0.920        41.5           0.0804            9.53 
#>  2     2    52       1.82         55.1           0.179            -3.10 
#>  3     3    53       3.31         60.7          -0.308            -7.65 
#>  4     4    54       4.29         59.7          -0.287            -5.74 
#>  5     5    55       5.13         55.1          -0.130            -0.102
#>  6     6    56       6.14         49.9          -0.140             6.12 
#>  7     7    57       6.68         55.0           0.320             2.02 
#>  8     8    58       9.20         56.2          -1.20              1.83 
#>  9     9    59       9.23         57.4          -0.230             1.61 
#> 10    10    60       9.80         59.3           0.197             0.681
#> 11    11    61      11.2          53.9          -0.182             7.12 
#> 12    12    62      11.6          70.6           0.358            -8.61 
#> 13    13    63      14.1          61.4          -1.12              1.56 
#> 14    14    64      17.2          66.9          -3.17             -2.89 
#> 15    15    65      14.7          84.7           0.284           -19.7  
#> 16    16    66      20.1          69.6          -4.05             -3.59 
#> 17    17    67      21.4          65.8          -4.37              1.23 
#> 18    18    68      18.1          79.0          -0.146           -11.0  
#> 19    19    69      18.8          64.2           0.217             4.82 
#> 20    20    70      16.5          60.8           3.51              9.15

ggplot(noisy) +
  geom_line(aes(lower, upper), color = "grey") +
  geom_point(aes(noisy_lower, noisy_upper))

^{Created on 2023-11-17 with reprex v2.0.2}

Tilmon commented 10 months ago

@maurolepore thanks for the info! I see an issue here and wonder what you say (also based on conversation with @AnneSchoenauer earlier today): If the noise can go in both directions, it may create a situation where the upper bound becomes lower than the lower bound, no?

E.g., let's say in a high category, the

"real" upper bound is 10 and 10 + noise = 9
and "real" lower bound is 9.5 and remains that way.
--> Then, the higher bound is lower than the actual lower bound. This is counterintuitive and not really insightful for the user.

Is that true or am I mistaken? If true, can you instead ensure that noise only goes in one direction as Anne proposed in her comment?

AnneSchoenauer commented 10 months ago

Okay Tilman was quicker but I was writing the same comment :D. But let me know Mauro if this is unclear then I try to clarify (I was doing a little repo but maybe Tilmans's explanation already is enough)?

maurolepore commented 10 months ago

I see. Good point. Yeah, certainly we can avoid that problem by adding noise only to the left of the lower limit and to the right of the upper limit.

@AnneSchoenauer, is that what your two implementations do? One adds noise to left of the lower limit and other one to the right of the upper limit?

maurolepore commented 10 months ago

In any case I have enough to draft a PR and share it with you.

Thanks!

AnneSchoenauer commented 10 months ago

@maurolepore - Yes exactly that is why in the first equation there is a "1 + randomised noise" and in the other there was a "1 - randomised noise". But maybe this doesn't work like this.

Looking forward to the PR! Thanks @maurolepore

maurolepore commented 10 months ago

FYI

I realized R includes a jitter() but the implementation seems complex. I would like to use it to benefit from it's robustness but the way to set the amount of jitter is quite complicated.

So I'm currently going with this version of your implementation, which seems to respond well to side and amont:

jitter_left <- function(x, amount = 0.1) {
  x - absolute_jitter(x, amount)
}

jitter_right <- function(x, amount = 0.1) {
  x + absolute_jitter(x, amount)
}

absolute_jitter <- function(x, amount = 0.1) {
  abs(amount * rnorm(length(x)))
}

library(tidyverse)

amount <- 0.25
x <- 1:10
data <- tibble(
  x,
  left = jitter_left(x, amount), 
  right = jitter_right(x, amount)
)

ggplot(data) +
  geom_line(aes(x, x)) +
  geom_point(aes(left, x), color = "blue") +
  geom_point(aes(right, x), color = "red")


amount <- 0.75
x <- 1:10
data <- tibble(
  x, 
  left = jitter_left(x, amount), 
  right = jitter_right(x, amount)
)

ggplot(data) +
  geom_line(aes(x, x)) +
  geom_point(aes(left, x), color = "blue") +
  geom_point(aes(right, x), color = "red")

^{Created on 2023-11-18 with reprex v2.0.2}