Closed AnneSchoenauer closed 10 months ago
@AnneSchoenauer, I need your help to figure out the methodology to compute those CO2 ranges.
What we know is this:
When we prepare the "co2" data (products or inputs) we rank all products in the ecoinvent
dataset with profile_ranking = rank(co2_footprint) / lengh(co2_footprint)
.
When the user calls emissions_profile*()
we compare each value of profile_ranking
against the thresholds given by the user, and assign risk_category
"high", "medium" or "low" as follows:
profile_ranking
> high_threshold
, then risk_category
= "high".profile_ranking
> low_threshold & x <= high_threshold
, then risk_category = "medium".profile_ranking
<= low_threshold, then risk_category
= "low".One challenge I see is that profile_ranking
is calculated before we know which thresholds the user will choose.
Dear @maurolepore,
I think the co2 range is something different. I try to make it more explicit.
So we have this dataset toy_emission_profile_products()
. In this dataset we have a co2_footprint
. When we do the benchmarking what happens is that we group different products according to their sector/isic/unit. We then derive the profile_ranking and associate as you said correctly different risk_categories.
Now, what we would need is that per benchmark and per risk_category, we understand the range of the co2_footprint of all products that are in there. We then pick the highest and the lowest co2_footprint to create two new columns which show the ranges. I tried to create a reprex with some toy data. Please note that I made up the data completely. If you feel that there is a better convention to picture a range in the csv file please come up with a different solution. But I think with a dding the two columns I described it becomes clear what I want to do.
# Load required library
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#>
#> filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#>
#> intersect, setdiff, setequal, union
# Your original dataset
toy_dataset <- data.frame(
ep_main_act = c(
"manufacturer/ producer",
"distributor",
"manufacturer/ producer",
"manufacturer/ producer",
"distributor",
"manufacturer/ producer"
),
ep_clustered = c(
"oil",
"oil",
"animal fats",
"animal fats",
"olive oil",
"olive oil"
),
activity_name = c(
"oil production",
"market for oil",
"fatty acid production, from soybean oil",
"market for fatty acid, from soybean oil",
"olive oil production",
"market for oil"
),
isic_digits = c(
"3830",
"3830",
"3830",
"3830",
"3830",
"3830"
),
co2_footprint = c(
10,
13,
6,
11,
3,
5
),
benchmark = c(
"isic_sec",
"isic_sec",
"isic_sec",
"isic_sec",
"isic_sec",
"isic_sec"
),
risk_categories = c(
"medium",
"high",
"medium",
"high",
"low",
"low"
)
)
# Function to add upper and lower range columns
add_co2_footprint_range <- function(dataset) {
dataset %>%
group_by(benchmark, risk_categories) %>%
mutate(co2_footprint_upper = max(co2_footprint),
co2_footprint_lower = min(co2_footprint)) %>%
ungroup()
}
# Applying the function to your dataset
toy_dataset_with_range <- add_co2_footprint_range(toy_dataset)
# Viewing the modified dataset
print(toy_dataset_with_range)
#> # A tibble: 6 × 9
#> ep_main_act ep_clustered activity_name isic_digits co2_footprint benchmark
#> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 manufacturer/ … oil oil producti… 3830 10 isic_sec
#> 2 distributor oil market for o… 3830 13 isic_sec
#> 3 manufacturer/ … animal fats fatty acid p… 3830 6 isic_sec
#> 4 manufacturer/ … animal fats market for f… 3830 11 isic_sec
#> 5 distributor olive oil olive oil pr… 3830 3 isic_sec
#> 6 manufacturer/ … olive oil market for o… 3830 5 isic_sec
#> # ℹ 3 more variables: risk_categories <chr>, co2_footprint_upper <dbl>,
#> # co2_footprint_lower <dbl>
Created on 2023-11-15 with reprex v2.0.2
@AnneSchoenauer thanks a lot for your reprex. It helps a lot.
You asked for feedback about the reprex itself. In "Reprex do's and don'ts" you'll see:
Use the smallest, simplest, most built-in data possible.
So here I reduce your reprex quite a bit while preserving your point.
grouped_by
I use a stable value: "all". The value "isic_sec" is controvertial as it's a confusing crop of isic_4digit
.# styler: off
and # styler: on
to stop the automatic styling of reprex to break the allignment of my tribble()
. tribble()
because it shows the code in tabular form, so we can understand it without needing to print it.print()
as it's unnecessary..by
argument to mutate()
because it does more simply what group_by()
and ungroup()
do. In a single call it's simpler.tibble()
or tribble()
-- not data.frame()` -- because tibbles print more nicely (including colum types).# styler: off
library(dplyr, warn.conflicts = FALSE)
toy <- tribble(
~grouped_by, ~risk_category, ~co2_footprint,
"all", "medium", 1,
"all", "medium", 2,
"all", "high", 3,
"all", "high", 4,
)
toy |>
mutate(
co2_footprint_lower = min(co2_footprint),
co2_footprint_upper = max(co2_footprint),
.by = c("grouped_by", "risk_category")
)
#> # A tibble: 4 × 5
#> grouped_by risk_category co2_footprint co2_footprint_lower co2_footprint_upper
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 all medium 1 1 2
#> 2 all medium 2 1 2
#> 3 all high 3 3 4
#> 4 all high 4 3 4
# styler: on
Created on 2023-11-15 with reprex v2.0.2
And here is a more realistic reprex showing the entire workflow.
I draft two post-processing helpers -- following the open-close principle of good software design:
software entities (classes, modules, functions, and so on) should be open for extension, but closed for modification.
This avoids making the output emissions_profile*()
different to the output of sector_profile*()
, which would soon result in expensive maintenance.
BTW, is there a meaningful way to do something similar with sector_profile()
? Would users be interested in the rage of reductions?
# New post-processing helpers
add_co2_range <- function(data) {
data |>
mutate(
lower = min(co2_footprint),
upper = max(co2_footprint),
.by = c("grouped_by", "risk_category")
)
}
show_co2_range <- function(data) {
data |>
add_co2_range() |>
distinct(grouped_by, risk_category, lower, upper) |>
filter(!is.na(grouped_by))
}
library(dplyr, warn.conflicts = FALSE)
library(readr, warn.conflicts = FALSE)
library(tiltToyData)
library(tiltIndicator)
options(readr.show_col_types = FALSE)
# Show wider results, i.e. don't crop the final tibble
options(width = 200)
companies <- read_csv(toy_emissions_profile_any_companies())
products <- read_csv(toy_emissions_profile_products())
product_level <- emissions_profile(companies, products) |> unnest_product()
# Maybe too much?
product_level |> add_co2_range()
#> # A tibble: 49 × 8
#> companies_id grouped_by risk_category clustered activity_uuid_product_uuid co2_footprint lower upper
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 fleischerei-stiefsohn_00000005219477-001 all high stove 0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa 176. 58.1 176.
#> 2 fleischerei-stiefsohn_00000005219477-001 isic_sec high stove 0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa 176. 2.07 176.
#> 3 fleischerei-stiefsohn_00000005219477-001 tilt_sec high stove 0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa 176. 4.95 176.
#> 4 fleischerei-stiefsohn_00000005219477-001 unit high stove 0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa 176. 12.5 176.
#> 5 fleischerei-stiefsohn_00000005219477-001 unit_isic_sec high stove 0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa 176. 2.07 176.
#> 6 fleischerei-stiefsohn_00000005219477-001 unit_tilt_sec high stove 0a242b09-772a-5edf-8e82-9cb4ba52a258_ae39ee61-d4d0-4cce-93b4-0745344da5fa 176. 2.07 176.
#> 7 fleischerei-stiefsohn_00000005219477-001 all high oven be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc 58.1 58.1 176.
#> 8 fleischerei-stiefsohn_00000005219477-001 isic_sec medium oven be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc 58.1 58.1 58.1
#> 9 fleischerei-stiefsohn_00000005219477-001 tilt_sec medium oven be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc 58.1 58.1 58.1
#> 10 fleischerei-stiefsohn_00000005219477-001 unit medium oven be06d25c-73dc-55fb-965b-0f300453e380_98b48ff2-2200-4b08-9dec-9c7c0e3585bc 58.1 4.95 58.1
#> # ℹ 39 more rows
# Just right?
product_level |> show_co2_range()
#> # A tibble: 15 × 4
#> grouped_by risk_category lower upper
#> <chr> <chr> <dbl> <dbl>
#> 1 all high 58.1 176.
#> 2 isic_sec high 2.07 176.
#> 3 tilt_sec high 4.95 176.
#> 4 unit high 12.5 176.
#> 5 unit_isic_sec high 2.07 176.
#> 6 unit_tilt_sec high 2.07 176.
#> 7 isic_sec medium 58.1 58.1
#> 8 tilt_sec medium 58.1 58.1
#> 9 unit medium 4.95 58.1
#> 10 unit_isic_sec medium 58.1 58.1
#> 11 unit_tilt_sec medium 58.1 58.1
#> 12 all medium 4.95 12.5
#> 13 all low 2.07 2.07
#> 14 tilt_sec low 2.07 2.07
#> 15 unit low 2.07 2.07
Created on 2023-11-15 with reprex v2.0.2
Thanks @maurolepore! Also for your feedback on the reprex this is super valuable. And I will make sure to follow this next time! There are the following comments for now:
Sooo... maybe if you already have good ideas about 2 + 3 that would be helpful. Otherwise I would think about it with Tilman this afternoon ( he starts working at 12) and let you know of how we decide. But in general the reprex showed that you implemented the idea exactly as we wanted to have it.
@AnneSchoenauer
Great to know we arrived to a satisfactory implementation.
Regarding the conceptual limitations, How about adding some noise? Here I create a lil helper that defaults to adding 10% noise but we can change that to anything you think is acceptable. In this case I would also print a warning message alerting that the values are intentionally noisy, but I leave that out of the reprex for now.
library(dplyr, warn.conflicts = FALSE)
ranges <- tribble(
~lower, ~upper,
1, 7,
2, 8,
3, 9,
)
add_noise <- function(x, noise = 0.1) {
x * (1 + noise * rnorm(length(x)))
}
ranges |>
mutate(
noisy_lower = add_noise(lower),
noisy_upper = add_noise(upper)
)
#> # A tibble: 3 × 4
#> lower upper noisy_lower noisy_upper
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 7 0.910 6.26
#> 2 2 8 2.09 8.48
#> 3 3 9 3.10 10.3
Created on 2023-11-17 with reprex v2.0.2
@maurolepore This sounds really good!! Let me discuss with Tilman at 2 an thereafter we can take the decision! But I think this would be a very nice solution!
Dear @maurolepore, We noticed a problem with the 10% noise. The problem is that we want to be very transparent with banks in our methodology, so we would like to tell them that we added a 10% noise. However, if they again know that the noise is added and the specific number they can reverse the calculation. Therefore, we came up with the following rules. Let us know if this is possible technically:
For low risk category and high risk category combined with the profile ranking of 1 and the lowest rank please add the following (Case 3): For the lower bound for lowest risk category, please round always downwards:
For the upper bound of the higher please round always upwards:
With the case that there is only one product in one benchmark risk category (case 2), we would like to understand if this actually happens with real data - could you therefore implement some code that warns us if the case is happening? If the case indeed happens we need to find another solution but as we said, we don't think that this will happen.
Let me know if there are any questions!!!
The noise is random. How can that be reversed? Notice add_noice()
uses rnorm()
.
For example, I created x
but I don't share it with you. Can you tell me what's x
from this?
# add_noise <- function(x, noise = 0.1) {
# x * (1 + noise * rnorm(length(x)))
# }
#
# add_noise(x)
# [1] 0.8824415 2.1202853 3.0343012
OK, the solution is x <- c(1.1, 2.2, 3.3)
. Did you guess it?
Okay that makes more sense - we thought you just want to add 10% of the co2_footprint! I think that makes a lot of sense in this cases we would then need for the higher risk category and the upper bound this forumla:
# add_noise <- function(x, noise = 0.1) {
# x * (1 + noise * rnorm(length(x)))
# }
And for the low risk category lower bound this forumla:
# add_noise <- function(x, noise = 0.1) {
# x * (1 - noise * rnorm(length(x)))
# }
@Tilmon do you agree to this? I also chucked the formula into GPT to understand more about it. Have a look:
I think we can very transparently tell this to the banks. Shall we just start with 0.1 and have a look how this goes?
@Tilmon if you approve then @maurolepore you can implement it as such! Thanks a lot for this solution!
Using the randomized noise sounds good to me. Thanks @AnneSchoenauer and @maurolepore !
Great.
I believe we can create noise in the same way for both limits of the range? Note add_noise()
adds a difference in both directions -- sometimes positive and sometimes negative.
library(dplyr, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
ranges <- tibble(lower = 1:20, upper = lower + 50)
add_noise <- function(x, noise = 0.1) {
x * (1 + noise * rnorm(length(x)))
}
noisy <- ranges |>
mutate(
noisy_lower = add_noise(lower),
noisy_upper = add_noise(upper),
lower_difference = lower - noisy_lower,
upper_difference = upper - noisy_upper
)
noisy
#> # A tibble: 20 × 6
#> lower upper noisy_lower noisy_upper lower_difference upper_difference
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 51 0.920 41.5 0.0804 9.53
#> 2 2 52 1.82 55.1 0.179 -3.10
#> 3 3 53 3.31 60.7 -0.308 -7.65
#> 4 4 54 4.29 59.7 -0.287 -5.74
#> 5 5 55 5.13 55.1 -0.130 -0.102
#> 6 6 56 6.14 49.9 -0.140 6.12
#> 7 7 57 6.68 55.0 0.320 2.02
#> 8 8 58 9.20 56.2 -1.20 1.83
#> 9 9 59 9.23 57.4 -0.230 1.61
#> 10 10 60 9.80 59.3 0.197 0.681
#> 11 11 61 11.2 53.9 -0.182 7.12
#> 12 12 62 11.6 70.6 0.358 -8.61
#> 13 13 63 14.1 61.4 -1.12 1.56
#> 14 14 64 17.2 66.9 -3.17 -2.89
#> 15 15 65 14.7 84.7 0.284 -19.7
#> 16 16 66 20.1 69.6 -4.05 -3.59
#> 17 17 67 21.4 65.8 -4.37 1.23
#> 18 18 68 18.1 79.0 -0.146 -11.0
#> 19 19 69 18.8 64.2 0.217 4.82
#> 20 20 70 16.5 60.8 3.51 9.15
ggplot(noisy) +
geom_line(aes(lower, upper), color = "grey") +
geom_point(aes(noisy_lower, noisy_upper))
Created on 2023-11-17 with reprex v2.0.2
@maurolepore thanks for the info! I see an issue here and wonder what you say (also based on conversation with @AnneSchoenauer earlier today): If the noise can go in both directions, it may create a situation where the upper bound becomes lower than the lower bound, no?
E.g., let's say in a high category, the
10 + noise = 9
Is that true or am I mistaken? If true, can you instead ensure that noise only goes in one direction as Anne proposed in her comment?
Okay Tilman was quicker but I was writing the same comment :D. But let me know Mauro if this is unclear then I try to clarify (I was doing a little repo but maybe Tilmans's explanation already is enough)?
I see. Good point. Yeah, certainly we can avoid that problem by adding noise only to the left of the lower limit and to the right of the upper limit.
@AnneSchoenauer, is that what your two implementations do? One adds noise to left of the lower limit and other one to the right of the upper limit?
In any case I have enough to draft a PR and share it with you.
Thanks!
@maurolepore - Yes exactly that is why in the first equation there is a "1 + randomised noise" and in the other there was a "1 - randomised noise". But maybe this doesn't work like this.
Looking forward to the PR! Thanks @maurolepore
FYI
I realized R includes a jitter()
but the implementation seems complex. I would like to use it to benefit from it's robustness but the way to set the amount of jitter is quite complicated.
So I'm currently going with this version of your implementation, which seems to respond well to side and amont:
jitter_left <- function(x, amount = 0.1) {
x - absolute_jitter(x, amount)
}
jitter_right <- function(x, amount = 0.1) {
x + absolute_jitter(x, amount)
}
absolute_jitter <- function(x, amount = 0.1) {
abs(amount * rnorm(length(x)))
}
library(tidyverse)
amount <- 0.25
x <- 1:10
data <- tibble(
x,
left = jitter_left(x, amount),
right = jitter_right(x, amount)
)
ggplot(data) +
geom_line(aes(x, x)) +
geom_point(aes(left, x), color = "blue") +
geom_point(aes(right, x), color = "red")
amount <- 0.75
x <- 1:10
data <- tibble(
x,
left = jitter_left(x, amount),
right = jitter_right(x, amount)
)
ggplot(data) +
geom_line(aes(x, x)) +
geom_point(aes(left, x), color = "blue") +
geom_point(aes(right, x), color = "red")
Created on 2023-11-18 with reprex v2.0.2
The banks would like to see the Co2e range in which a product is located if it is categorised to a certain category like
low
,medium
,high
. So for example if the benchmark isall
then all products that are in the categoryhigh
should be associated with the same Co2e range. It would be great to see those ranges in the data as the columnCo2_lower
andCo2_higher
.This would be nice to have implemented on the product and company level for the emission (upstream) profiles.
Please let me know if this is unclear.