`profile_ranking` doesn't distribute the `activity_uuid_product_uuid`s in three equal parts by using thresholds `1/3` (low) and `2/3` (high) for emission profile indicator #789
I found an issue while investigating why the low and high thresholds (1/3 and 2/3) do not split the `activity_uuid_product_uuid`s into three equal parts.
We calculate the profile ranking for the emission profile indicator by ranking the `co2_footprint` values within each of six benchmarks (`all`, `tilt_sector`, `isic_4digit`, `unit`, `unit_tilt_sector`, `unit_isic_4digit`) using this formula:
As the function above shows, we first rank the `co2_footprint` values and then divide each rank by the number of distinct values in the `co2_footprint` column (for that specific group). Please ignore the fact that the grouping step does not appear in the function above!
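A minimal sketch of the calculation as described above, assuming a dense rank (tied values share a rank and no rank is skipped). The helper name `rank_by_distinct` is hypothetical, the grouping step is left out, and base R is used only so the snippet runs without the package:

```r
# Hypothetical helper: dense rank divided by the number of distinct values.
rank_by_distinct <- function(x) {
  distinct_sorted <- sort(unique(x))
  # match() returns each value's position among the sorted distinct values,
  # which is a dense rank: ties share a rank and no rank is skipped.
  match(x, distinct_sorted) / length(distinct_sorted)
}

co2_footprint <- c(1, 2, 3, 3, 3, 4, 4, 5, 6, 7)
round(rank_by_distinct(co2_footprint), 3)
#> [1] 0.143 0.286 0.429 0.429 0.429 0.571 0.571 0.714 0.857 1.000
```

With ten rows but only seven distinct values, each step in the ranking is 1/7 rather than 1/10.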
This methodology will not split the `activity_uuid_product_uuid`s into three equal parts at the thresholds 1/3 and 2/3, because the ranks are divided by the number of unique `co2_footprint` values rather than by the number of `activity_uuid_product_uuid`s. There are many cases where different `activity_uuid_product_uuid`s share the same `co2_footprint` value. When that happens, the numerical gap between consecutive ranks is larger than it should be. That gap is set by the denominator of the ranked value, and the denominator needs to be correct if we also want the thresholds 1/3 and 2/3 to split the `activity_uuid_product_uuid`s into three equal parts. Please have a look at the reprex below for a better understanding:
``` r
library(readr)
library(dplyr)
devtools::load_all(".")
#> ℹ Loading tiltIndicator
options(width = 500)

example <- tibble(
  activity_uuid_product_uuid = c("uuid1", "uuid2", "uuid3", "uuid4", "uuid5", "uuid6", "uuid7", "uuid8", "uuid9", "uuid10"),
  co2_footprint = c(1, 2, 3, 3, 3, 4, 4, 5, 6, 7),
  isic_4digit = c("'3420'", "'3420'", "'3420'", "'3420'", "'3420'", "'3420'", "'3420'", "'3420'", "'3420'", "'3420'"),
  tilt_sector = c("sec", "sec", "sec", "sec", "sec", "sec", "sec", "sec", "sec", "sec"),
  unit = c("kg", "kg", "kg", "kg", "kg", "kg", "kg", "kg", "kg", "kg"),
)

example_output <- epa_compute_profile_ranking(example)
example_output |>
  print(n = Inf)
#> # A tibble: 60 × 7
#>    grouped_by       profile_ranking activity_uuid_product_uuid co2_footprint isic_4digit tilt_sector unit
#>    <chr>                      <dbl> <chr>                              <dbl> <chr>       <chr>       <chr>
#>  1 all                        0.143 uuid1                                  1 '3420'      sec         kg
#>  2 all                        0.286 uuid2                                  2 '3420'      sec         kg
#>  3 all                        0.429 uuid3                                  3 '3420'      sec         kg
#>  4 all                        0.429 uuid4                                  3 '3420'      sec         kg
#>  5 all                        0.429 uuid5                                  3 '3420'      sec         kg
#>  6 all                        0.571 uuid6                                  4 '3420'      sec         kg
#>  7 all                        0.571 uuid7                                  4 '3420'      sec         kg
#>  8 all                        0.714 uuid8                                  5 '3420'      sec         kg
#>  9 all                        0.857 uuid9                                  6 '3420'      sec         kg
#> 10 all                        1     uuid10                                 7 '3420'      sec         kg
#> 11 isic_4digit                0.143 uuid1                                  1 '3420'      sec         kg
#> 12 isic_4digit                0.286 uuid2                                  2 '3420'      sec         kg
#> 13 isic_4digit                0.429 uuid3                                  3 '3420'      sec         kg
#> 14 isic_4digit                0.429 uuid4                                  3 '3420'      sec         kg
#> 15 isic_4digit                0.429 uuid5                                  3 '3420'      sec         kg
#> 16 isic_4digit                0.571 uuid6                                  4 '3420'      sec         kg
#> 17 isic_4digit                0.571 uuid7                                  4 '3420'      sec         kg
#> 18 isic_4digit                0.714 uuid8                                  5 '3420'      sec         kg
#> 19 isic_4digit                0.857 uuid9                                  6 '3420'      sec         kg
#> 20 isic_4digit                1     uuid10                                 7 '3420'      sec         kg
#> 21 tilt_sector                0.143 uuid1                                  1 '3420'      sec         kg
#> 22 tilt_sector                0.286 uuid2                                  2 '3420'      sec         kg
#> 23 tilt_sector                0.429 uuid3                                  3 '3420'      sec         kg
#> 24 tilt_sector                0.429 uuid4                                  3 '3420'      sec         kg
#> 25 tilt_sector                0.429 uuid5                                  3 '3420'      sec         kg
#> 26 tilt_sector                0.571 uuid6                                  4 '3420'      sec         kg
#> 27 tilt_sector                0.571 uuid7                                  4 '3420'      sec         kg
#> 28 tilt_sector                0.714 uuid8                                  5 '3420'      sec         kg
#> 29 tilt_sector                0.857 uuid9                                  6 '3420'      sec         kg
#> 30 tilt_sector                1     uuid10                                 7 '3420'      sec         kg
#> 31 unit                       0.143 uuid1                                  1 '3420'      sec         kg
#> 32 unit                       0.286 uuid2                                  2 '3420'      sec         kg
#> 33 unit                       0.429 uuid3                                  3 '3420'      sec         kg
#> 34 unit                       0.429 uuid4                                  3 '3420'      sec         kg
#> 35 unit                       0.429 uuid5                                  3 '3420'      sec         kg
#> 36 unit                       0.571 uuid6                                  4 '3420'      sec         kg
#> 37 unit                       0.571 uuid7                                  4 '3420'      sec         kg
#> 38 unit                       0.714 uuid8                                  5 '3420'      sec         kg
#> 39 unit                       0.857 uuid9                                  6 '3420'      sec         kg
#> 40 unit                       1     uuid10                                 7 '3420'      sec         kg
#> 41 unit_isic_4digit           0.143 uuid1                                  1 '3420'      sec         kg
#> 42 unit_isic_4digit           0.286 uuid2                                  2 '3420'      sec         kg
#> 43 unit_isic_4digit           0.429 uuid3                                  3 '3420'      sec         kg
#> 44 unit_isic_4digit           0.429 uuid4                                  3 '3420'      sec         kg
#> 45 unit_isic_4digit           0.429 uuid5                                  3 '3420'      sec         kg
#> 46 unit_isic_4digit           0.571 uuid6                                  4 '3420'      sec         kg
#> 47 unit_isic_4digit           0.571 uuid7                                  4 '3420'      sec         kg
#> 48 unit_isic_4digit           0.714 uuid8                                  5 '3420'      sec         kg
#> 49 unit_isic_4digit           0.857 uuid9                                  6 '3420'      sec         kg
#> 50 unit_isic_4digit           1     uuid10                                 7 '3420'      sec         kg
#> 51 unit_tilt_sector           0.143 uuid1                                  1 '3420'      sec         kg
#> 52 unit_tilt_sector           0.286 uuid2                                  2 '3420'      sec         kg
#> 53 unit_tilt_sector           0.429 uuid3                                  3 '3420'      sec         kg
#> 54 unit_tilt_sector           0.429 uuid4                                  3 '3420'      sec         kg
#> 55 unit_tilt_sector           0.429 uuid5                                  3 '3420'      sec         kg
#> 56 unit_tilt_sector           0.571 uuid6                                  4 '3420'      sec         kg
#> 57 unit_tilt_sector           0.571 uuid7                                  4 '3420'      sec         kg
#> 58 unit_tilt_sector           0.714 uuid8                                  5 '3420'      sec         kg
#> 59 unit_tilt_sector           0.857 uuid9                                  6 '3420'      sec         kg
#> 60 unit_tilt_sector           1     uuid10                                 7 '3420'      sec         kg
```
As the reprex shows, the `co2_footprint` value 3 gets rank 0.429, which is above the 1/3 threshold, so only two of the ten `activity_uuid_product_uuid`s fall below the 1/3 threshold. The issue arises because, within a group, the number of unique `activity_uuid_product_uuid`s (10 here) differs from the number of unique `co2_footprint` values (7 here).
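To make the imbalance concrete, here is a small sketch that counts how many rows of the example land in each band. The `medium` label and the handling of values exactly on a threshold (via `cut()`'s default right-closed intervals) are assumptions; no ranking in this example sits exactly on 1/3 or 2/3, so the boundary choice does not affect the counts:

```r
co2_footprint <- c(1, 2, 3, 3, 3, 4, 4, 5, 6, 7)

# Ranking as described in this issue: dense rank / number of distinct values.
distinct_sorted <- sort(unique(co2_footprint))
profile_ranking <- match(co2_footprint, distinct_sorted) / length(distinct_sorted)

# Assumed bands: (0, 1/3] low, (1/3, 2/3] medium, (2/3, 1] high.
band <- cut(
  profile_ranking,
  breaks = c(0, 1 / 3, 2 / 3, 1),
  labels = c("low", "medium", "high")
)

table(band)
# low = 2, medium = 5, high = 3
```

An even split of ten rows would put three or four rows in each band; here the tied value 3 pushes three rows over the low threshold at once.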
Please let me know if you have any questions, as it's not easy to understand! :)
Created on 2024-05-29 with reprex v2.0.2

@Tilmon @AnneSchoenauer
cc @maurolepore