Closed maurolepore closed 5 months ago
Here is my idea. All of this would happen in tiltIndicatorBefore.
The new ecoinvent dataset has products that do and don't exist in europages. It's passed to emissions_profile_add_values_to_categorize(), which adds the columns grouped_by and values_to_categorize. The output is then reduced to only europages products by joining it to a "co2" dataset (products or inputs) on activity_uuid_product_uuid. The result is a new version of the "co2" dataset (products or inputs), which can be passed to emissions_profile*().
cc' @AnneSchoenauer @kalashsinghal
library(dplyr, warn.conflicts = FALSE)
path_to_tilt_indicator <- "~/git/tiltIndicator"
devtools::load_all(path_to_tilt_indicator)
#> ℹ Loading tiltIndicator
# NEW PRE-PROCESSING HELPER (maybe defined in tiltIndicatorBefore)
compute_values_to_categorize <- function(co2, ecoinvent) {
  # Rank every ecoinvent row, including products that are not in europages
  ranked <- emissions_profile_any_add_values_to_categorize(ecoinvent)
  cols_to_add <- c("grouped_by", "values_to_categorize")
  cols_to_join_by <- "activity_uuid_product_uuid"
  relevant_cols <- select(ranked, all_of(cols_to_add), all_of(cols_to_join_by))
  # Keep only the rows of `co2`, i.e. the products that exist in europages
  left_join(co2, relevant_cols)
}
# NEW ECOINVENT DATASET
# Has products that do and don't exist in europages
# Has the crucial columns required to add the `values_to_categorize`
ecoinvent <- tribble(
~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
"in-ep", "a", "a", "1234", "1",
"not-in-ep", "a", "a", "1234", "1"
)
# NEW PRODUCTS DATASET
# Only has products that do exist in europages
old_products <- tribble(
~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
"in-ep", "a", "a", "1234", "1"
)
new_products <- compute_values_to_categorize(old_products, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`
new_products |> relocate(c("grouped_by", "values_to_categorize"))
#> # A tibble: 6 × 7
#> grouped_by values_to_categorize activity_uuid_produc…¹ tilt_sector unit
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 all 0.75 in-ep a a
#> 2 isic_4digit 0.75 in-ep a a
#> 3 tilt_sector 0.75 in-ep a a
#> 4 unit 0.75 in-ep a a
#> 5 unit_isic_4digit 0.75 in-ep a a
#> 6 unit_tilt_sector 0.75 in-ep a a
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: isic_4digit <chr>, co2_footprint <chr>
# NEW INPUTS DATASET
# Only has products that do exist in europages
old_inputs <- tribble(
~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint, ~type, ~sector, ~subsector,
"in-ep", "a", "a", "a", "a", "1234", "1", "ipr", "total", "energy",
)
new_inputs <- compute_values_to_categorize(old_inputs, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`
new_inputs |> relocate(c("grouped_by", "values_to_categorize"))
#> # A tibble: 6 × 12
#> grouped_by values_to_categorize activity_uuid_produc…¹ input_activity_uuid_…²
#> <chr> <dbl> <chr> <chr>
#> 1 all 0.75 in-ep a
#> 2 isic_4digit 0.75 in-ep a
#> 3 tilt_sector 0.75 in-ep a
#> 4 unit 0.75 in-ep a
#> 5 unit_isic_… 0.75 in-ep a
#> 6 unit_tilt_… 0.75 in-ep a
#> # ℹ abbreviated names: ¹activity_uuid_product_uuid,
#> # ²input_activity_uuid_product_uuid
#> # ℹ 8 more variables: input_tilt_sector <chr>, input_tilt_subsector <chr>,
#> # input_unit <chr>, input_isic_4digit <chr>, input_co2_footprint <chr>,
#> # type <chr>, sector <chr>, subsector <chr>
Created on 2023-11-13 with reprex v2.0.2
@AnneSchoenauer and @kalashsinghal please discuss whether this makes sense or whether you have a different idea. Remember I'm not familiar with data preparation, so I may easily be wrong.
@maurolepore Do you want the values in the grouped_by column to be the way you mentioned above, or should they be like isic_sec, tilt_sec, unit_isic_sec, and unit_tilt_sec (because that's how they are in the tiltIndicator output)?
Also adding all and unit as a group to it, right @kalashsinghal?
@AnneSchoenauer @maurolepore What's the need of ecoinvent products which don't exist in Europages?
FYI: I see that I have to add sector information to the ecoinvent datasets with completely different code to provide this data. The current code only assigns sectoral information to the ecoinvent products which exist in Europages.
@AnneSchoenauer
Also adding all and unit as a group to it right @kalashsinghal
The reprex above shows that grouped_by already has the values "all" and "unit". Is that what you mean?
@kalashsinghal
What's the need of ecoinvent products which don't exist in Europages?
As far as I know we don't use them. So if you already have code that ignores them, then great. I only added them in my reprex because that's how I understood ecoinvent is defined. Without the products that do not exist in europages, as far as I can see, my draft of ecoinvent would be just like the "co2" datasets.
@kalashsinghal
@maurolepore Do you want the values in grouped_by column to be the way you mentioned above or they should be like isic_sec, tilt_sec, unit_isic_sec, and unit_tilt_sec (because that's how they are in tiltIndicator output) ?
I'm not sure what you mean by "that's how they are in tiltIndicator output". I think you can use the columns as I show in the reprex above, which will be passed to grouped_by and used in tiltIndicator just fine. Here is a new reprex showing the workflow all the way to the output of tiltIndicator. It looks good to me. Do you see any issue I don't?
BTW since we seem to not need the products that do not exist in europages, this new reprex is simpler and focuses only on products that do exist in europages. Here they are represented by *uuid = "a".
library(dplyr, warn.conflicts = FALSE)
devtools::load_all()
#> ℹ Loading tiltIndicator
# PRE-PROCESSING
compute_profile_ranking <- function(co2, ecoinvent) {
  # Rank every ecoinvent row
  ranked <- emissions_profile_any_compute_profile_ranking(ecoinvent)
  cols_to_add <- c("grouped_by", "profile_ranking")
  cols_to_join_by <- "activity_uuid_product_uuid"
  relevant_cols <- select(ranked, all_of(cols_to_add), all_of(cols_to_join_by))
  # Keep only the rows of `co2`
  left_join(co2, relevant_cols)
}
ecoinvent <- tribble(
~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
"a", "a", "a", "1234", "1",
)
old_products <- tribble(
~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
"a", "a", "a", "1234", "1"
)
new_products <- compute_profile_ranking(old_products, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`
new_products |> relocate(grouped_by, activity_uuid_product_uuid)
#> # A tibble: 6 × 7
#> grouped_by activity_uuid_produc…¹ tilt_sector unit isic_4digit co2_footprint
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 all a a a 1234 1
#> 2 isic_4digit a a a 1234 1
#> 3 tilt_sector a a a 1234 1
#> 4 unit a a a 1234 1
#> 5 unit_isic_… a a a 1234 1
#> 6 unit_tilt_… a a a 1234 1
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 1 more variable: profile_ranking <dbl>
# tiltIndicator
companies <- tribble(
~company_id, ~clustered, ~activity_uuid_product_uuid, ~sector, ~subsector, ~tilt_sector, ~tilt_subsector, ~type,
"a", "a", "a", "total", "energy", "a", "a", "ipr"
)
result <- emissions_profile(companies, new_products)
result |> unnest_product()
#> # A tibble: 6 × 6
#> companies_id grouped_by risk_category clustered activity_uuid_product_…¹
#> <chr> <chr> <chr> <chr> <chr>
#> 1 a all high a a
#> 2 a isic_4digit high a a
#> 3 a tilt_sector high a a
#> 4 a unit high a a
#> 5 a unit_isic_4digit high a a
#> 6 a unit_tilt_sector high a a
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 1 more variable: co2_footprint <chr>
result |> unnest_company()
#> # A tibble: 18 × 4
#> companies_id grouped_by risk_category value
#> <chr> <chr> <chr> <dbl>
#> 1 a all high 1
#> 2 a all medium 0
#> 3 a all low 0
#> 4 a isic_4digit high 1
#> 5 a isic_4digit medium 0
#> 6 a isic_4digit low 0
#> 7 a tilt_sector high 1
#> 8 a tilt_sector medium 0
#> 9 a tilt_sector low 0
#> 10 a unit high 1
#> 11 a unit medium 0
#> 12 a unit low 0
#> 13 a unit_isic_4digit high 1
#> 14 a unit_isic_4digit medium 0
#> 15 a unit_isic_4digit low 0
#> 16 a unit_tilt_sector high 1
#> 17 a unit_tilt_sector medium 0
#> 18 a unit_tilt_sector low 0
Created on 2023-11-13 with reprex v2.0.2
@kalashsinghal
Note that if we keep using the "input_" prefix then you'll need to use an ecoinvent dataset that has that prefix in the names -- at least with this implementation of emissions_profile_any_compute_profile_ranking().
BTW I think the "input_" prefix is very expensive to maintain and thus prone to bugs. I may remove it internally (see https://github.com/2DegreesInvesting/tiltIndicator/issues/607) but it would be best to remove it from everywhere. products$tilt_sector is clearly different from input$tilt_sector, and input$input_tilt_sector is redundant and hard to program with. cc' @AnneSchoenauer
# styler: off
library(dplyr, warn.conflicts = FALSE)
devtools::load_all()
#> ℹ Loading tiltIndicator
# PRE-PROCESSING
compute_profile_ranking <- function(co2, ecoinvent) {
  # Rank every ecoinvent row
  ranked <- emissions_profile_any_compute_profile_ranking(ecoinvent)
  cols_to_add <- c("grouped_by", "profile_ranking")
  cols_to_join_by <- "activity_uuid_product_uuid"
  relevant_cols <- select(ranked, all_of(cols_to_add), all_of(cols_to_join_by))
  # Keep only the rows of `co2`
  left_join(co2, relevant_cols)
}
ecoinvent <- tribble(
~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
"a", "a", "a", "1234", "1",
) |>
# Here is what you need to rename
rename_with(~ paste0("input_", .x), -activity_uuid_product_uuid)
old_inputs <- tribble(
~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint, ~type, ~sector, ~subsector,
"a", "a", "a", "a", "a", "1234", "1", "ipr", "total", "energy"
)
new_inputs <- compute_profile_ranking(old_inputs, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`
new_inputs |> relocate(grouped_by, activity_uuid_product_uuid)
#> # A tibble: 6 × 12
#> grouped_by activity_uuid_produc…¹ input_activity_uuid_…² input_tilt_sector
#> <chr> <chr> <chr> <chr>
#> 1 all a a a
#> 2 input_isic_4d… a a a
#> 3 input_tilt_se… a a a
#> 4 input_unit a a a
#> 5 input_unit_in… a a a
#> 6 input_unit_in… a a a
#> # ℹ abbreviated names: ¹activity_uuid_product_uuid,
#> # ²input_activity_uuid_product_uuid
#> # ℹ 8 more variables: input_tilt_subsector <chr>, input_unit <chr>,
#> # input_isic_4digit <chr>, input_co2_footprint <chr>, type <chr>,
#> # sector <chr>, subsector <chr>, profile_ranking <dbl>
# tiltIndicator
companies <- tribble(
~company_id, ~clustered, ~activity_uuid_product_uuid, ~sector, ~subsector, ~tilt_sector, ~tilt_subsector, ~type,
"a", "a", "a", "total", "energy", "a", "a", "ipr"
)
result <- emissions_profile_upstream(companies, new_inputs)
result |> unnest_product()
#> # A tibble: 6 × 7
#> companies_id grouped_by risk_category clustered activity_uuid_produc…¹
#> <chr> <chr> <chr> <chr> <chr>
#> 1 a all high a a
#> 2 a input_isic_4digit high a a
#> 3 a input_tilt_sector high a a
#> 4 a input_unit high a a
#> 5 a input_unit_input_… high a a
#> 6 a input_unit_input_… high a a
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 2 more variables: input_activity_uuid_product_uuid <chr>,
#> # input_co2_footprint <chr>
result |> unnest_company()
#> # A tibble: 18 × 4
#> companies_id grouped_by risk_category value
#> <chr> <chr> <chr> <dbl>
#> 1 a all high 1
#> 2 a all medium 0
#> 3 a all low 0
#> 4 a input_isic_4digit high 1
#> 5 a input_isic_4digit medium 0
#> 6 a input_isic_4digit low 0
#> 7 a input_tilt_sector high 1
#> 8 a input_tilt_sector medium 0
#> 9 a input_tilt_sector low 0
#> 10 a input_unit high 1
#> 11 a input_unit medium 0
#> 12 a input_unit low 0
#> 13 a input_unit_input_isic_4digit high 1
#> 14 a input_unit_input_isic_4digit medium 0
#> 15 a input_unit_input_isic_4digit low 0
#> 16 a input_unit_input_tilt_sector high 1
#> 17 a input_unit_input_tilt_sector medium 0
#> 18 a input_unit_input_tilt_sector low 0
# styler: on
Created on 2023-11-13 with reprex v2.0.2
@kalashsinghal
What's the need of ecoinvent products which don't exist in Europages?
As far as I know we don't use them. So if you already have code that ignores them, then great. I only added them in my reprex because that's how I understood ecoinvent is defined. Without the products that do not exist in europages, as far as I can see, my draft of ecoinvent would be just like the "co2" datasets.
Hi both - I think we do need all ecoinvent activities, and we would also need to match all ecoinvent activities to tilt_sectors and tilt_subsectors to be able to do the benchmarking.
The point here, Kalash, is that we want to compare ALL ecoinvent activities to each other, and then pick only those that are also in europages. Therefore, the europages products are compared to ALL ecoinvent activities and not only to the products that could be matched with europages. This was the whole point of creating the issue here. So it is crucial to include ALL ecoinvent activities, creating for them the values_to_categorise with regard to the six benchmarks (tilt_sec, tilt_sec_unit, isic_sec, isic_sec_unit, all, unit).
If you have questions, let me know @kalashsinghal and we can discuss.
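To make the order of operations concrete, here is a minimal base-R sketch. The toy data, the in_europages flag, and the helper relative_rank() are assumptions made up for illustration; they are not part of tiltIndicator. The sketch only contrasts ranking against all ecoinvent activities and then filtering, with filtering first and then ranking.

```r
# Hypothetical toy data: four ecoinvent activities, two of which match europages.
ecoinvent <- data.frame(
  activity_uuid_product_uuid = c("a", "b", "c", "d"),
  co2_footprint = c(10, 20, 30, 40),
  in_europages = c(TRUE, TRUE, FALSE, FALSE)
)

# Hypothetical helper: relative rank, i.e. rank(x) / length(x).
relative_rank <- function(x) rank(x) / length(x)

# Rank against ALL ecoinvent activities first, then keep europages products.
ecoinvent$profile_ranking <- relative_rank(ecoinvent$co2_footprint)
rank_then_filter <- ecoinvent[ecoinvent$in_europages, ]

# For contrast: filter first, then rank only among europages products.
filter_then_rank <- ecoinvent[ecoinvent$in_europages, ]
filter_then_rank$profile_ranking <- relative_rank(filter_then_rank$co2_footprint)

rank_then_filter$profile_ranking
#> [1] 0.25 0.50
filter_then_rank$profile_ranking
#> [1] 0.5 1.0
```

With these toy numbers, the same two products look low when benchmarked against all of ecoinvent and much higher when benchmarked only against each other, which is why the ranking has to happen before the filter.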
Oh, of course -- that was the whole point! Thanks Anne :-)
The point here Kalash is that we want to compare ALL ecoinvent activities to each other. And then we only pick those that are also in europages. Therefore, the europages products are compared to ALL ecoinvent activities and not only to the products that were able to be matched with europages. This was the whole point of creating the issue here. So this would be crucial to include ALL ecoinvent activities, creating for them the values_to_categorise with regard to the six benchmarks (tilt_sec, tilt_sec_unit, isic_sec, isic_sec_unit, all, unit).
@AnneSchoenauer @maurolepore This means we need these three datasets:
Emissions_profile_products dataset which only has europages products (with grouped_by and profile_ranking)
Emissions_profile_upstream_products dataset which only has europages products (with grouped_by and profile_ranking)
[...] (with grouped_by and profile_ranking)
Am I right?
Please let me know if it makes sense to combine Emissions_profile_products and Emissions_profile_upstream_products into a single dataframe. In that case we might need to create separate columns profile_ranking and profile_ranking_upstream in a single dataset; the grouped_by column, however, would remain common. Thanks!
Yes, I believe we need those three datasets. So no, I would not combine products and inputs (upstream-products) into a single dataset.
Let's see what Anne says, but do you see any good reason to combine them?
What I would do (eventually) is remove the "input_" prefix as explained above.
What I would do (eventually) is remove the "input_" prefix as explained above.
@maurolepore I believe the column names would remain the same (for example input_tilt_sector) and only the values in the grouped_by column would change, to values like tilt_sector. Right?
In the current implementation grouped_by stores values that come from the column names. In the reprex above you can see that if ecoinvent has the column name "tilt_sector" then grouped_by gets the value "tilt_sector". And if ecoinvent has the column name "input_tilt_sector" then grouped_by gets the value "input_tilt_sector".
For now it seems simpler to continue to use the column names we used before. But eventually I think we should remove the "input_" prefix from everywhere.
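As a rough illustration of how labels built from column names carry the prefix along, here is a small base-R sketch. The helper label_benchmark() and the benchmarks list are hypothetical and only mimic the grouped_by values visible in the reprex output above; this is not how tiltIndicator is actually implemented.

```r
# Hypothetical: the six benchmarks, each described by the columns it groups by.
benchmarks <- list(
  "all", "isic_4digit", "tilt_sector", "unit",
  c("unit", "isic_4digit"), c("unit", "tilt_sector")
)

# Hypothetical helper: build a grouped_by label from (optionally prefixed) column names.
label_benchmark <- function(cols, prefix = "") {
  # "all" is not a column name, so it never gets a prefix
  if (identical(cols, "all")) return("all")
  paste0(prefix, cols, collapse = "_")
}

products_labels <- vapply(benchmarks, label_benchmark, character(1))
inputs_labels <- vapply(benchmarks, label_benchmark, character(1), prefix = "input_")
```

Under these assumptions, products_labels ends in "unit_tilt_sector" while inputs_labels ends in "input_unit_input_tilt_sector": every prefixed column name leaks into the label, which is one way to see why the prefix is costly to carry around.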
Hi both,
As I said before - we won't need any data from europages. Only data from ecoinvent.
Please take the ecoinvent data and take only the activities in the geographies that we filtered (see geography filter from Bob). The columns that we need are:
To arrive at the tilt_sec and tilt_subsector, please first use the sector mapper and, for those where the sector mapper doesn't work, sector resolving.
You can also see the whole idea here: https://github.com/2DegreesInvesting/tiltIndicator/issues/566
Thanks a lot.
@maurolepore You might have updated it somewhere else before, however please update once again the input files' structure that you would need for tiltIndicator after this change. I will adjust my code accordingly. Thanks!
Note: Only for the Emissions_profile and Emissions_profile_upstream indicators
@kalashsinghal,
Since this PR, tiltIndicator already knows how to handle the new data. Sure, I can update tiltToyData so that the toy "co2" datasets (products and inputs) gain the columns grouped_by and profile_ranking. But that's all that tiltIndicator uses. So I guess that shouldn't block you?
What I would do in the toy datasets is what I showed in the reprex above (see new_products and new_inputs). But the conversation seems to have continued between you and Anne. If that's good for you, then sure, I can create a PR to tiltToyData and ask for your review.
@maurolepore Yes please create a PR for the toyData which will be inputs to tiltIndicator. I want to remain aligned while working on the input requirements. And I guess there could be different datasets I might give you but I am not sure about that as of yet. We can discuss more about that in that PR.
@maurolepore I have emailed you the toy datasets which could be possible inputs to tiltIndicator. Please create tickets for any new issue.
Thanks @kalashsinghal, the datasets look great. Note they have an unexpected first column but that's easy to remove.
I share a reprex in the related PR in tiltIndicatorBefore (here) because I wanted to leave a record of the data in that PR.
I understand these are public, toy datasets, so I'll go ahead and add them to this repo.
Thanks!
For the record,
These are the current datasets (stored in inst/extdata/)
toy_emissions_profile_any_companies()
toy_emissions_profile_products()
toy_emissions_profile_upstream_products()
toy_sector_profile_any_scenarios()
toy_sector_profile_companies()
toy_sector_profile_upstream_companies()
toy_sector_profile_upstream_products()
This issue aims to develop new datasets or new versions of existing emissions* datasets. For each dataset I need to understand if it is either
1. A new version of a dataset whose concept is already captured by an existing dataset. In this case we need to update the data but preserve the name. The old data may remain for a while in a deprecated state or be retired immediately.
2. A new dataset whose concept is not yet captured by an existing dataset. In this case we need to introduce a new name that captures the new concept. In this case:
I understand that the new emissions*companies dataset is of type 1. The two new emissions*products* datasets are less clear:
If the old emissions*products* datasets are no longer useful, then their names become available, so we could interpret the new emissions*products* datasets as a new version and move the old emissions*products* datasets to a deprecated/ directory.
If the old emissions*products* datasets are still useful, then their names are not available and we must interpret the new emissions*products* datasets as conceptually new and introduce the new names emissions*products_ecoinvent.
@AnneSchoenauer and @kalashsinghal what do you think? Are the old emissions*products* datasets still useful?
While I wait for answers, I'll prepare a PR with my proposed changes so that you can see exactly what it looks like, and how the deprecation could be implemented.
Hi everyone,
As I was not here last week, I think I am a bit puzzled. Let me try to summarise my thoughts:
Maybe @kalashsinghal you have an opinion about this, and maybe @maurolepore this already helps. If not, please let me know and I'll try to understand it better. Best, Anne
@AnneSchoenauer I can apply benchmarking to the old datasets as well. That's not the issue. The issue is that I am not sure whether to apply benchmarking and ranking to the old datasets, because the old datasets also consider the Multi Match methodology. And you said in Slack that the multi match approach will be used in the future. So this means that in the future I will have to apply benchmarking to the old datasets as well if they are used. Right?
@kalashsinghal okay, got it. So we did the benchmarking wrong with multi matches. This is why the old dataset would also be wrong. We therefore left the multi matches out for this round. We are currently working on a methodological solution here but won't be able to share it in the next two weeks, I think!
@AnneSchoenauer Thanks for the clarity on this! @maurolepore We have to keep the old datasets in the deprecated state for, hopefully, two weeks. After that, based on new insights from Anne, we will either keep them or delete them. I hope it's clear now. Please let us know if you have any questions!
It's best to discuss in tiltToyDataPrivate: https://github.com/2DegreesInvesting/tiltToyDataPrivate/pull/1 That's a private repo where we can more easily discuss private issues.
Once the conversation is solved, I'll bring the public data and code back here.
See 2DegreesInvesting/tiltToyDataPrivate#1
Relates to 2DegreesInvesting/tiltIndicator#566
We need a new ecoinvent dataset:
values_to_categorize (or similar) computed as rank(co2_footprint) / length(co2_footprint).
The structure should be very similar to the "co2" datasets. I understand that the main difference between ecoinvent and the "co2" datasets is that the "co2" datasets only have products that exist in europages. If that's the case, then the difference is in the rows but not the columns.
@AnneSchoenauer drafted the columns here, and @kalashsinghal may have good ideas since he's familiar with the way we pre-process data.
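For reference, here is a small base-R sketch of that rank-based formula. The helper name relative_rank() is made up for illustration. Note that rank() averages ties by default, so two equal footprints both get rank 1.5, which divided by 2 gives 0.75; this agrees with the 0.75 that the first reprex in this thread reports for an ecoinvent table with two identical footprints.

```r
# Hypothetical helper wrapping the formula rank(x) / length(x).
relative_rank <- function(co2_footprint) {
  rank(co2_footprint) / length(co2_footprint)
}

# Two tied footprints: both get the average rank 1.5, so 1.5 / 2 = 0.75.
tied <- relative_rank(c(1, 1))
tied
#> [1] 0.75 0.75

# Four distinct footprints: the ranks 1, 3, 2, 4 are divided by 4.
distinct <- relative_rank(c(10, 30, 20, 40))
distinct
#> [1] 0.25 0.75 0.50 1.00
```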