2DegreesInvesting / tiltToyData

Toy datasets for TILT
https://2degreesinvesting.github.io/tiltToyData/
GNU General Public License v3.0

Update the toy `emission_profile*` datasets #7

Closed maurolepore closed 5 months ago

maurolepore commented 8 months ago

See 2DegreesInvesting/tiltToyDataPrivate#1

Relates to 2DegreesInvesting/tiltIndicator#566

We need a new ecoinvent dataset:

The structure should be very similar to the "co2" datasets. I understand that the main difference between ecoinvent and the "co2" datasets is that the "co2" datasets only have products that exist in europages. If that's the case, then the difference is in the rows but not the columns.

@AnneSchoenauer drafted the columns here, and @kalashsinghal may have good ideas since he's familiar with the way we pre-process data.

maurolepore commented 7 months ago

Here is my idea. All of this would happen in tiltIndicatorBefore.

The new ecoinvent dataset has products that do and don't exist in europages. It's passed to emissions_profile_any_add_values_to_categorize(), which adds the columns grouped_by and values_to_categorize. The output is then reduced to only europages products by joining it to a "co2" dataset (products or inputs) on activity_uuid_product_uuid. The result is a new version of the "co2" dataset (products or inputs), which can be passed to emissions_profile*().

cc' @AnneSchoenauer @kalashsinghal

library(dplyr, warn.conflicts = FALSE)
path_to_tilt_indicator <- "~/git/tiltIndicator"
devtools::load_all(path_to_tilt_indicator)
#> ℹ Loading tiltIndicator

# NEW PRE-PROCESSING HELPER (maybe defined in tiltIndicatorBefore)
compute_values_to_categorize <- function(co2, ecoinvent) {
  ranked <- emissions_profile_any_add_values_to_categorize(ecoinvent)

  cols_to_add <- c("grouped_by", "values_to_categorize")
  cols_to_join_by <- "activity_uuid_product_uuid"
  relevant_cols <- select(ranked, all_of(cols_to_add), all_of(cols_to_join_by))

  left_join(co2, relevant_cols)
}

# NEW ECOINVENT DATASET
# Has products that do and don't exist in europages
# Has the crucial columns required to add the `values_to_categorize`
ecoinvent <- tribble(
  ~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
                      "in-ep",          "a",   "a",       "1234",            "1",
                  "not-in-ep",          "a",   "a",       "1234",            "1"
)

# NEW PRODUCTS DATASET
# Only has products that do exist in europages
old_products <- tribble(
  ~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
                      "in-ep",          "a",   "a",       "1234",            "1"
)

new_products <- compute_values_to_categorize(old_products, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`
new_products |> relocate(c("grouped_by", "values_to_categorize"))
#> # A tibble: 6 × 7
#>   grouped_by       values_to_categorize activity_uuid_produc…¹ tilt_sector unit 
#>   <chr>                           <dbl> <chr>                  <chr>       <chr>
#> 1 all                              0.75 in-ep                  a           a    
#> 2 isic_4digit                      0.75 in-ep                  a           a    
#> 3 tilt_sector                      0.75 in-ep                  a           a    
#> 4 unit                             0.75 in-ep                  a           a    
#> 5 unit_isic_4digit                 0.75 in-ep                  a           a    
#> 6 unit_tilt_sector                 0.75 in-ep                  a           a    
#> # ℹ abbreviated name: ¹​activity_uuid_product_uuid
#> # ℹ 2 more variables: isic_4digit <chr>, co2_footprint <chr>

# NEW INPUTS DATASET
# Only has products that do exist in europages
old_inputs <- tribble(
  ~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint, ~type, ~sector, ~subsector,
                      "in-ep",                               "a",                "a",                   "a",         "a",             "1234",                  "1", "ipr", "total",   "energy",
)

new_inputs <- compute_values_to_categorize(old_inputs, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`
new_inputs |> relocate(c("grouped_by", "values_to_categorize"))
#> # A tibble: 6 × 12
#>   grouped_by  values_to_categorize activity_uuid_produc…¹ input_activity_uuid_…²
#>   <chr>                      <dbl> <chr>                  <chr>                 
#> 1 all                         0.75 in-ep                  a                     
#> 2 isic_4digit                 0.75 in-ep                  a                     
#> 3 tilt_sector                 0.75 in-ep                  a                     
#> 4 unit                        0.75 in-ep                  a                     
#> 5 unit_isic_…                 0.75 in-ep                  a                     
#> 6 unit_tilt_…                 0.75 in-ep                  a                     
#> # ℹ abbreviated names: ¹​activity_uuid_product_uuid,
#> #   ²​input_activity_uuid_product_uuid
#> # ℹ 8 more variables: input_tilt_sector <chr>, input_tilt_subsector <chr>,
#> #   input_unit <chr>, input_isic_4digit <chr>, input_co2_footprint <chr>,
#> #   type <chr>, sector <chr>, subsector <chr>

Created on 2023-11-13 with reprex v2.0.2
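
A possible refinement of the helper above, as a sketch rather than the project's implementation: name the join key explicitly (which avoids the "Joining with" message) and check for "co2" rows that find no match in ecoinvent. The `ranked` table is hand-written with illustrative values instead of being computed by tiltIndicator, and the id "missing-from-ecoinvent" is made up for the example.

```r
library(dplyr, warn.conflicts = FALSE)

# Hand-written stand-in for the ranked ecoinvent table (illustrative values)
ranked <- tibble::tribble(
  ~activity_uuid_product_uuid, ~grouped_by, ~values_to_categorize,
                      "in-ep",       "all",                  0.75,
                      "in-ep",      "unit",                  0.75,
                  "not-in-ep",       "all",                  0.75,
                  "not-in-ep",      "unit",                  0.75
)

# A "co2" dataset with one product that is absent from `ranked` (hypothetical id)
co2 <- tibble::tribble(
  ~activity_uuid_product_uuid, ~co2_footprint,
                      "in-ep",            "1",
     "missing-from-ecoinvent",            "2"
)

# Explicit key: the join no longer depends on which column names happen to overlap
joined <- left_join(co2, ranked, by = "activity_uuid_product_uuid")

# Rows of `co2` without a match get NA in the added columns; list them explicitly
unmatched <- anti_join(co2, ranked, by = "activity_uuid_product_uuid")
```

With an explicit `by`, a later change in shared column names (for example the "input_" prefix discussed further down) cannot silently change the join.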

maurolepore commented 7 months ago

@AnneSchoenauer and @kalashsinghal please discuss if this makes sense or you have a different idea. Remember I'm not familiar with data-preparation so I may be easily wrong.

kalashsinghal commented 7 months ago

@maurolepore Do you want the values in the grouped_by column to be the way you showed above, or should they be isic_sec, tilt_sec, unit_isic_sec, and unit_tilt_sec (because that's how they are in the tiltIndicator output)?

AnneSchoenauer commented 7 months ago

Also adding all and unit as a group to it right @kalashsinghal

kalashsinghal commented 7 months ago

@AnneSchoenauer @maurolepore What's the need of ecoinvent products which don't exist in Europages?

FYI: I see that I would have to add sector information to the ecoinvent datasets with completely different code to provide this data. The current code only assigns sector information to the ecoinvent products that exist in Europages.

maurolepore commented 7 months ago

@AnneSchoenauer

Also adding all and unit as a group to it right @kalashsinghal

The reprex above shows that grouped_by already has values "all" and "unit". Is that what you mean?

maurolepore commented 7 months ago

@kalashsinghal

What's the need of ecoinvent products which don't exist in Europages?

As far as I know we don't use them. So if you already have code that ignores them, then great. I only added them in my reprex because that's how I understood ecoinvent is defined. Without products that do not exist in europages, then as far as I can see, my draft of ecoinvent would be just like the "co2" datasets.

maurolepore commented 7 months ago

@kalashsinghal

@maurolepore Do you want the values in the grouped_by column to be the way you showed above, or should they be isic_sec, tilt_sec, unit_isic_sec, and unit_tilt_sec (because that's how they are in the tiltIndicator output)?

I'm not sure what you mean by "that's how they are in tiltIndicator output". I think you can use the columns as I show in the reprex above; they will be passed to grouped_by and used in tiltIndicator just fine. Here is a new reprex showing the workflow all the way to the output of tiltIndicator. It looks good to me. Do you see any issue I don't?

BTW since we seem to not need the products that do not exist in europages, this new reprex is simpler and focuses only on products that do exist in europages. Here it's represented by *uuid = "a".

library(dplyr, warn.conflicts = FALSE)
devtools::load_all()
#> ℹ Loading tiltIndicator

# PRE-PROCESSING
compute_profile_ranking <- function(co2, ecoinvent) {
  ranked <- emissions_profile_any_compute_profile_ranking(ecoinvent)

  cols_to_add <- c("grouped_by", "profile_ranking")
  cols_to_join_by <- "activity_uuid_product_uuid"
  relevant_cols <- select(ranked, all_of(cols_to_add), all_of(cols_to_join_by))

  left_join(co2, relevant_cols)
}

ecoinvent <- tribble(
  ~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
                          "a",          "a",   "a",       "1234",            "1",
)

old_products <- tribble(
  ~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
                          "a",          "a",   "a",       "1234",            "1"
)

new_products <- compute_profile_ranking(old_products, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`

new_products |> relocate(grouped_by, activity_uuid_product_uuid)
#> # A tibble: 6 × 7
#>   grouped_by  activity_uuid_produc…¹ tilt_sector unit  isic_4digit co2_footprint
#>   <chr>       <chr>                  <chr>       <chr> <chr>       <chr>        
#> 1 all         a                      a           a     1234        1            
#> 2 isic_4digit a                      a           a     1234        1            
#> 3 tilt_sector a                      a           a     1234        1            
#> 4 unit        a                      a           a     1234        1            
#> 5 unit_isic_… a                      a           a     1234        1            
#> 6 unit_tilt_… a                      a           a     1234        1            
#> # ℹ abbreviated name: ¹​activity_uuid_product_uuid
#> # ℹ 1 more variable: profile_ranking <dbl>

# tiltIndicator

companies <- tribble(
  ~company_id, ~clustered, ~activity_uuid_product_uuid, ~sector, ~subsector, ~tilt_sector, ~tilt_subsector, ~type,
          "a",        "a",                         "a", "total",   "energy",          "a",             "a", "ipr"
)

result <- emissions_profile(companies, new_products)

result |> unnest_product()
#> # A tibble: 6 × 6
#>   companies_id grouped_by       risk_category clustered activity_uuid_product_…¹
#>   <chr>        <chr>            <chr>         <chr>     <chr>                   
#> 1 a            all              high          a         a                       
#> 2 a            isic_4digit      high          a         a                       
#> 3 a            tilt_sector      high          a         a                       
#> 4 a            unit             high          a         a                       
#> 5 a            unit_isic_4digit high          a         a                       
#> 6 a            unit_tilt_sector high          a         a                       
#> # ℹ abbreviated name: ¹​activity_uuid_product_uuid
#> # ℹ 1 more variable: co2_footprint <chr>

result |> unnest_company()
#> # A tibble: 18 × 4
#>    companies_id grouped_by       risk_category value
#>    <chr>        <chr>            <chr>         <dbl>
#>  1 a            all              high              1
#>  2 a            all              medium            0
#>  3 a            all              low               0
#>  4 a            isic_4digit      high              1
#>  5 a            isic_4digit      medium            0
#>  6 a            isic_4digit      low               0
#>  7 a            tilt_sector      high              1
#>  8 a            tilt_sector      medium            0
#>  9 a            tilt_sector      low               0
#> 10 a            unit             high              1
#> 11 a            unit             medium            0
#> 12 a            unit             low               0
#> 13 a            unit_isic_4digit high              1
#> 14 a            unit_isic_4digit medium            0
#> 15 a            unit_isic_4digit low               0
#> 16 a            unit_tilt_sector high              1
#> 17 a            unit_tilt_sector medium            0
#> 18 a            unit_tilt_sector low               0

Created on 2023-11-13 with reprex v2.0.2

maurolepore commented 7 months ago

@kalashsinghal

Note that if we keep using the "input_" prefix then you'll need to use an ecoinvent dataset that has that prefix in the names -- at least with this implementation of emissions_profile_any_compute_profile_ranking().

BTW I think the "input_" prefix is very expensive to maintain and thus prone to bugs. I may remove it internally (see https://github.com/2DegreesInvesting/tiltIndicator/issues/607) but it would be best to remove it from everywhere. products$tilt_sector is clearly different than input$tilt_sector and input$input_tilt_sector is redundant and hard to program with. cc' @AnneSchoenauer

# styler: off
library(dplyr, warn.conflicts = FALSE)
devtools::load_all()
#> ℹ Loading tiltIndicator

# PRE-PROCESSING
compute_profile_ranking <- function(co2, ecoinvent) {

  ranked <- emissions_profile_any_compute_profile_ranking(ecoinvent)

  cols_to_add <- c("grouped_by", "profile_ranking")
  cols_to_join_by <- "activity_uuid_product_uuid"
  relevant_cols <- select(ranked, all_of(cols_to_add), all_of(cols_to_join_by))

  left_join(co2, relevant_cols)
}

ecoinvent <- tribble(
  ~activity_uuid_product_uuid, ~tilt_sector, ~unit, ~isic_4digit, ~co2_footprint,
                          "a",          "a",   "a",       "1234",            "1",
) |> 
  # Here is what you need to rename
  rename_with(~ paste0("input_", .x), -activity_uuid_product_uuid)

old_inputs <- tribble(
  ~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_tilt_subsector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint, ~type, ~sector, ~subsector,
                          "a",                               "a",                "a",                   "a",         "a",             "1234",                  "1", "ipr", "total",   "energy"
)

new_inputs <- compute_profile_ranking(old_inputs, ecoinvent)
#> Joining with `by = join_by(activity_uuid_product_uuid)`

new_inputs |> relocate(grouped_by, activity_uuid_product_uuid)
#> # A tibble: 6 × 12
#>   grouped_by     activity_uuid_produc…¹ input_activity_uuid_…² input_tilt_sector
#>   <chr>          <chr>                  <chr>                  <chr>            
#> 1 all            a                      a                      a                
#> 2 input_isic_4d… a                      a                      a                
#> 3 input_tilt_se… a                      a                      a                
#> 4 input_unit     a                      a                      a                
#> 5 input_unit_in… a                      a                      a                
#> 6 input_unit_in… a                      a                      a                
#> # ℹ abbreviated names: ¹​activity_uuid_product_uuid,
#> #   ²​input_activity_uuid_product_uuid
#> # ℹ 8 more variables: input_tilt_subsector <chr>, input_unit <chr>,
#> #   input_isic_4digit <chr>, input_co2_footprint <chr>, type <chr>,
#> #   sector <chr>, subsector <chr>, profile_ranking <dbl>

# tiltIndicator

companies <- tribble(
  ~company_id, ~clustered, ~activity_uuid_product_uuid, ~sector, ~subsector, ~tilt_sector, ~tilt_subsector, ~type,
          "a",        "a",                     "a",     "total",   "energy",          "a",             "a", "ipr"
)

result <- emissions_profile_upstream(companies, new_inputs)

result |> unnest_product()
#> # A tibble: 6 × 7
#>   companies_id grouped_by         risk_category clustered activity_uuid_produc…¹
#>   <chr>        <chr>              <chr>         <chr>     <chr>                 
#> 1 a            all                high          a         a                     
#> 2 a            input_isic_4digit  high          a         a                     
#> 3 a            input_tilt_sector  high          a         a                     
#> 4 a            input_unit         high          a         a                     
#> 5 a            input_unit_input_… high          a         a                     
#> 6 a            input_unit_input_… high          a         a                     
#> # ℹ abbreviated name: ¹​activity_uuid_product_uuid
#> # ℹ 2 more variables: input_activity_uuid_product_uuid <chr>,
#> #   input_co2_footprint <chr>

result |> unnest_company()
#> # A tibble: 18 × 4
#>    companies_id grouped_by                   risk_category value
#>    <chr>        <chr>                        <chr>         <dbl>
#>  1 a            all                          high              1
#>  2 a            all                          medium            0
#>  3 a            all                          low               0
#>  4 a            input_isic_4digit            high              1
#>  5 a            input_isic_4digit            medium            0
#>  6 a            input_isic_4digit            low               0
#>  7 a            input_tilt_sector            high              1
#>  8 a            input_tilt_sector            medium            0
#>  9 a            input_tilt_sector            low               0
#> 10 a            input_unit                   high              1
#> 11 a            input_unit                   medium            0
#> 12 a            input_unit                   low               0
#> 13 a            input_unit_input_isic_4digit high              1
#> 14 a            input_unit_input_isic_4digit medium            0
#> 15 a            input_unit_input_isic_4digit low               0
#> 16 a            input_unit_input_tilt_sector high              1
#> 17 a            input_unit_input_tilt_sector medium            0
#> 18 a            input_unit_input_tilt_sector low               0

# styler: on

Created on 2023-11-13 with reprex v2.0.2
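
The prefix removal suggested in this comment could look roughly like the sketch below. It is only an illustration with dplyr, not what tiltIndicator does internally. One assumption worth stating: input_activity_uuid_product_uuid keeps its prefix, because stripping it would collide with the existing activity_uuid_product_uuid column. The toy values are made up.

```r
library(dplyr, warn.conflicts = FALSE)

inputs <- tibble::tribble(
  ~activity_uuid_product_uuid, ~input_activity_uuid_product_uuid, ~input_tilt_sector, ~input_unit, ~input_isic_4digit, ~input_co2_footprint,
                          "a",                               "b",                "a",         "a",             "1234",                  "1"
)

# This id keeps its prefix; without it the name would clash with the product id
keep_prefixed <- "input_activity_uuid_product_uuid"

unprefixed <- inputs |>
  rename_with(
    ~ sub("^input_", "", .x),
    starts_with("input_") & !all_of(keep_prefixed)
  )

names(unprefixed)
```

After this step the inputs dataset would share column names with the products dataset, so a single ecoinvent table without the prefix could serve both.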

AnneSchoenauer commented 7 months ago

@kalashsinghal

What's the need of ecoinvent products which don't exist in Europages?

As far as I know we don't use them. So if you already have code that ignores them, then great. I only added them in my reprex because that's how I understood ecoinvent is defined. Without products that do not exist in europages, then as far as I can see, my draft of ecoinvent would be just like the "co2" datasets.

Hi both - I think we do need all ecoinvent activities, and we would also need to match all ecoinvent activities to tilt_sectors and tilt_subsectors to be able to do the benchmarking.

The point here, Kalash, is that we want to compare ALL ecoinvent activities to each other, and then pick only those that are also in europages. That way, the europages products are compared to ALL ecoinvent activities and not only to the products that could be matched with europages. This was the whole point of creating the issue. So it is crucial to include ALL ecoinvent activities and to create the values_to_categorize for them with regard to the six benchmarks (tilt_sec, tilt_sec_unit, isic_sec, isic_sec_unit, all, unit).

If you have questions, let me know @kalashsinghal and we can discuss.
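
To make the difference concrete, here is a minimal sketch of "rank against all ecoinvent activities, then keep the europages ones" versus "keep the europages ones, then rank". The ranking used here, rank divided by the number of rows, is an assumption; it is consistent with the 0.75 shown in the first reprex (two tied rows) but is not confirmed as tiltIndicator's method. The ids and footprints are invented, and only the "all" benchmark is shown.

```r
library(dplyr, warn.conflicts = FALSE)

ecoinvent <- tibble::tribble(
  ~activity_uuid_product_uuid, ~co2_footprint,
                    "in-ep-1",              1,
                    "in-ep-2",              4,
                "not-in-ep-1",              2,
                "not-in-ep-2",              3
)
europages_ids <- c("in-ep-1", "in-ep-2")

# Assumed ranking: position of each footprint relative to the number of rows
rank_share <- function(x) rank(x) / length(x)

# Rank against ALL ecoinvent activities, then keep the europages products
ranked_all <- ecoinvent |>
  mutate(profile_ranking = rank_share(co2_footprint)) |>
  filter(activity_uuid_product_uuid %in% europages_ids)

# Keep the europages products first, then rank only among them
ranked_matched <- ecoinvent |>
  filter(activity_uuid_product_uuid %in% europages_ids) |>
  mutate(profile_ranking = rank_share(co2_footprint))
```

In this toy data "in-ep-1" ranks 0.25 against all four activities but 0.5 against the two matched ones, so the order of filtering and ranking changes the benchmark.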

maurolepore commented 7 months ago

Oh, of course -- that was the whole point! Thanks Anne :-)

kalashsinghal commented 7 months ago

The point here, Kalash, is that we want to compare ALL ecoinvent activities to each other, and then pick only those that are also in europages. That way, the europages products are compared to ALL ecoinvent activities and not only to the products that could be matched with europages. This was the whole point of creating the issue. So it is crucial to include ALL ecoinvent activities and to create the values_to_categorize for them with regard to the six benchmarks (tilt_sec, tilt_sec_unit, isic_sec, isic_sec_unit, all, unit).

@AnneSchoenauer @maurolepore This means we need these three datasets:

  1. Emissions_profile_products dataset which only has europages products (with grouped_by and profile_ranking)
  2. Emissions_profile_upstream_products dataset which only has europages products (with grouped_by and profile_ranking)
  3. Ecoinvent dataset which contains all ecoinvent products (with grouped_by and profile_ranking)

Am I right?

Please let me know if it makes sense to combine Emissions_profile_products and Emissions_profile_upstream_products into a single dataframe. In that case we might need separate columns, profile_ranking and profile_ranking_upstream, in a single dataset, while grouped_by would remain common. Thanks!

maurolepore commented 7 months ago

Yes, I believe we need those three datasets. So no, I would not combine products and inputs (upstream-products) into a single dataset.

Let's see what Anne says, but do you see any good reason to combine them?

What I would do (eventually) is remove the "input_" prefix as explained above.

kalashsinghal commented 7 months ago

What I would do (eventually) is remove the "input_" prefix as explained above.

@maurolepore I believe the column names would remain the same (for ex: input_tilt_sector) but only the values in grouped_by column will change like tilt_sector. right?

maurolepore commented 7 months ago

In the current implementation grouped_by stores values that come from the column names. In the reprex above you can see that if ecoinvent has the column name "tilt_sector" then grouped_by gets the value "tilt_sector". And if ecoinvent has the column name "input_tilt_sector" then grouped_by gets the value "input_tilt_sector".

For now it seems simpler to continue to use the column names we used before. But eventually I think we should remove the "input_" prefix from everywhere.
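
As an illustration of how the grouped_by values follow from the column names, the sketch below rebuilds the six labels seen in the reprexes from a list of column groups. `label_benchmarks()` is a hypothetical helper written for this example in base R; it is not a tiltIndicator function and says nothing about how the package computes the labels internally.

```r
# Hypothetical helper: build the six benchmark labels from column names
label_benchmarks <- function(prefix = "") {
  groups <- list(
    "all", "isic_4digit", "tilt_sector", "unit",
    c("unit", "isic_4digit"), c("unit", "tilt_sector")
  )
  vapply(groups, function(cols) {
    # "all" is not a column, so it never takes the prefix
    if (identical(cols, "all")) return("all")
    paste0(prefix, cols, collapse = "_")
  }, character(1))
}

label_benchmarks()
label_benchmarks(prefix = "input_")
```

Because every column in a group carries the prefix, the prefixed labels repeat it (for example input_unit_input_isic_4digit), which matches the upstream reprex.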

AnneSchoenauer commented 7 months ago

Hi both,

As I said before - we won't need any data from europages. Only data from ecoinvent.

Please take the ecoinvent data and take only the activities in the geographies that we filtered (see geography filter from Bob). The columns that we need are:

To derive the tilt_sec and tilt_subsector, please first use the sector mapper and, for those activities where the sector mapper doesn't work, sector resolving.

You can also see the whole idea here: https://github.com/2DegreesInvesting/tiltIndicator/issues/566

Thanks a lot.

kalashsinghal commented 7 months ago

@maurolepore You might have updated it somewhere else before, however please update once again the input files' structure that you would need for tiltIndicator after this change. I will adjust my code accordingly. Thanks!

Note: Only for Emissions_profile and Emissions_profile_upstream indicators

maurolepore commented 7 months ago

@kalashsinghal,

Since this PR, tiltIndicator already knows how to handle the new data. Sure, I can update tiltToyData so that the toy "co2" datasets (products and inputs) gain the columns grouped_by and profile_ranking. But that's all that tiltIndicator uses, so I guess that shouldn't block you?

What I would do in the toy datasets is what I showed in the reprex above (see new_products and new_inputs). But the conversation seems to have continued between you and Anne. If that's good to you, then sure I can create a PR to tiltToyData and ask for your review.

kalashsinghal commented 7 months ago

@maurolepore Yes please create a PR for the toyData which will be inputs to tiltIndicator. I want to remain aligned while working on the input requirements. And I guess there could be different datasets I might give you but I am not sure about that as of yet. We can discuss more about that in that PR.

kalashsinghal commented 7 months ago

@maurolepore I have emailed you the toy datasets which could be possible inputs to tiltIndicator. Please create tickets for any new issue.

maurolepore commented 7 months ago

Thanks @kalashsinghal, the datasets look great. Note they have an unexpected first column but that's easy to remove.
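
A sketch of removing such a column, assuming (this is a guess, not something stated in the thread) that it is a row index written by write.csv() with its default row names. The file and its contents are created on the fly for the example.

```r
# Write a toy file the way write.csv() does by default: row names become
# an unnamed first column in the CSV
path <- tempfile(fileext = ".csv")
toy <- data.frame(activity_uuid_product_uuid = "a", co2_footprint = 1)
write.csv(toy, path)

raw <- read.csv(path)

# Drop the first column by position, whatever name it was given on import
clean <- raw[-1]
names(clean)
```

Writing with `row.names = FALSE` in the first place would avoid the extra column entirely.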

I shared a reprex in the related PR in tiltIndicatorBefore (here) because I wanted to leave a record of the data in that PR.

I understand these are public, toy datasets, so I'll go ahead and add them to this repo.

Thanks!

maurolepore commented 7 months ago

For the record,

These are the current datasets (stored in inst/extdata/)

toy_emissions_profile_any_companies()
toy_emissions_profile_products()
toy_emissions_profile_upstream_products()
toy_sector_profile_any_scenarios()
toy_sector_profile_companies()
toy_sector_profile_upstream_companies()
toy_sector_profile_upstream_products()

This issue aims to develop new datasets or new versions of existing emissions* datasets. For each dataset I need to understand if it is either

  1. A new version of a dataset whose concept is already captured by an existing dataset. In this case we need to update the data but preserve the name. The old data may remain for a while in a deprecated state or be retired immediately.

  2. A new dataset whose concept is not yet captured by an existing dataset. In this case we need to introduce a new name that captures the new concept. Then either:

    • 2.a. The new concept complements the old concepts which are still useful.
    • 2.b. The new concept replaces some old concepts which are no longer useful.

I understand that the new emissions*companies dataset is of type 1. The two new emissions*products* datasets are less clear:

@AnneSchoenauer and @kalashsinghal what do you think? Are the old emissions*products* datasets still useful?

While I wait for answers, I'll prepare a PR with my proposed changes so that you can see exactly what it looks like, and how the deprecation could be implemented.
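
One way the "deprecated state" could be sketched, using base R's .Deprecated(). This only illustrates case 2.b. The replacement name toy_emissions_profile_products_ecoinvent() and both file names are hypothetical, and the real accessors in this package may be implemented differently.

```r
# Hypothetical accessor for a dataset that captures the new concept
toy_emissions_profile_products_ecoinvent <- function() {
  # Hypothetical file name under inst/extdata/
  file.path("inst", "extdata", "emissions_profile_products_ecoinvent.csv")
}

# The old accessor keeps working but warns and points to the replacement
toy_emissions_profile_products <- function() {
  .Deprecated(new = "toy_emissions_profile_products_ecoinvent")
  # Hypothetical file name under inst/extdata/
  file.path("inst", "extdata", "emissions_profile_products.csv")
}

old_path <- suppressWarnings(toy_emissions_profile_products())
```

Existing code that calls the old accessor would keep running while the warning tells users where to move, and the old function could be removed once the methodology question is settled.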

AnneSchoenauer commented 7 months ago

Hi everyone,

As I was not here last week, I think I am a bit puzzled. Let me try to summarise my thoughts:

Maybe @kalashsinghal you have an opinion about this, and maybe @maurolepore this already helps. If not, please let me know and I'll try to understand it better. Best, Anne

kalashsinghal commented 7 months ago

@AnneSchoenauer I can apply benchmarking to the old datasets as well. That's not the issue. The issue is that I am not sure whether to apply benchmarking and ranking to the old datasets, because they also consider the multi-match methodology. And you said in Slack that the multi-match approach will be used in the future. So this means that in the future I will have to apply benchmarking to the old datasets as well, if they will be used. Right?

AnneSchoenauer commented 7 months ago

@kalashsinghal okay, got it. So we did the benchmarking wrong with multi matches, which is why the old dataset would also be wrong. We therefore left the multi matches out for this round. We are currently working on a methodological solution but won't be able to share it in the next two weeks, I think!

kalashsinghal commented 7 months ago

@AnneSchoenauer Thanks for the clarity on this! @maurolepore We have to keep the old datasets in the deprecated state for, hopefully, about two weeks. After that, based on new insights from Anne, we will either keep them or delete them. I hope it's clear now. Please let us know if you have any questions!

maurolepore commented 7 months ago

It's best to discuss this in tiltToyDataPrivate (https://github.com/2DegreesInvesting/tiltToyDataPrivate/pull/1), a private repo where we can more easily discuss private issues.

Once the conversation is resolved, I'll bring the public data and code back here.