2DegreesInvesting / tiltToyData

Toy datasets for TILT
https://2degreesinvesting.github.io/tiltToyData/
GNU General Public License v3.0
0 stars 0 forks source link

Ensure `*products_ecoinvent` have these properties #22

Closed AnneSchoenauer closed 5 months ago

AnneSchoenauer commented 6 months ago

Relates to #19

Dear @maurolepore,

As we saw in the Tech Meeting today, I think it is needed that we all understand once the whole process with the profile_ranking and why I asked several times if we still have a dataset with the "clustered" (the europage products). I also would want to make one enhancement to the toy_datasets so that they become more realistic (even though they are still toy data sets). Maybe it is already implemented but I want to make sure with this ticket.

The enhancement would be that the toy dataset for the emission_profile_products (which are the ecoinvent products) have activitiy_uuid_product_uuid that are not existing in the emission_profile_any_company dataset. So that it looks like this.

library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Create the first dataframe: emissions_profile_products
emissions_profile_products <- data.frame(
  co2_footprint = c(176, 58.1, 4.95, 12.5, 2.07, 176, 58.1, 4.95, 12.5, 2.07),
  tilt_sector = c("Industry", "Industry", "Steel & Metals", "Agriculture", "Industry", "Industry", "Industry", "Steel & Metals", "Agriculture", "Industry"),
  tilt_subsector = c("Other", "Other", "Steel", "Agriculture", "Other", "Other", "Other", "Steel", "Agriculture", "Other"),
  unit = c("unit", "unit", "kg", "kg", "kg", "unit", "unit", "kg", "kg", "kg"),
  isic_4digit = c("2560", "2560", "2870", "1780", "2679", "2560", "2560", "2870", "1780", "2679"),
  activity_uuid_product_uuid = c("UUID1", "UUID2", "UUID3", "UUID4", "UUID5", "UUID6", "UUID7", "UUID8", "UUID9", "UUID10"),
  ei_activity_name = c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", "Name8", "Name9", "Name10")
)

# Create the second dataframe: emissions_profile_any_companies
emissions_profile_any_companies <- data.frame(
  activity_uuid_product_uuid = c("UUID1", "UUID2", "UUID3", "UUID7", "UUID5", "UUID6", "UUID10", "UUID2", "UUID1"),
  clustered = c("stove", "oven", "steel", "aged cheese", "rubber", "cream", "apple", "pear", "oven"),
  companies_id = c("fleischerei…", "fleischerei…", "pecheries-b…", "hoche-butte…", "vicquelin-e…", 
                   "bst-procont…", "leider-gmbh…", "cheries-baq…", "ca-coity-tr…"),
  unit = c("unit", "unit", "kg", "kg", "kg", "kg", "kg", "kg", "NA")
)

# Find UUIDs that are in emissions_profile_products but not in emissions_profile_any_companies
uuids_not_in_companies <- setdiff(emissions_profile_products$activity_uuid_product_uuid, 
                                  emissions_profile_any_companies$activity_uuid_product_uuid)

# Print the results
print(uuids_not_in_companies)
#> [1] "UUID4" "UUID8" "UUID9"

Created on 2024-01-10 with reprex v2.0.2

Alrigth - so WHY is this important and again I think it is crucial to understand this. When it comes to profile ranking the process should be a such 1) Calculate the profile ranking on the WHOLE ecoinvent product data set. 2) Calculate the risk categories on this. 3) Matcihng it with the company data set. The company dataset will have some products that were not able to be matched with ecoinvent. And therefore, some ecoinvent products will be filtered out. Again UUID4, UUID 8 and UUID9. However, the profile ranking of all the other UUIDs will be the same as they were pre-calculated before matching. In a reprex this looks like this:

library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Create the emissions_profile_products dataframe
emissions_profile_products <- data.frame(
  co2_footprint = c(176, 58.1, 4.95, 12.5, 2.07, 176, 58.1, 4.95, 12.5, 2.07),
  tilt_sector = c("Industry", "Industry", "Steel & Metals", "Agriculture", "Industry", "Industry", "Industry", "Steel & Metals", "Agriculture", "Industry"),
  tilt_subsector = c("Other", "Other", "Steel", "Agriculture", "Other", "Other", "Other", "Steel", "Agriculture", "Other"),
  unit = c("unit", "unit", "kg", "kg", "kg", "unit", "unit", "kg", "kg", "kg"),
  isic_4digit = c("2560", "2560", "2870", "1780", "2679", "2560", "2560", "2870", "1780", "2679"),
  activity_uuid_product_uuid = c("UUID1", "UUID2", "UUID3", "UUID4", "UUID5", "UUID6", "UUID7", "UUID8", "UUID9", "UUID10"),
  ei_activity_name = c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", "Name8", "Name9", "Name10")
)

# Add new variables to the dataset
emissions_profile_products <- emissions_profile_products %>%
  mutate(
    benchmark = "all",  # Add benchmark variable
    profile_ranking = min_rank(-co2_footprint) / n(),  # Normalized ranking between 0 and 1
    # Determine risk share based on profile_ranking
    risk_share = case_when(
      profile_ranking > 0.6666 ~ "high",
      profile_ranking > 0.3333 ~ "medium",
      TRUE ~ "low"
    )
  )

# Print the updated dataframe
print(emissions_profile_products)
#>    co2_footprint    tilt_sector tilt_subsector unit isic_4digit
#> 1         176.00       Industry          Other unit        2560
#> 2          58.10       Industry          Other unit        2560
#> 3           4.95 Steel & Metals          Steel   kg        2870
#> 4          12.50    Agriculture    Agriculture   kg        1780
#> 5           2.07       Industry          Other   kg        2679
#> 6         176.00       Industry          Other unit        2560
#> 7          58.10       Industry          Other unit        2560
#> 8           4.95 Steel & Metals          Steel   kg        2870
#> 9          12.50    Agriculture    Agriculture   kg        1780
#> 10          2.07       Industry          Other   kg        2679
#>    activity_uuid_product_uuid ei_activity_name benchmark profile_ranking
#> 1                       UUID1            Name1       all             0.1
#> 2                       UUID2            Name2       all             0.3
#> 3                       UUID3            Name3       all             0.7
#> 4                       UUID4            Name4       all             0.5
#> 5                       UUID5            Name5       all             0.9
#> 6                       UUID6            Name6       all             0.1
#> 7                       UUID7            Name7       all             0.3
#> 8                       UUID8            Name8       all             0.7
#> 9                       UUID9            Name9       all             0.5
#> 10                     UUID10           Name10       all             0.9
#>    risk_share
#> 1         low
#> 2         low
#> 3        high
#> 4      medium
#> 5        high
#> 6         low
#> 7         low
#> 8        high
#> 9      medium
#> 10       high

# Joining the datasets
joined_data <- emissions_profile_products %>%
  left_join(emissions_profile_any_companies, by = "activity_uuid_product_uuid")
#> Error: Objekt 'emissions_profile_any_companies' nicht gefunden

# Printing the joined dataset
print(joined_data)
#> Error in eval(expr, envir, enclos): Objekt 'joined_data' nicht gefunden

# Find UUIDs that are in emissions_profile_products but not in emissions_profile_any_companies
uuids_not_in_companies <- setdiff(emissions_profile_products$activity_uuid_product_uuid, 
                                  emissions_profile_any_companies$activity_uuid_product_uuid)
#> Error in eval(expr, envir, enclos): Objekt 'emissions_profile_any_companies' nicht gefunden

# Print the results
print(uuids_not_in_companies)
#> Error in eval(expr, envir, enclos): Objekt 'uuids_not_in_companies' nicht gefunden

Created on 2024-01-10 with reprex v2.0.2

Please note @maurolepore that this reprex doesn't use the correct profile_ranking variable. I just wanted to make the point that the profile ranking is calculated first, then the risk categories and we then do the matching with europages companies and products (clustered).

Does this make it clear?

maurolepore commented 6 months ago

Thanks for your patience @AnneSchoenauer,

I believe we all understand the requirement but it's great to check here it's actually implemented as you expect.

Note we're mixing the behavior of the code (tiltIndicator) with the characteristics of the toy datasets (tiltToyData). I'll discuss each separately.

Code

For the record the PR that implemented your request is in tiltIndicator tiltIndicator#644.

This search gives you an overview of what the code currently does (likely it's the code you said Kalash showed you):

You can see that we:

image

Toy data

For the record, Kalash shared with me some small real datasets that presumably have these properties. Finally I turned turn those small real datasets into toy datasets we can share publicly. The result is this PR: #19.

Thanks for your reprex, particularly for the meaningful variable names :-)

Here is my version, using the toy datasets we are developing in #19 (reproduced from here):

library(readr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
devtools::load_all()
#> ℹ Loading tiltToyData

options(readr.show_col_types = FALSE, width = 1000)

companies <- read_csv(toy_emissions_profile_any_companies())
products <- read_csv(toy_emissions_profile_products_ecoinvent())

# *uuid in companies that match *uuid in products
left_join(companies, products, relationship = "many-to-many") |> 
  print() |> 
  distinct(activity_uuid_product_uuid)
#> Joining with `by = join_by(activity_uuid_product_uuid, ei_activity_name, unit)`
#> # A tibble: 155 × 12
#>    activity_uuid_product_uuid           clustered                   companies_id            country ei_activity_name                                              main_activity unit  co2_footprint ei_geography isic_4digit tilt_sector  tilt_subsector          
#>    <chr>                                <chr>                       <chr>                   <chr>   <chr>                                                         <chr>         <chr>         <dbl> <chr>        <chr>       <chr>        <chr>                   
#>  1 76269c17-78d6-420b-991a-aa38c51b45b7 tent                        soot_asianpiedstarling  germany market for shed, large, wood, non-insulated, fire-unprotected distributor   m2          447.    GLO          '4100'      construction construction residential
#>  2 76269c17-78d6-420b-991a-aa38c51b45b7 tent                        soot_asianpiedstarling  germany market for shed, large, wood, non-insulated, fire-unprotected distributor   m2          321.    GLO          '4100'      construction construction residential
#>  3 76269c17-78d6-420b-991a-aa38c51b45b7 table hire for parties      frightening_chrysomelid spain   market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2          447.    GLO          '4100'      construction construction residential
#>  4 76269c17-78d6-420b-991a-aa38c51b45b7 table hire for parties      frightening_chrysomelid spain   market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2          321.    GLO          '4100'      construction construction residential
#>  5 76269c17-78d6-420b-991a-aa38c51b45b7 surface finishing, galvanic hyperbrutal_flea        germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg            0.405 GLO          '2591'      metals       other metals            
#>  6 76269c17-78d6-420b-991a-aa38c51b45b7 surface finishing, galvanic hyperbrutal_flea        germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg            0.884 GLO          '2591'      metals       other metals            
#>  7 76269c17-78d6-420b-991a-aa38c51b45b7 surface finishing, galvanic hyperbrutal_flea        germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg            0.675 GLO          '2591'      metals       other metals            
#>  8 76269c17-78d6-420b-991a-aa38c51b45b7 surface engineering         hyperbrutal_flea        germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg            0.405 GLO          '2591'      metals       other metals            
#>  9 76269c17-78d6-420b-991a-aa38c51b45b7 surface engineering         hyperbrutal_flea        germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg            0.884 GLO          '2591'      metals       other metals            
#> 10 76269c17-78d6-420b-991a-aa38c51b45b7 surface engineering         hyperbrutal_flea        germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg            0.675 GLO          '2591'      metals       other metals            
#> # ℹ 145 more rows
#> # A tibble: 3 × 1
#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 76269c17-78d6-420b-991a-aa38c51b45b7
#> 2 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb
#> 3 833caa78-30df-4374-900f-7f88ab44075b

# *uuid in companies that do NOT match *uuid in products
anti_join(companies, products) |> 
  print() |> 
  distinct(activity_uuid_product_uuid)
#> Joining with `by = join_by(activity_uuid_product_uuid, ei_activity_name, unit)`
#> # A tibble: 4 × 7
#>   activity_uuid_product_uuid           clustered       companies_id                        country     ei_activity_name                                              main_activity unit 
#>   <chr>                                <chr>           <chr>                               <chr>       <chr>                                                         <chr>         <chr>
#> 1 833caa78-30df-4374-900f-7f88ab44075b garden fittings weak_meadowlark                     netherlands market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2   
#> 2 833caa78-30df-4374-900f-7f88ab44075b garden fittings arrogant_ewe                        netherlands market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2   
#> 3 833caa78-30df-4374-900f-7f88ab44075b tent            pseudoeconomical_easternglasslizard germany     market for shed, large, wood, non-insulated, fire-unprotected distributor   m2   
#> 4 833caa78-30df-4374-900f-7f88ab44075b tent            charterable_wren                    germany     market for shed, large, wood, non-insulated, fire-unprotected distributor   m2
#> # A tibble: 1 × 1
#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 833caa78-30df-4374-900f-7f88ab44075b

# *uuid in products that do NOT match *uuid in companies
anti_join(products, companies) |> 
  print() |> 
  distinct(activity_uuid_product_uuid)
#> Joining with `by = join_by(activity_uuid_product_uuid, ei_activity_name, unit)`
#> # A tibble: 8 × 8
#>   activity_uuid_product_uuid           co2_footprint ei_activity_name                                         ei_geography isic_4digit tilt_sector tilt_subsector unit 
#>   <chr>                                        <dbl> <chr>                                                    <chr>        <chr>       <chr>       <chr>          <chr>
#> 1 833caa78-30df-4374-900f-7f88ab44075b        14.1   iron-nickel-chromium alloy production                    RER          '2410'      metals      iron & steel   kg   
#> 2 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb         0.419 market for deep drawing, steel, 10000 kN press, automode GLO          '2591'      metals      other metals   kg   
#> 3 833caa78-30df-4374-900f-7f88ab44075b         9.47  iron-nickel-chromium alloy production                    RER          '2410'      metals      iron & steel   kg   
#> 4 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb         0.648 market for deep drawing, steel, 10000 kN press, automode GLO          '2591'      metals      other metals   kg   
#> 5 833caa78-30df-4374-900f-7f88ab44075b        13.6   iron-nickel-chromium alloy production                    RER          '2410'      metals      iron & steel   kg   
#> 6 833caa78-30df-4374-900f-7f88ab44075b        14.7   iron-nickel-chromium alloy production                    RER          '2410'      metals      iron & steel   kg   
#> 7 833caa78-30df-4374-900f-7f88ab44075b         0.390 market for deep drawing, steel, 10000 kN press, automode GLO          '2591'      metals      other metals   kg   
#> 8 833caa78-30df-4374-900f-7f88ab44075b        12.7   iron-nickel-chromium alloy production                    RER          '2410'      metals      iron & steel   kg
#> # A tibble: 2 × 1
#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 833caa78-30df-4374-900f-7f88ab44075b
#> 2 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb

companies <- read_csv(toy_emissions_profile_any_companies())
inputs <- read_csv(toy_emissions_profile_upstream_products_ecoinvent())

# *uuid in companies that match *uuid in inputs
left_join(companies, inputs, relationship = "many-to-many") |> 
  print() |> 
  distinct(activity_uuid_product_uuid)
#> Joining with `by = join_by(activity_uuid_product_uuid, ei_activity_name)`
#> # A tibble: 97 × 16
#>    activity_uuid_product_uuid           clustered                   companies_id                         country ei_activity_name                                              main_activity unit  ei_geography input_activity_uuid_product_uuid                                          input_co2_footprint input_ei_activity_name                                          input_isic_4digit input_reference_product_name                       input_tilt_sector input_tilt_subsector     input_unit
#>    <chr>                                <chr>                       <chr>                                <chr>   <chr>                                                         <chr>         <chr> <chr>        <chr>                                                                                   <dbl> <chr>                                                           <chr>             <chr>                                              <chr>             <chr>                    <chr>     
#>  1 76269c17-78d6-420b-991a-aa38c51b45b7 tent                        soot_asianpiedstarling               germany market for shed, large, wood, non-insulated, fire-unprotected distributor   m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                240. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#>  2 76269c17-78d6-420b-991a-aa38c51b45b7 table hire for parties      frightening_chrysomelid              spain   market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                240. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#>  3 76269c17-78d6-420b-991a-aa38c51b45b7 surface finishing, galvanic hyperbrutal_flea                     germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg    <NA>         <NA>                                                                                      NA  <NA>                                                            <NA>              <NA>                                               <NA>              <NA>                     <NA>      
#>  4 76269c17-78d6-420b-991a-aa38c51b45b7 surface engineering         hyperbrutal_flea                     germany market for deep drawing, steel, 10000 kN press, automode      distributor   kg    <NA>         <NA>                                                                                      NA  <NA>                                                            <NA>              <NA>                                               <NA>              <NA>                     <NA>      
#>  5 76269c17-78d6-420b-991a-aa38c51b45b7 tent                        flexible_dolphin                     austria market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                240. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#>  6 76269c17-78d6-420b-991a-aa38c51b45b7 tent                        paramilitary_racerunner              germany market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                240. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#>  7 76269c17-78d6-420b-991a-aa38c51b45b7 open space amenities        level_meadowhawk                     france  market for shed, large, wood, non-insulated, fire-unprotected wholesaler    m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                240. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#>  8 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb tent                        heartrending_attwatersprairiechicken germany market for shed, large, wood, non-insulated, fire-unprotected distributor   m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                463. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#>  9 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb tent                        heartrending_attwatersprairiechicken germany market for shed, large, wood, non-insulated, fire-unprotected distributor   m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                451. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#> 10 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb tent                        heartrending_attwatersprairiechicken germany market for shed, large, wood, non-insulated, fire-unprotected distributor   m2    RoW          bc548877-9cc6-590d-ba72-1d1d2daeb5b9_e2ccc500-255f-448c-8c88-ed25177993df                447. shed construction, large, wood, non-insulated, fire-unprotected '4100'            shed, large, wood, non-insulated, fire-unprotected construction      construction residential m2        
#> # ℹ 87 more rows
#> # A tibble: 3 × 1
#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 76269c17-78d6-420b-991a-aa38c51b45b7
#> 2 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb
#> 3 833caa78-30df-4374-900f-7f88ab44075b

# *uuid in companies that do NOT match *uuid in inputs
anti_join(companies, inputs) |> 
  print() |> 
  distinct(activity_uuid_product_uuid)
#> Joining with `by = join_by(activity_uuid_product_uuid, ei_activity_name)`
#> # A tibble: 4 × 7
#>   activity_uuid_product_uuid           clustered                   companies_id      country ei_activity_name                                         main_activity         unit 
#>   <chr>                                <chr>                       <chr>             <chr>   <chr>                                                    <chr>                 <chr>
#> 1 76269c17-78d6-420b-991a-aa38c51b45b7 surface finishing, galvanic hyperbrutal_flea  germany market for deep drawing, steel, 10000 kN press, automode distributor           kg   
#> 2 76269c17-78d6-420b-991a-aa38c51b45b7 surface engineering         hyperbrutal_flea  germany market for deep drawing, steel, 10000 kN press, automode distributor           kg   
#> 3 76269c17-78d6-420b-991a-aa38c51b45b7 deep-drawn metal part       humanoid_elkhound germany market for deep drawing, steel, 10000 kN press, automode agent/ representative kg   
#> 4 76269c17-78d6-420b-991a-aa38c51b45b7 drawn parts                 humanoid_elkhound germany market for deep drawing, steel, 10000 kN press, automode agent/ representative kg
#> # A tibble: 1 × 1
#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 76269c17-78d6-420b-991a-aa38c51b45b7

# *uuid in inputs that do NOT match *uuid in companies
anti_join(inputs, companies) |> 
  print() |> 
  distinct(activity_uuid_product_uuid)
#> Joining with `by = join_by(activity_uuid_product_uuid, ei_activity_name)`
#> # A tibble: 85 × 11
#>    activity_uuid_product_uuid           ei_activity_name                                         ei_geography                      input_activity_uuid_product_uuid                                          input_co2_footprint input_ei_activity_name                        input_isic_4digit input_reference_product_name                  input_tilt_sector input_tilt_subsector  input_unit
#>    <chr>                                <chr>                                                    <chr>                             <chr>                                                                                   <dbl> <chr>                                         <chr>             <chr>                                         <chr>             <chr>                 <chr>     
#>  1 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb market for deep drawing, steel, 10000 kN press, automode RoW                               55a5ac05-ab15-5a27-9d0e-6ecf840039f1_f10b8722-4be1-43d5-b17d-c51ad0e29d29              0.456  deep drawing, steel, 10000 kN press, automode '2591'            deep drawing, steel, 10000 kN press, automode metals            other metals          kg        
#>  2 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    GLO                               bdc93cd8-00b4-5b3e-993e-b7fef7059e52_4e584f6f-2e71-4796-931e-bb9a273c161c              1.67   market for anode, for metal electrolysis      '2790'            anode, for metal electrolysis                 industry          machinery & equipment kg        
#>  3 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    RER                               95fcd1bb-4dc6-516a-a3b2-30a4f0530639_3b1d249a-c924-4d6c-8e1f-647f562daa54              0.530  market for electric arc furnace dust          '3821'            electric arc furnace dust                     industry          other industry        kg        
#>  4 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    RER                               daef2f9a-4108-52ae-90a7-fe64abad51bc_6e74937e-b691-4c49-9b8f-5ba44d7c081d              0.589  market for electric arc furnace slag          '3821'            electric arc furnace slag                     industry          other industry        kg        
#>  5 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    RER                               3b190359-a32e-5294-af63-983f38ce6525_759b89bd-3aa6-42ad-b767-5bb9ef5d331d              0.602  market group for electricity, medium voltage  '3510'            electricity, medium voltage                   power             total power           kWh       
#>  6 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    GLO                               2c92cdcd-29df-53ba-a209-77c7de201d14_6e316c64-0481-4832-b097-296e14c0b02f              7.32   market for ferrochromium, high-carbon, 68% Cr '2410'            ferrochromium, high-carbon, 68% Cr            metals            iron & steel          kg        
#>  7 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    Europe, without Russia and Turkey 9392c694-12a6-5cd7-a421-d4866359df2c_0d3eda5a-4601-4573-9549-0701c459ab88              0.710  market for hard coal                          '0510'            hard coal                                     energy            coal energy           kg        
#>  8 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    CH                                c18c6cc9-4a26-5c47-9ea9-8635ff2c158e_240c1a3c-1aba-4528-afc3-3f27f56583be              0.0106 market for inert waste, for final disposal    '3821'            inert waste, for final disposal               industry          other industry        kg        
#>  9 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    RER                               c4ec0b1e-2a3b-5700-871c-2adbbb29bc1d_4f312355-ac65-4635-8fb2-006dba64ce60              0.0581 market for iron scrap, sorted, pressed        '3830'            iron scrap, sorted, pressed                   industry          other industry        kg        
#> 10 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb iron-nickel-chromium alloy production                    CH                                7361f7fb-5cf2-598c-823a-a4b7e50c3d28_a9007f10-7e39-4d50-8f4a-d6d03ce3d673              1.22   market for natural gas, high pressure         '3520'            natural gas, high pressure                    energy            gas energy            m3        
#> # ℹ 75 more rows
#> # A tibble: 2 × 1
#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb
#> 2 833caa78-30df-4374-900f-7f88ab44075b

Answering your questions

In your comments I identify the following questions:

Yes. All *companies datasets in tiltToyData have a clustered column. You can see that in the website of tiltToyData under Examples. The new version of emissions_profile_any_companies in PR #19 does not drop any column (if anything it will add some columns).

Yes. In my reprex above search for "anti_join(products, companies)". You'll see

# *uuid in products that do NOT match *uuid in companies

... more code

#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 833caa78-30df-4374-900f-7f88ab44075b
#> 2 bf94b5a7-b7a2-46d1-bb95-84bc560b12fb

Yes. In my reprex above search for "anti_join(companies, products)". You'll see

# *uuid in companies that do NOT match *uuid in products

... more code

#>   activity_uuid_product_uuid          
#>   <chr>                               
#> 1 833caa78-30df-4374-900f-7f88ab44075b

Code and toy data combined

You may explore the code and data for yourself:

  1. Install the latest version of tiltIndicator as usual.
  2. Install tiltToyData#19 with devtools::install_github("2DegreesInvesting/tiltToyData#19")
  3. Copy and run the reprex under Usage in tiltIndicator in https://github.com/2DegreesInvesting/tiltToyData/pull/19#issuecomment-1881121703.

Is this what you expect?

AnneSchoenauer commented 5 months ago

Dear @maurolepore, Thanks a lot for this. Sooo there are two different possibilities of matching *uuids in products that do NOT match uuid in companies AND uuid in companies that do NOT match uiid in products. And I am afraid that I don't understand one case of it. This case here is correct:

uuid in products that do NOT match uuid in companies

anti_join(products, companies) |> print() |> distinct(activity_uuid_product_uuid)

However, this case here:

uuid in companies that do NOT match uuid in products

anti_join(companies, products) |> print() |> distinct(activity_uuid_product_uuid)

that shouldn't be the case. As remeber the product list shows ALL ecoinvent data which means there should be ALL uuids available if the company produces a clustered product that was able to be matched to a uuid.

Could we change this to make teh toy data more realistic?

Maybe this could also be a test in general that we can implement? If there are UUIDs in the company dataset which are not in the product dataset then there must be something wrong as per definition the product dataset (in a real world) captures all ecoinvent data. What do you think? Would that be a good test to also implement?

Also please note that with this ticket here might change the toy Data a bit as we then have in the company data set also company_ids without any uuid. But this is something that we can discuss later as well.

AnneSchoenauer commented 5 months ago

In any case thanks a lot for these reprexes as they really really help to become very precise in what we want!

maurolepore commented 5 months ago

Thanks @AnneSchoenauer for paying close attention.

I think we need to turn your expectation into an error message in tiltIndicator -- since it's a metholologial issue.

See if the reprex here captures the problem you see.

Regarding the toy datasets here, I'll remove the *uuid in companies that don't match *uuid in products or inputs.

@kalashsinghal any idea how those unexpected *uuid got into the toy data? Maybe it's just an artifact of the artificial nature of these datasets, but any thoughts are welcome.

kalashsinghal commented 5 months ago

Regarding the toy datasets here, I'll remove the uuid in companies that don't match uuid in products or inputs. @kalashsinghal any idea how those unexpected *uuid got into the toy data? Maybe it's just an artifact of the artificial nature of these datasets, but any thoughts are welcome.

Dear @maurolepore @AnneSchoenauer I checked this issue with real datasets and there is no such uuid in companies that don't match uuid in products or inputs. This means we have atleast a single match for each uuid in companies dataset from uuid in products or inputs.

I am not 100% sure how the toy datasets are made, I would @maurolepore recommend you to please check how such cases appeared in toy data. Please let me know if you find anything odd! Thanks!