CORE-forge / coresoi

R package for CORE set of indicators
https://core-forge.github.io/coresoi/
MIT License

verbosely select Emergency Scenario #3

Closed NiccoloSalvini closed 1 year ago

NiccoloSalvini commented 1 year ago

context

We would like the user to select the emergency scenario by a string/pattern, say "covid19" or "terremoto aquila", i.e. a string instead of the emergency date. This is more intuitive and prevents selecting the wrong date for an emergency outbreak, since we know those dates are formally stated by the Authority.

Current behavior

Say we are interested in calculating ind_11, whose target statistical unit is "provincia" and which measures the "Distance between award value and sums paid" indicator for a given emergency scenario. The scenario is currently defined by a date in ymd() format; the default is lubridate::ymd("2017-06-30") (terremoto aquila). In the call below the target statistical unit of measurement is "cf_amministrazione_appaltante", i.e. the contracting authority.

library(coresoi)
ind_11(
    data = mock_data_core, 
    publication_date = data_pubblicazione,
    award_value = importo_aggiudicazione, 
    sums_paid = importo_lotto,
    stat_unit = cf_amministrazione_appaltante,
    outbreak_starting_date = lubridate::ymd("2017-06-30")
  )

which, via generate_indicator_schema(), returns the following table:

# A tibble: 224 × 12
   indicator_id indicator_name    indic…¹ aggre…² aggre…³ aggre…⁴ emerg…⁵ emerg…⁶ count…⁷ count…⁸
          <dbl> <chr>               <dbl> <fct>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>  
 1          5.1 Awarded notice c…   0.95  AGRIGE… ISTAT1  provin…       3 Other   1       Italy  
 2          5.1 Awarded notice c…   0.543 AGRIGE… ISTAT1  provin…       3 Other   1       Italy  
 3          5.1 Awarded notice c…   0.685 ALESSA… ISTAT1  provin…       3 Other   1       Italy  
 4          5.1 Awarded notice c…   0.780 ALESSA… ISTAT1  provin…       3 Other   1       Italy  
 5          5.1 Awarded notice c…   1     ANCONA  ISTAT1  provin…       3 Other   1       Italy  
 6          5.1 Awarded notice c…   0.922 ANCONA  ISTAT1  provin…       3 Other   1       Italy  
 7          5.1 Awarded notice c…   0.896 AREZZO  ISTAT1  provin…       3 Other   1       Italy  
 8          5.1 Awarded notice c…   0.987 AREZZO  ISTAT1  provin…       3 Other   1       Italy  
 9          5.1 Awarded notice c…   0.955 ASCOLI… ISTAT1  provin…       3 Other   1       Italy  
10          5.1 Awarded notice c…   0.924 ASCOLI… ISTAT1  provin…       3 Other   1       Italy  
# … with 214 more rows, 2 more variables: indicator_last_update <dttm>, data_last_update <dttm>,
#   and abbreviated variable names ¹​indicator_value, ²​aggregation_name, ³​aggregation_id,
#   ⁴​aggregation_type, ⁵​emergency_id, ⁶​emergency_name, ⁷​country_id, ⁸​country_name

Moreover, say the user selects a different outbreak_starting_date for the very same emergency, e.g. outbreak_starting_date = lubridate::ymd("2017-09-30"), 3 months later. This would lead to different results, since the pre/post aggregation would involve different groups.
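To make the point concrete, here is a minimal sketch on made-up data. The `contracts` data frame, the `split_pre_post()` helper, and the `>=` cut-off rule are all assumptions for illustration; coresoi may split the periods differently. The same rows land in different pre/post groups when the outbreak date moves by three months.

```r
# Hypothetical toy data: five publication dates around mid-2017
contracts <- data.frame(
  data_pubblicazione = as.Date(c("2017-05-15", "2017-07-10", "2017-08-20",
                                 "2017-10-05", "2017-11-30"))
)

# Hypothetical helper: label each row "pre" or "post" relative to the outbreak date
split_pre_post <- function(dates, outbreak_starting_date) {
  ifelse(dates >= outbreak_starting_date, "post", "pre")
}

groups_june <- split_pre_post(contracts$data_pubblicazione, as.Date("2017-06-30"))
groups_sept <- split_pre_post(contracts$data_pubblicazione, as.Date("2017-09-30"))

table(groups_june)  # 1 pre, 4 post
table(groups_sept)  # 3 pre, 2 post
```

Any indicator that compares the two periods therefore works on different groups, even though the emergency is the same.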

Expected behavior

Instead of specifying the date, we would force the user to set a scenario by string ("covid19", "Terremoto Aquila"), controlling the options available. On one side this would prevent specifying wrong dates (and losing informative power); on the other it would offer a friendlier user interface to the indicators. It would also be coupled with automatic error checks suggesting the alternative emergency scenarios.

library(coresoi)
ind_11(
    data = mock_data_core, 
    publication_date = data_pubblicazione,
    award_value = importo_aggiudicazione, 
    sums_paid = importo_lotto,
    stat_unit = cf_amministrazione_appaltante,
    emergency_scenario = "covid19"
  )

ind_11(
    data = mock_data_core, 
    publication_date = data_pubblicazione,
    award_value = importo_aggiudicazione, 
    sums_paid = importo_lotto,
    stat_unit = cf_amministrazione_appaltante,
    emergency_scenario = "terremoto aquila"
  )

... and when you misspell it, it suggests something like:

ind_11(
    data = mock_data_core, 
    publication_date = data_pubblicazione,
    award_value = importo_aggiudicazione, 
    sums_paid = importo_lotto,
    stat_unit = cf_amministrazione_appaltante,
    emergency_scenario = "terremoto milano"
  )

## 
Error in "emergency_scenario": scenarios available are: "covid19", "terremoto aquila", "ucraine-russia war"

This function should use a named list to store the emergency names and their corresponding dates. It then checks if the input emergency name exists in the list, and if it does, it returns the corresponding date. Otherwise, it returns an error message.

This should be passed to each indicator through generate_indicator_schema(), by a function that, given an emergency scenario string, sets the date and then computes the pre/post aggregation.

This is just a sketch of an implementation:

emergency_dates <- function(emergency_name) {
  # create a named list of emergency names and their corresponding dates
  emergency_list <- list(
    "Coronavirus" = lubridate::ymd("2020-01-31"),
    "Terremoto Aquila" = lubridate::ymd("2017-06-30"),
    "Terremoto Ischia" = lubridate::dmy("21/08/2017"),
    "Terremoto Centro Italia 2016-2017" = lubridate::dmy("24/08/2016"),
    "Terremoto Emilia-Romagna e Lombardia 2012" = lubridate::dmy("20/05/2012"),
    "Etna - Eruzione 2008-2009" = lubridate::dmy("13/05/2008"),
    "Etna - Eruzione 2006-2007" = lubridate::dmy("14/07/2006"),
    "Etna - Eruzione 2002-2003" = lubridate::dmy("28/10/2002"),
    "Stromboli - Eruzione 2007" = lubridate::dmy("24/08/2016")  # FIXME: same date as the Centro Italia entry
    # ...
  )

  # check if the input emergency name exists in the list
  if (emergency_name %in% names(emergency_list)) {
    # if it exists, return the corresponding date
    # ([[ ]] extracts the date itself, [ ] would return a one-element list)
    return(emergency_list[[emergency_name]])
  } else {
    # if the input emergency name does not exist in the list, raise an error
    stop("Emergency not found in list.", call. = FALSE)
  }
}

As for handling spelling errors, you could use the agrep function in R to implement fuzzy matching. This would allow the function to find approximate matches for the input emergency name, even if it is misspelled. Moreover, we might want something like an emergency_type describing the kind of emergency. Say "terremoto Aquila" is the emergency scenario we are looking for: its type is "seismic", whereas "coronavirus" is a "sanitary" type of emergency.
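A minimal sketch of that idea, using only base R. The `emergencies` table and the `find_emergency()` helper are hypothetical; names and dates are copied from the sketch above and the `type` values are the ones mentioned in this comment. agrep() collects the candidates within the allowed distance and adist() ranks them, so the closest name wins when several candidates match.

```r
# Hypothetical lookup table with an extra emergency_type-like column
emergencies <- data.frame(
  name = c("Coronavirus", "Terremoto Aquila", "Terremoto Ischia"),
  start_date = as.Date(c("2020-01-31", "2017-06-30", "2017-08-21")),
  type = c("sanitary", "seismic", "seismic"),
  stringsAsFactors = FALSE
)

find_emergency <- function(emergency_name, max_dist = 0.3) {
  # indices of all names that approximately match the input
  hits <- agrep(emergency_name, emergencies$name,
                max.distance = max_dist, ignore.case = TRUE)
  if (length(hits) == 0) {
    stop("No emergency matches '", emergency_name, "'.", call. = FALSE)
  }
  # rank the candidates by edit distance and keep the closest one
  d <- adist(emergency_name, emergencies$name[hits], ignore.case = TRUE)
  emergencies[hits[which.min(d)], ]
}

find_emergency("terremoto acuila")$type
```

Ranking with adist() matters here because "Terremoto Ischia" is also within the agrep() tolerance for this input.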

A further point

We might also want to dynamically pick up new emergencies as they are declared (along with their dates). This is a reference where they can be extracted. We may want to:

giuliogcantone commented 1 year ago

I would keep 2 parameters: emergency_scenario and custom_start_date.

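3 possible outputs:

  • No parameter is specified: the function reports an error and asks to specify an emergency.
  • The user specifies emergency_scenario: the user gets what they asked for OR an error.
  • The user DOES NOT specify emergency_scenario, but specifies custom_start_date with a valid ymd: in this case the user gets the custom starting date.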

I suggest lowercasing every name in the list, then forcing tolower() on the input.

  if (tolower(emergency_name) %in% names(emergency_list)) {
    # if it exists, return the corresponding date (look up the lowercased name)
    return(emergency_list[[tolower(emergency_name)]])
  } else {
    # if the input emergency name does not exist in the list, raise an error
    stop("Emergency not found in list.", call. = FALSE)
  }
}

As for handling spelling errors, you could use the agrep function in R to implement fuzzy matching. This would allow the function to find approximate matches for the input emergency name, even if it is misspelled.

I would suggest an output like:

Error in "emergency_scenario": 'terremoto acuila' is not a scenario. Suggested scenario: 'terremoto aquila'. Scenarios available: "covid19", "terremoto aquila", "ucraine-russia war"
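A message of that shape could be assembled along these lines. This is only a sketch: `scenario_error()` is a hypothetical helper, and it picks the suggestion with base R's adist() (edit distance) rather than whatever matching the package ends up using.

```r
available <- c("covid19", "terremoto aquila", "ucraine-russia war")

# Hypothetical helper: build the error text with the closest scenario as suggestion
scenario_error <- function(input, scenarios = available) {
  d <- drop(adist(input, scenarios, ignore.case = TRUE))
  suggestion <- scenarios[which.min(d)]
  sprintf(
    "Error in \"emergency_scenario\": '%s' is not a scenario. Suggested scenario: '%s'. Scenarios available: %s",
    input, suggestion, paste0('"', scenarios, '"', collapse = ", ")
  )
}

cat(scenario_error("terremoto acuila"), "\n")
```

Inside an indicator the same text would be passed to stop() instead of being printed.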

dynamically refer to the url (via scraping?)

Scraping is imo not worth it: the scraper must recognize the first date in Italian format with a regex. Say the Ministry page once uses a weird format for whatever reason: the scraper will fail and possibly pick another date within the html.

Benefits: if a within-package emergencies updater exists, users do not need to update coresoi to update the list of emergencies, and coresoi does not need a new release for each emergency.

Another problem: scrapers are not very reliable. Say that once in 500 attempts a connection to a url randomly fails. With 20 urls to scrape, roughly 1 in 25 updates of the list will fail to scrape all the emergencies, and the user will not know this unless a tester function is implemented alongside the scraper.
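The back-of-the-envelope figure can be checked quickly, assuming the failures are independent and the 1-in-500 rate (itself just an illustrative guess) applies to every request:

```r
p_fail <- 1 / 500   # assumed chance that a single request fails
n_urls <- 20        # number of pages scraped per update

# Probability that at least one of the n_urls requests fails
p_update_incomplete <- 1 - (1 - p_fail)^n_urls
round(p_update_incomplete, 4)   # 0.0392, i.e. about 1 in 25
```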

If the list is centralised, it will be possible to set English aliases for each emergency, e.g. terremoto aquila = aquila earthquake.
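One way the aliases could be stored is a named character vector that maps every accepted spelling to a canonical scenario name. The `aliases` vector and `resolve_alias()` below are hypothetical and only cover the example given here.

```r
# Hypothetical alias table: each lowercase alias points to a canonical scenario name
aliases <- c(
  "terremoto aquila"  = "terremoto aquila",
  "aquila earthquake" = "terremoto aquila"
)

resolve_alias <- function(input) {
  key <- tolower(trimws(input))
  if (!key %in% names(aliases)) {
    stop("Unknown emergency scenario: '", input, "'.", call. = FALSE)
  }
  unname(aliases[key])
}

resolve_alias("Aquila Earthquake")
```

The date lookup then only needs to know the canonical names.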

NiccoloSalvini commented 1 year ago

3 possible outputs:

  • No parameter is specified: the function reports an error and asks to specify an emergency.
  • The user specifies emergency_scenario: the user gets what they asked for OR an error.
  • The user DOES NOT specify emergency_scenario, but specifies custom_start_date with a valid ymd: in this case the user gets the custom starting date.
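The three cases could be handled by a small resolver like the sketch below. `resolve_start_date()` and `scenario_dates` are hypothetical names, the dates are taken from the implementation sketch earlier in the thread, and the code assumes emergency_scenario takes precedence when both arguments are given.

```r
# Hypothetical table of known scenarios (lowercase names, ymd strings)
scenario_dates <- c(
  "coronavirus"      = "2020-01-31",
  "terremoto aquila" = "2017-06-30"
)

resolve_start_date <- function(emergency_scenario = NULL, custom_start_date = NULL) {
  # Case 2: a scenario is given, so return its date or fail
  if (!is.null(emergency_scenario)) {
    key <- tolower(emergency_scenario)
    if (!key %in% names(scenario_dates)) {
      stop("'", emergency_scenario, "' is not a scenario. Scenarios available: ",
           paste(names(scenario_dates), collapse = ", "), call. = FALSE)
    }
    return(as.Date(scenario_dates[[key]]))
  }
  # Case 3: no scenario, but a custom date in ymd format
  if (!is.null(custom_start_date)) {
    parsed <- as.Date(custom_start_date, format = "%Y-%m-%d")
    if (is.na(parsed)) {
      stop("custom_start_date must be a valid date in ymd format.", call. = FALSE)
    }
    return(parsed)
  }
  # Case 1: nothing specified
  stop("Please specify an emergency_scenario (or a custom_start_date).", call. = FALSE)
}

resolve_start_date(emergency_scenario = "Terremoto Aquila")
resolve_start_date(custom_start_date = "1994-02-06")
```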

I got your point and I agree: we should make it conditional on user input. My guess is that people would prefer emergency_name over declaring dates, since you don't have to deal with converting to ymd and it seems more intuitive, but at least you offer an option. My first thought is: if we let the user choose a date, how would we fill emergency_name and emergency_id when they are not specified? E.g. take 6/2/1994 (no emergency_name, a random date, my birthday): what are the emergency_name and emergency_id for that?

On top of that, I believe that moving the official date by 1, 2, or 3 days would most likely not change the indicator estimates much, but that's just my assumption. We should write a test for it, checking that the indicator is consistent over an interval of +/- 1 week around the emergency start (assuming I got the dates right).


I suggest to lowercase any element of the list, then to force a tolower() on the input.

I am a little skeptical about that. It would increase the chance of finding the right match at the cost of having things lowercase in the output, which is not formally correct. We could apply str_to_title after the match, but that seems a little over-engineered. I implemented agrep, which computes an approximate (edit) string distance with a max distance of 0.3 and then returns the most likely match. It behaves well.
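For what it's worth, lowercasing can be confined to the comparison so that the output keeps the original casing. A minimal sketch with a hypothetical `match_scenario()` helper, exact matching only (no fuzzy step):

```r
scenario_names <- c("Coronavirus", "Terremoto Aquila", "Terremoto Ischia")

# Compare in lowercase, but return the name as it is stored in the list
match_scenario <- function(input, choices = scenario_names) {
  idx <- match(tolower(input), tolower(choices))
  if (is.na(idx)) {
    return(NA_character_)
  }
  choices[idx]
}

match_scenario("terremoto aquila")
```

This avoids a str_to_title step, since the canonical spelling never changes.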


I would suggest an output like:

Error in "emergency_scenario": 'terremoto acuila' is not a scenario. Suggested scenario: 'terremoto aquila'. Scenarios available: "covid19", "terremoto aquila", "ucraine-russia war"

This is more informative than the one I coded. I'll do it!


Scraping imo not worth: ...

Totally on your side. It looks like too much effort for relatively low impact.

giuliogcantone commented 1 year ago

conditional to user input. My guess is that people would prefer

I suggest a UX where the user sees the hidden parameter custom_start_date only when browsing the help page. The help page can say that if you want an emergency scenario, you can leave it blank.

Indeed, I implicitly suggested that emergency_scenario overrides custom_start_date. Only if emergency_scenario is blank and custom_start_date is not blank should the function try to force a custom ymd based on the input; in all other cases it gives an error or a pre-set date.

I believe that in the extended release of coresoi, users will want to supply their own inputs, e.g. set a date that is not an emergency but, say, the election of a politician.