Country specific wrappers for JHU/Google data

seabbs commented 2 years ago

At the moment it is clunky, tedious and somewhat opaque to access the regional data imported using the JHU and Google wrappers. It would be really nice to make this data as accessible as our other data sources. One current issue is discoverability: we don't tell users what JHU/Google support until they access those classes, and these sources don't work via get_available_data or get_regional_data. It is also very slow to clean and process these big data sets, which wastes a lot of time if you are only interested in a single region.

This can be done in several ways:

Manually add the source by looking at where JHU/Google get their data from
Import via our current integrations with JHU and Google (using a child class with more specific defaults) and add documentation based on the original source
Automagically import and write new classes using a script in data-raw.

Personally, I think manually adding countries via the JHU/Google integrations (the second option) is probably the way to go, as it gives the best documentation and ease of use.
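To make the child class option concrete, here is a rough sketch of the pattern using R6. Everything in it is an assumption for illustration: MetaSource is a hypothetical stand-in for a multi-country class such as JHU (it is not the package's implementation), ChinaFromMeta is a made-up name, and the data frame holds toy rows rather than real downloads.

```r
library(R6)

# Hypothetical stand-in for a multi-country source class such as JHU;
# this is NOT the covidregionaldata implementation.
MetaSource <- R6Class("MetaSource",
  public = list(
    data = NULL,
    target = NULL,
    initialize = function(data, target = NULL) {
      self$data <- data
      self$target <- target
    },
    # Keep only rows for the target country (case-insensitive match).
    filter = function(country = self$target) {
      if (!is.null(country)) {
        keep <- tolower(self$data$country) == tolower(country)
        self$data <- self$data[keep, , drop = FALSE]
      }
      invisible(self)
    }
  )
)

# Hypothetical child class that only changes defaults: the country is
# preset and the data are filtered as soon as the object is created.
ChinaFromMeta <- R6Class("ChinaFromMeta",
  inherit = MetaSource,
  public = list(
    initialize = function(data) {
      super$initialize(data, target = "China")
      self$filter()
    }
  )
)

# Toy rows for illustration only.
toy <- data.frame(
  country = c("China", "China", "Denmark"),
  region = c("Anhui", "Beijing", "Faroe Islands"),
  cases_total = c(1, 14, 0)
)
china <- ChinaFromMeta$new(toy)
china$data
```

The point of the pattern is that the child class carries no download or cleaning logic of its own, so documentation for the country could live alongside it while the heavy lifting stays in the parent.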

Tagging for discussion @epiforecasts/covidregionaldata @Bisaloo (would appreciate your thoughts). I am happy to look at how to do this, but it might take a while, so I am also very happy for anyone else interested to have a crack.

Example of accessing the data for a JHU supported region and a Google supported region:

library(covidregionaldata)

jhu <- JHU$new(level = "2")
jhu$get()
#> Downloading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
#> Rows: 279 Columns: 566
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr   (2): Province/State, Country/Region
#> dbl (564): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Downloading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv
#> Rows: 279 Columns: 566
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr   (2): Province/State, Country/Region
#> dbl (564): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Downloading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
#> Rows: 264 Columns: 566
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr   (2): Province/State, Country/Region
#> dbl (564): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Cleaning data
#> Processing data
jhu$available_regions()
#> [1] "Australia"      "Canada"         "China"          "Denmark"       
#> [5] "France"         "Netherlands"    "New Zealand"    "United Kingdom"
jhu$filter("China")
#> Filtering data to: China
jhu$process()
#> Processing data
jhu$data
#> $raw
#> $raw$daily_confirmed
#> # A tibble: 279 × 566
#>    `Province/State`  `Country/Region`   Lat   Long `1/22/20` `1/23/20` `1/24/20`
#>    <chr>             <chr>            <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
#>  1 <NA>              Afghanistan       33.9  67.7          0         0         0
#>  2 <NA>              Albania           41.2  20.2          0         0         0
#>  3 <NA>              Algeria           28.0   1.66         0         0         0
#>  4 <NA>              Andorra           42.5   1.52         0         0         0
#>  5 <NA>              Angola           -11.2  17.9          0         0         0
#>  6 <NA>              Antigua and Bar…  17.1 -61.8          0         0         0
#>  7 <NA>              Argentina        -38.4 -63.6          0         0         0
#>  8 <NA>              Armenia           40.1  45.0          0         0         0
#>  9 Australian Capit… Australia        -35.5 149.           0         0         0
#> 10 New South Wales   Australia        -33.9 151.           0         0         0
#> # … with 269 more rows, and 559 more variables: 1/25/20 <dbl>, 1/26/20 <dbl>,
#> #   1/27/20 <dbl>, 1/28/20 <dbl>, 1/29/20 <dbl>, 1/30/20 <dbl>, 1/31/20 <dbl>,
#> #   2/1/20 <dbl>, 2/2/20 <dbl>, 2/3/20 <dbl>, 2/4/20 <dbl>, 2/5/20 <dbl>,
#> #   2/6/20 <dbl>, 2/7/20 <dbl>, 2/8/20 <dbl>, 2/9/20 <dbl>, 2/10/20 <dbl>,
#> #   2/11/20 <dbl>, 2/12/20 <dbl>, 2/13/20 <dbl>, 2/14/20 <dbl>, 2/15/20 <dbl>,
#> #   2/16/20 <dbl>, 2/17/20 <dbl>, 2/18/20 <dbl>, 2/19/20 <dbl>, 2/20/20 <dbl>,
#> #   2/21/20 <dbl>, 2/22/20 <dbl>, 2/23/20 <dbl>, 2/24/20 <dbl>, …
#> 
#> $raw$daily_deaths
#> # A tibble: 279 × 566
#>    `Province/State`  `Country/Region`   Lat   Long `1/22/20` `1/23/20` `1/24/20`
#>    <chr>             <chr>            <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
#>  1 <NA>              Afghanistan       33.9  67.7          0         0         0
#>  2 <NA>              Albania           41.2  20.2          0         0         0
#>  3 <NA>              Algeria           28.0   1.66         0         0         0
#>  4 <NA>              Andorra           42.5   1.52         0         0         0
#>  5 <NA>              Angola           -11.2  17.9          0         0         0
#>  6 <NA>              Antigua and Bar…  17.1 -61.8          0         0         0
#>  7 <NA>              Argentina        -38.4 -63.6          0         0         0
#>  8 <NA>              Armenia           40.1  45.0          0         0         0
#>  9 Australian Capit… Australia        -35.5 149.           0         0         0
#> 10 New South Wales   Australia        -33.9 151.           0         0         0
#> # … with 269 more rows, and 559 more variables: 1/25/20 <dbl>, 1/26/20 <dbl>,
#> #   1/27/20 <dbl>, 1/28/20 <dbl>, 1/29/20 <dbl>, 1/30/20 <dbl>, 1/31/20 <dbl>,
#> #   2/1/20 <dbl>, 2/2/20 <dbl>, 2/3/20 <dbl>, 2/4/20 <dbl>, 2/5/20 <dbl>,
#> #   2/6/20 <dbl>, 2/7/20 <dbl>, 2/8/20 <dbl>, 2/9/20 <dbl>, 2/10/20 <dbl>,
#> #   2/11/20 <dbl>, 2/12/20 <dbl>, 2/13/20 <dbl>, 2/14/20 <dbl>, 2/15/20 <dbl>,
#> #   2/16/20 <dbl>, 2/17/20 <dbl>, 2/18/20 <dbl>, 2/19/20 <dbl>, 2/20/20 <dbl>,
#> #   2/21/20 <dbl>, 2/22/20 <dbl>, 2/23/20 <dbl>, 2/24/20 <dbl>, …
#> 
#> $raw$daily_recovered
#> # A tibble: 264 × 566
#>    `Province/State`  `Country/Region`   Lat   Long `1/22/20` `1/23/20` `1/24/20`
#>    <chr>             <chr>            <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
#>  1 <NA>              Afghanistan       33.9  67.7          0         0         0
#>  2 <NA>              Albania           41.2  20.2          0         0         0
#>  3 <NA>              Algeria           28.0   1.66         0         0         0
#>  4 <NA>              Andorra           42.5   1.52         0         0         0
#>  5 <NA>              Angola           -11.2  17.9          0         0         0
#>  6 <NA>              Antigua and Bar…  17.1 -61.8          0         0         0
#>  7 <NA>              Argentina        -38.4 -63.6          0         0         0
#>  8 <NA>              Armenia           40.1  45.0          0         0         0
#>  9 Australian Capit… Australia        -35.5 149.           0         0         0
#> 10 New South Wales   Australia        -33.9 151.           0         0         0
#> # … with 254 more rows, and 559 more variables: 1/25/20 <dbl>, 1/26/20 <dbl>,
#> #   1/27/20 <dbl>, 1/28/20 <dbl>, 1/29/20 <dbl>, 1/30/20 <dbl>, 1/31/20 <dbl>,
#> #   2/1/20 <dbl>, 2/2/20 <dbl>, 2/3/20 <dbl>, 2/4/20 <dbl>, 2/5/20 <dbl>,
#> #   2/6/20 <dbl>, 2/7/20 <dbl>, 2/8/20 <dbl>, 2/9/20 <dbl>, 2/10/20 <dbl>,
#> #   2/11/20 <dbl>, 2/12/20 <dbl>, 2/13/20 <dbl>, 2/14/20 <dbl>, 2/15/20 <dbl>,
#> #   2/16/20 <dbl>, 2/17/20 <dbl>, 2/18/20 <dbl>, 2/19/20 <dbl>, 2/20/20 <dbl>,
#> #   2/21/20 <dbl>, 2/22/20 <dbl>, 2/23/20 <dbl>, 2/24/20 <dbl>, …
#> 
#> 
#> $clean
#> # A tibble: 160,170 × 10
#>    date       level_1_region level_1_region_code level_2_region level_2_region_…
#>    <date>     <chr>          <chr>               <chr>                     <dbl>
#>  1 2020-01-22 Afghanistan    AFG                 <NA>                         NA
#>  2 2020-01-23 Afghanistan    AFG                 <NA>                         NA
#>  3 2020-01-24 Afghanistan    AFG                 <NA>                         NA
#>  4 2020-01-25 Afghanistan    AFG                 <NA>                         NA
#>  5 2020-01-26 Afghanistan    AFG                 <NA>                         NA
#>  6 2020-01-27 Afghanistan    AFG                 <NA>                         NA
#>  7 2020-01-28 Afghanistan    AFG                 <NA>                         NA
#>  8 2020-01-29 Afghanistan    AFG                 <NA>                         NA
#>  9 2020-01-30 Afghanistan    AFG                 <NA>                         NA
#> 10 2020-01-31 Afghanistan    AFG                 <NA>                         NA
#> # … with 160,160 more rows, and 5 more variables: cases_total <dbl>,
#> #   deaths_total <dbl>, recovered_total <dbl>, Lat <dbl>, Long <dbl>
#> 
#> $filtered
#> # A tibble: 20,232 × 10
#>    date       level_1_region level_1_region_code level_2_region level_2_region_…
#>    <date>     <chr>          <chr>               <chr>                     <dbl>
#>  1 2020-01-22 China          CHN                 Anhui                        NA
#>  2 2020-01-23 China          CHN                 Anhui                        NA
#>  3 2020-01-24 China          CHN                 Anhui                        NA
#>  4 2020-01-25 China          CHN                 Anhui                        NA
#>  5 2020-01-26 China          CHN                 Anhui                        NA
#>  6 2020-01-27 China          CHN                 Anhui                        NA
#>  7 2020-01-28 China          CHN                 Anhui                        NA
#>  8 2020-01-29 China          CHN                 Anhui                        NA
#>  9 2020-01-30 China          CHN                 Anhui                        NA
#> 10 2020-01-31 China          CHN                 Anhui                        NA
#> # … with 20,222 more rows, and 5 more variables: cases_total <dbl>,
#> #   deaths_total <dbl>, recovered_total <dbl>, Lat <dbl>, Long <dbl>
#> 
#> $processed
#> # A tibble: 20,232 × 17
#>    date       country iso_3166_1_alpha_3 region    iso_code cases_new cases_total
#>    <date>     <chr>   <chr>              <chr>        <dbl>     <dbl>       <dbl>
#>  1 2020-01-22 China   CHN                Anhui           NA         1           1
#>  2 2020-01-22 China   CHN                Beijing         NA        14          14
#>  3 2020-01-22 China   CHN                Chongqing       NA         6           6
#>  4 2020-01-22 China   CHN                Fujian          NA         1           1
#>  5 2020-01-22 China   CHN                Gansu           NA         0           0
#>  6 2020-01-22 China   CHN                Guangdong       NA        26          26
#>  7 2020-01-22 China   CHN                Guangxi         NA         2           2
#>  8 2020-01-22 China   CHN                Guizhou         NA         1           1
#>  9 2020-01-22 China   CHN                Hainan          NA         4           4
#> 10 2020-01-22 China   CHN                Hebei           NA         1           1
#> # … with 20,222 more rows, and 10 more variables: deaths_new <dbl>,
#> #   deaths_total <dbl>, recovered_new <dbl>, recovered_total <dbl>,
#> #   hosp_new <dbl>, hosp_total <dbl>, tested_new <dbl>, tested_total <dbl>,
#> #   Lat <dbl>, Long <dbl>

google <- Google$new(level = "2", get = TRUE) 
#> Downloading data from https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv
#> Rows: 7534538 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): key
#> dbl  (8): new_confirmed, new_deceased, new_recovered, new_tested, total_conf...
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Downloading data from https://storage.googleapis.com/covid19-open-data/v2/hospitalizations.csv
#> Rows: 1003422 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): key
#> dbl  (9): new_hospitalized, total_hospitalized, current_hospitalized, new_in...
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Downloading data from https://storage.googleapis.com/covid19-open-data/v2/index.csv
#> Rows: 22578 Columns: 15
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (14): key, place_id, wikidata, datacommons, country_code, country_name, ...
#> dbl  (1): aggregation_level
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Cleaning data
#> Processing data
google$available_regions()
#>  [1] "Switzerland"                      "Argentina"                       
#>  [3] "Brazil"                           "Spain"                           
#>  [5] "Germany"                          "France"                          
#>  [7] "Indonesia"                        "Thailand"                        
#>  [9] "United States of America"         "Japan"                           
#> [11] "South Korea"                      "China"                           
#> [13] "Ukraine"                          "Philippines"                     
#> [15] "Australia"                        "Canada"                          
#> [17] "Taiwan"                           "United Kingdom"                  
#> [19] "Sweden"                           "Estonia"                         
#> [21] "Mexico"                           "Italy"                           
#> [23] "Austria"                          "Pakistan"                        
#> [25] "Portugal"                         "Belgium"                         
#> [27] "Czech Republic"                   "Chile"                           
#> [29] "Peru"                             "Colombia"                        
#> [31] "Israel"                           "Netherlands"                     
#> [33] "India"                            "Poland"                          
#> [35] "Haiti"                            "Norway"                          
#> [37] "Afghanistan"                      "Mozambique"                      
#> [39] "Russia"                           "South Africa"                    
#> [41] "Sierra Leone"                     "Romania"                         
#> [43] "Democratic Republic of the Congo" "Venezuela"                       
#> [45] "Sudan"                            "Kenya"                           
#> [47] "Bangladesh"                       "Libya"
google$filter("portugal")
#> Filtering data to: Portugal
google$process()
#> Processing data

Created on 2021-08-06 by the reprex package (v2.0.0)

Example for a fully supported country:

library(covidregionaldata)

italy <- Italy$new(get = TRUE)
#> Downloading data from https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv
#> Rows: 11109 Columns: 30
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr   (8): stato, codice_regione, denominazione_regione, note, note_test, no...
#> dbl  (21): lat, long, ricoverati_con_sintomi, terapia_intensiva, totale_ospe...
#> dttm  (1): data
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Cleaning data
#> Processing data
italy$available_regions()       
#>  [1] "Abruzzo"               "Basilicata"            "Calabria"             
#>  [4] "Campania"              "Emilia-Romagna"        "Friuli Venezia Giulia"
#>  [7] "Lazio"                 "Liguria"               "Lombardia"            
#> [10] "Marche"                "Molise"                "Piemonte"             
#> [13] "Puglia"                "Sardegna"              "Sicilia"              
#> [16] "Toscana"               "Trentino-Alto Adige"   "Umbria"               
#> [19] "Valle d'Aosta"         "Veneto"
italy$supported_levels      
#> [[1]]
#> [1] "1"
italy$data
#> $raw
#> $raw$main
#> # A tibble: 11,109 × 30
#>    data                stato codice_regione denominazione_regione   lat  long
#>    <dttm>              <chr> <chr>          <chr>                 <dbl> <dbl>
#>  1 2020-02-24 18:00:00 ITA   13             Abruzzo                42.4 13.4 
#>  2 2020-02-24 18:00:00 ITA   17             Basilicata             40.6 15.8 
#>  3 2020-02-24 18:00:00 ITA   18             Calabria               38.9 16.6 
#>  4 2020-02-24 18:00:00 ITA   15             Campania               40.8 14.3 
#>  5 2020-02-24 18:00:00 ITA   08             Emilia-Romagna         44.5 11.3 
#>  6 2020-02-24 18:00:00 ITA   06             Friuli Venezia Giulia  45.6 13.8 
#>  7 2020-02-24 18:00:00 ITA   12             Lazio                  41.9 12.5 
#>  8 2020-02-24 18:00:00 ITA   07             Liguria                44.4  8.93
#>  9 2020-02-24 18:00:00 ITA   03             Lombardia              45.5  9.19
#> 10 2020-02-24 18:00:00 ITA   11             Marche                 43.6 13.5 
#> # … with 11,099 more rows, and 24 more variables: ricoverati_con_sintomi <dbl>,
#> #   terapia_intensiva <dbl>, totale_ospedalizzati <dbl>,
#> #   isolamento_domiciliare <dbl>, totale_positivi <dbl>,
#> #   variazione_totale_positivi <dbl>, nuovi_positivi <dbl>,
#> #   dimessi_guariti <dbl>, deceduti <dbl>, casi_da_sospetto_diagnostico <dbl>,
#> #   casi_da_screening <dbl>, totale_casi <dbl>, tamponi <dbl>,
#> #   casi_testati <dbl>, note <chr>, ingressi_terapia_intensiva <dbl>, …
#> 
#> 
#> $clean
#> # A tibble: 10,580 × 6
#>    date       level_1_region        level_1_region_code cases_total deaths_total
#>    <date>     <chr>                 <chr>                     <dbl>        <dbl>
#>  1 2020-02-24 Abruzzo               IT-65                         0            0
#>  2 2020-02-24 Basilicata            IT-77                         0            0
#>  3 2020-02-24 Calabria              IT-78                         0            0
#>  4 2020-02-24 Campania              IT-72                         0            0
#>  5 2020-02-24 Emilia-Romagna        IT-45                        18            0
#>  6 2020-02-24 Friuli Venezia Giulia IT-36                         0            0
#>  7 2020-02-24 Lazio                 IT-62                         3            0
#>  8 2020-02-24 Liguria               IT-42                         0            0
#>  9 2020-02-24 Lombardia             IT-25                       172            6
#> 10 2020-02-24 Marche                IT-57                         0            0
#> # … with 10,570 more rows, and 1 more variable: tested_total <dbl>
#> 
#> $filtered
#> # A tibble: 10,580 × 6
#>    date       level_1_region        level_1_region_code cases_total deaths_total
#>    <date>     <chr>                 <chr>                     <dbl>        <dbl>
#>  1 2020-02-24 Abruzzo               IT-65                         0            0
#>  2 2020-02-24 Basilicata            IT-77                         0            0
#>  3 2020-02-24 Calabria              IT-78                         0            0
#>  4 2020-02-24 Campania              IT-72                         0            0
#>  5 2020-02-24 Emilia-Romagna        IT-45                        18            0
#>  6 2020-02-24 Friuli Venezia Giulia IT-36                         0            0
#>  7 2020-02-24 Lazio                 IT-62                         3            0
#>  8 2020-02-24 Liguria               IT-42                         0            0
#>  9 2020-02-24 Lombardia             IT-25                       172            6
#> 10 2020-02-24 Marche                IT-57                         0            0
#> # … with 10,570 more rows, and 1 more variable: tested_total <dbl>
#> 
#> $processed
#> # A tibble: 10,580 × 13
#>    date       regioni   iso_3166_2 cases_new cases_total deaths_new deaths_total
#>    <date>     <chr>     <chr>          <dbl>       <dbl>      <dbl>        <dbl>
#>  1 2020-02-24 Abruzzo   IT-65              0           0          0            0
#>  2 2020-02-24 Basilica… IT-77              0           0          0            0
#>  3 2020-02-24 Calabria  IT-78              0           0          0            0
#>  4 2020-02-24 Campania  IT-72              0           0          0            0
#>  5 2020-02-24 Emilia-R… IT-45             18          18          0            0
#>  6 2020-02-24 Friuli V… IT-36              0           0          0            0
#>  7 2020-02-24 Lazio     IT-62              3           3          0            0
#>  8 2020-02-24 Liguria   IT-42              0           0          0            0
#>  9 2020-02-24 Lombardia IT-25            172         172          6            6
#> 10 2020-02-24 Marche    IT-57              0           0          0            0
#> # … with 10,570 more rows, and 6 more variables: recovered_new <dbl>,
#> #   recovered_total <dbl>, hosp_new <dbl>, hosp_total <dbl>, tested_new <dbl>,
#> #   tested_total <dbl>

Created on 2021-08-06 by the reprex package (v2.0.0)

RichardMN commented 2 years ago

I'd not looked at these from the user point of view. From your reprex it looks as though it could be fairly straightforward to make something like a wrapper or catcher, or a pair. My general use is to just call get_regional_data for a specific country.

We could have something like try_regional_data, which would check whether there is already a dedicated class and, failing that, see whether JHU or Google has regional data available for the country, then wrap up the get, filter and process steps.
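As a rough sketch of the lookup order only (not the download itself), something like the following might work. The function name pick_regional_source and its arguments are hypothetical; the country vectors are typed in by hand here, whereas in the package they would presumably come from the existing classes.

```r
# Hypothetical helper: decide which source should serve a country.
# `dedicated`, `jhu` and `google` are character vectors of country names,
# passed in explicitly so the sketch runs without any downloads.
pick_regional_source <- function(country, dedicated, jhu, google) {
  has <- function(pool) tolower(country) %in% tolower(pool)
  if (has(dedicated)) {
    "dedicated"
  } else if (has(jhu)) {
    "JHU"
  } else if (has(google)) {
    "Google"
  } else {
    stop("No regional data found for ", country, call. = FALSE)
  }
}

# Hand-typed examples: Italy has a dedicated class; the JHU vector is the
# available_regions() output shown earlier; the Google vector is a subset.
dedicated <- c("Italy")
jhu <- c("Australia", "Canada", "China", "Denmark",
         "France", "Netherlands", "New Zealand", "United Kingdom")
google <- c("Portugal", "China", "France", "Kenya")

pick_regional_source("Italy", dedicated, jhu, google)
#> [1] "dedicated"
pick_regional_source("china", dedicated, jhu, google)
#> [1] "JHU"
pick_regional_source("Portugal", dedicated, jhu, google)
#> [1] "Google"
```

The returned label could double as the source field, so the user can see where the data came from without choosing it themselves.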

Similarly, we could have survey_available_regional_data, which could return what get_available_data provides, then poll JHU and Google to see which countries currently provide regional data (possibly with a check for freshness), and produce a table of countries where regional data is available.
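A minimal sketch of what the resulting table might look like, assuming each source can report a character vector of country names. The function name survey_sources is hypothetical, the input lists are hand-typed subsets, and the freshness check is left out.

```r
# Hypothetical survey: one row per country, one logical column per source.
# `sources` is a named list of character vectors of country names.
survey_sources <- function(sources) {
  countries <- sort(unique(unlist(sources, use.names = FALSE)))
  # For each source, flag which of the pooled countries it covers.
  flags <- lapply(sources, function(pool) countries %in% pool)
  data.frame(country = countries, flags, stringsAsFactors = FALSE)
}

# Hand-typed subsets for illustration only.
sources <- list(
  dedicated = c("Italy"),
  JHU = c("China", "France"),
  Google = c("China", "France", "Portugal")
)
survey <- survey_sources(sources)
survey
```

A country with TRUE in more than one column is one where the wrapper would have to choose a preferred source.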

Turning that around, having done the survey, a user could then call try_regional_data to get the regional data without having to think (too much) about where the data is coming from (though some source field should probably be populated).

All the names are just sketches, of course.

github-actions[bot] commented 2 years ago

This issue has been flagged as stale due to lack of activity

seabbs commented 2 years ago

Really like both these ideas Richard, especially the polling of meta-sources. I see the draft PR so will move over there for more comments!

epiforecasts / covidregionaldata

Country specific wrappers for JHU/Google data #406