dickoa / rhdx

R package to interact with the Humanitarian Data Exchange portal - http://dickoa.gitlab.io/rhdx/
Other
26 stars 6 forks source link

read HERA data from xls or ; delimited csv #8

Closed andysouth closed 3 years ago

andysouth commented 3 years ago

Hi Ahmadou,

I'm trying to read the HERA subnational covid data via rhdx. https://data.humdata.org/organization/hera-humanitarian-emergency-response-africa

The csv data are delimited by ;

I tried this which does find the data but ignores delim and reads to a single column

df1 <- search_datasets("hera", rows = 2) %>% 
  pluck(1) %>% ## select the first dataset
  get_resource(2) %>% ## 2nd resource is csv
  read_resource(delim=';') 

I also tried reading the xls but that seems to put the column names into the first row.

df1 <- search_datasets("hera", rows = 2) %>% 
  pluck(1) %>% ## select the first dataset
  get_resource(1) %>% ## 1st resource is xls
  read_resource() 

Thanks! Andy

dickoa commented 3 years ago

Hi Andy, I was also bitten by this one recently and I'm thinking about exposing some extra parameters for each reader for extra flexibility. I first wanted to check if HDX/HERA are willing to change the format metadata from CSV to CSV2. It's not pretty, but I used download_resource(folder = "folder_path") and used read_csv2 but we could support something more generic. Thanks

andysouth commented 3 years ago

OK thanks.

I think it might be useful to be able to cope with different delimiters if possible.

I wasn't sure why the delim=';' doesn't get passed onto read_csv in the ..., which from a quick look at your code is what I expected.

I'm currently trying reading in the xls and using this to correct.

  #some, but not all xls files, read rownames into first column
  if (df1[[1,1]] == 'ID')
  {
    #set column names from 1st row
    names(df1) <- as.character(df1[1,])
    #remove first row
    df1 <- df1[-1,]    
  }

There are some other inconsistencies between countries to patch too.

dickoa commented 3 years ago

Good suggestion, can you try the latest version I pushed. I think it allow more flexibility for parsing csv now.

library(rhdx)
library(tidyverse)

pull_dataset("mauritania_covid19_subnational") %>%
  get_resources(format = "csv") %>%
  pluck(1) %>%
  read_resource(delim = ";", locale = locale(decimal_mark = ","))

## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   ID = col_double(),
##   DATE = col_character(),
##   ISO_3 = col_character(),
##   PAYS = col_character(),
##   ID_PAYS = col_double(),
##   REGION = col_character(),
##   ID_REGION = col_double(),
##   CONTAMINES = col_double(),
##   DECES = col_double(),
##   GUERIS = col_double(),
##   CONTAMINES_FEMME = col_double(),
##   CONTAMINES_HOMME = col_double(),
##   CONTAMINES_GENRE_NON_SPECIFIE = col_double(),
##   SOURCE = col_character()
## )

## # A tibble: 3,850 x 14
##       ID DATE  ISO_3 PAYS  ID_PAYS REGION ID_REGION CONTAMINES
##    <dbl> <chr> <chr> <chr>   <dbl> <chr>      <dbl>      <dbl>
##  1     1 13/0… MRT   Maur…       6 Adrar         59          0
##  2     2 13/0… MRT   Maur…       6 Assaba        60          0
##  3     3 13/0… MRT   Maur…       6 Brakna        61          0
##  4     4 13/0… MRT   Maur…       6 Dakhl…        62          0
##  5     5 13/0… MRT   Maur…       6 Gorgol        63          0
##  6     6 13/0… MRT   Maur…       6 Guidi…        64          0
##  7     7 13/0… MRT   Maur…       6 Hodh …        65          0
##  8     8 13/0… MRT   Maur…       6 Hodh …        66          0
##  9     9 13/0… MRT   Maur…       6 Inchi…        67          0
## 10    10 13/0… MRT   Maur…       6 Nouak…        68          1
## # … with 3,840 more rows, and 6 more variables: DECES <dbl>,
## #   GUERIS <dbl>, CONTAMINES_FEMME <dbl>,
## #   CONTAMINES_HOMME <dbl>, CONTAMINES_GENRE_NON_SPECIFIE <dbl>,
## #   SOURCE <chr>

I also added locale just in case it's pure csv2.

For xls file, let me have a look, I don't have this issue when I read it can you also re-check reading the xls file with the latest version. This is what I have

pull_dataset("mauritania_covid19_subnational") %>%
  get_resources(format = "xls") %>%
  pluck(1) %>%
  read_resource()
## Reading sheet:  Feuil1 
## # A tibble: 3,850 x 14
##       ID DATE                ISO_3 PAYS  ID_PAYS REGION ID_REGION
##    <dbl> <dttm>              <chr> <chr>   <dbl> <chr>      <dbl>
##  1     1 2020-03-13 00:00:00 MRT   Maur…       6 Adrar         59
##  2     2 2020-03-13 00:00:00 MRT   Maur…       6 Assaba        60
##  3     3 2020-03-13 00:00:00 MRT   Maur…       6 Brakna        61
##  4     4 2020-03-13 00:00:00 MRT   Maur…       6 Dakhl…        62
##  5     5 2020-03-13 00:00:00 MRT   Maur…       6 Gorgol        63
##  6     6 2020-03-13 00:00:00 MRT   Maur…       6 Guidi…        64
##  7     7 2020-03-13 00:00:00 MRT   Maur…       6 Hodh …        65
##  8     8 2020-03-13 00:00:00 MRT   Maur…       6 Hodh …        66
##  9     9 2020-03-13 00:00:00 MRT   Maur…       6 Inchi…        67
## 10    10 2020-03-13 00:00:00 MRT   Maur…       6 Nouak…        68
## # … with 3,840 more rows, and 7 more variables:
## #   CONTAMINES <dbl>, DECES <dbl>, GUERIS <dbl>,
## #   CONTAMINES_FEMME <dbl>, CONTAMINES_HOMME <dbl>,
## #   CONTAMINES_GENRE_NON_SPECIFIE <dbl>, SOURCE <chr>

Thanks

andysouth commented 3 years ago

Thanks, great csv is working well now for all countries, so I can do what I wanted to.

xls is strange, it works for e.g. mauritania (as you show) and senegal

but not e.g. for nigeria, ghana, niger

pull_dataset("ghana_covid19_subnational") %>%
     get_resources(format = "xls") %>%
     pluck(1) %>%
     read_resource()

Error: Not an HDX Resource object!

pull_dataset("niger_covid19_subnational") %>%
     get_resources(format = "xls") %>%
     pluck(1) %>%
     read_resource()

Error: Not an HDX Resource object!

dickoa commented 3 years ago

The issue is the fact they're not consistent, they share xls for some data and xlsx for others, skip a line in one and not other.

pull_dataset("niger_covid19_subnational") %>%
     get_resources()
##  [[1]]
## <HDX Resource> a02fdda1-2c3a-4458-ab24-e2b2de91d1f2 
##   Name: GHA_Subnational_Covid19_HERA.xlsx
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 239636
##   Format: XLSX

## [[2]]
## <HDX Resource> 1a9d2a9b-36a0-46bc-94df-b1a1d4f62bd5 
##   Name: GHA_Subnational_Covid19_HERA.csv
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 386326
##   Format: CSV

## [[3]]
## <HDX Resource> 632fbd66-e9a7-4f75-b8e4-c1074d3372fa 
##   Name: GHA_Subnational_Covid19_HXL_HERA.csv
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 386513
##   Format: CSV

I think HERA or HDX can fix this, it will be easier to have a single format i.e xlsx and not having to skip a line for some countries and not the others. These are good use case for pushing for more consistency in data format.

pull_dataset("ghana_covid19_subnational") %>%
     get_resources(format = "xlsx") %>%
     pluck(1) %>%
     read_resource(skip = 1L)

## Reading sheet:  Feuille 1 
## # A tibble: 4,641 x 14
##       ID DATE  ISO_3 PAYS  ID_PAYS REGION ID_REGION CONTAMINES
##    <dbl> <chr> <chr> <chr>   <dbl> <chr>      <dbl>      <dbl>
##  1     1 12/0… GHA   Ghana       4 Ashan…        26          0
##  2     2 12/0… GHA   Ghana       4 Brong…        27          0
##  3     3 12/0… GHA   Ghana       4 Centr…        28          0
##  4     4 12/0… GHA   Ghana       4 Upper…        29          0
##  5     5 12/0… GHA   Ghana       4 Great…        30          2
##  6     6 12/0… GHA   Ghana       4 North…        31          0
##  7     7 12/0… GHA   Ghana       4 Weste…        32          0
##  8     8 12/0… GHA   Ghana       4 Easte…        33          0
##  9     9 12/0… GHA   Ghana       4 Upper…        34          0
## 10    10 12/0… GHA   Ghana       4 Volta         35          0
## # … with 4,631 more rows, and 6 more variables: DECES <dbl>,
## #   GUERIS <dbl>, CONTAMINES_FEMME <chr>,
## #   CONTAMINES_HOMME <chr>, CONTAMINES_GENRE_NON_SPECIFIE <dbl>,
## #   SOURCE <chr>
## Warning messages:
## 1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  :
##   Expecting numeric in I1445 / R1445C9: got 'Null'
## 2: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  :
##   Expecting numeric in I1446 / R1446C9: got 'Null'
andysouth commented 3 years ago

Aha thanks, I hadn't noticed the xls, xlsx thing. I'll feed this back to HERA. I'll close this issue now.