afrimapr / afrihealthsites

access to geographic locations of African health sites from different sources
https://wellcomeopenresearch.org/articles/5-157
GNU General Public License v3.0

[Existing data] KEMRI/WHO import differs from original spreadsheet #5

Closed by anelda 3 years ago

anelda commented 4 years ago

Which dataset: KEMRI/WHO

Short description of the error or suggestion: When I import the original spreadsheet with read_excel and filter Country for 'South Africa', there are 4303 observations, but when I import the same dataset via afrihealthsites I find 4288 observations.

> ah_kemri_who_tb <- afrihealthsites(country="south africa", datasource = "who")
> dim(ah_kemri_who_tb)
[1] 4288    8
> kemri_excel <- read_excel('data/raw_data/who-cds-gmp-2019-01-eng.xlsx') %>% filter(Country == 'South Africa')

> dim(kemri_excel)
[1] 4303    8

Suggested actions

I'm trying to figure out what is going on and will report back here.

anelda commented 4 years ago

The following facilities are missing from the afrihealthsites sf import because they have no coordinates in the original Excel spreadsheet:

missing_data_from_afrihealthsites_kemri.xlsx
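A minimal sketch of how a row-count gap like this can be checked against missing coordinates. It uses a small made-up data frame rather than the real spreadsheet, and the column names `Lat` and `Long` are assumptions about how the coordinate columns are labelled.

```r
# Toy stand-in for the spreadsheet (hypothetical rows; Lat/Long names assumed).
facilities <- data.frame(
  Country  = c("South Africa", "South Africa", "South Africa", "Kenya"),
  Facility = c("Clinic A", "Clinic B", "Hospital C", "Clinic D"),
  Lat      = c(-33.9, NA, -26.1, NA),
  Long     = c(18.8, NA, NA, 36.8)
)

# Keep one country, as in the filter() call above.
zaf <- facilities[facilities$Country == "South Africa", ]

# A row is unusable for mapping if either coordinate is missing.
no_coords <- is.na(zaf$Lat) | is.na(zaf$Long)

n_total   <- nrow(zaf)        # rows in the raw table
n_missing <- sum(no_coords)   # rows that a spatial import would drop
n_spatial <- n_total - n_missing

# The rows that would be lost, for inspection or export.
zaf[no_coords, ]
```

If `n_missing` equals the difference between the two `dim()` results, missing coordinates fully explain the gap.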

andysouth commented 4 years ago

Thank you @anelda

You are right, facilities without coordinates are missing from the stored data. This is related to #4: the data are currently stored in the package as an sf object, which cannot hold items without coordinates. I'll look into storing the data as a dataframe instead.

For reference the reproducible code to download and store the data is in the data-raw folder of the package here 👍 https://github.com/afrimapr/afrihealthsites/blob/6432b5ac9aa49a9ec802c6926cb015cb653994be/data-raw/sf_who_sites.R

anelda commented 4 years ago

Thanks! It makes sense that an SF object will not contain observations without coordinates. I notice that the KEMRI data contains 2350 observations without coordinate details.

Maybe it makes more sense for afrihealthsites to import by default as datatable and have a function to convert to sf with very clear indication of the obs that are lost in the conversion? People may want to do non-map related analysis? Or combine with other table-like datasets?
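One way the suggested conversion step could look, as a sketch only. `split_by_coords` is a hypothetical helper, not a function in afrihealthsites, and the `Long`/`Lat` column names are assumed. The helper reports how many rows would be lost and returns both subsets, so the rows without coordinates stay available for non-spatial analysis. The final step uses `sf::st_as_sf()` only if the sf package is installed.

```r
# Hypothetical helper (not part of afrihealthsites): separate rows with and
# without coordinates and say clearly how many would be lost.
split_by_coords <- function(df, lon = "Long", lat = "Lat") {
  has_coords <- !is.na(df[[lon]]) & !is.na(df[[lat]])
  n_lost <- sum(!has_coords)
  if (n_lost > 0) {
    message(n_lost, " of ", nrow(df),
            " rows have no coordinates and would be dropped by an sf conversion")
  }
  list(
    with_coords    = df[has_coords, , drop = FALSE],
    without_coords = df[!has_coords, , drop = FALSE]
  )
}

# Made-up example data.
sites <- data.frame(
  name = c("a", "b", "c"),
  Long = c(18.8, NA, 28.2),
  Lat  = c(-33.9, NA, -26.2)
)

parts <- split_by_coords(sites)

# Convert only the complete rows; WGS 84 (EPSG:4326) is assumed here.
if (requireNamespace("sf", quietly = TRUE)) {
  sites_sf <- sf::st_as_sf(parts$with_coords, coords = c("Long", "Lat"), crs = 4326)
}
```

Returning both subsets, rather than silently filtering, keeps the table-like use case and the mapping use case on the same data.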

I'm wondering if there's an opportunity here to help people to improve the data and push back to healthsites.io or other sources from the package?

andysouth commented 4 years ago

Can you check that this now does what you would expect?

# to return raw dataframe for WHO data including any rows with no coordinates
dfzaf <- afrihealthsites("south africa", datasource='who', plot=FALSE, returnclass='dataframe')

I have kept the default to return as sf because mostly we are interested in doing spatial things.

Also it gets a bit tricky because other sources, e.g. healthsites.io from rhealthsites, are already sf.

We can revisit if needed.

anelda commented 4 years ago

This is perfect! Thanks!

I also ran this on the healthsites.io data, but it returns a geometry column instead of two columns for lat and long:

> dfzaf_healthsites <- afrihealthsites("south africa", datasource='healthsites', plot=FALSE, returnclass='dataframe')
> select(dfzaf_healthsites, geometry)
Simple feature collection with 2064 features and 0 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 17.06561 ymin: -34.59043 xmax: 32.75507 ymax: -22.34141
geographic CRS: WGS 84
# A tibble: 2,064 x 1
               geometry
            <POINT [°]>
 1 (18.84201 -33.97814)
 2 (28.15224 -26.16084)
 3 (27.92697 -26.10441)
 4 (31.03788 -23.92632)
 5 (18.50614 -33.86543)
 6   (28.267 -25.76763)
 7  (25.1132 -30.71224)
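If separate longitude and latitude columns are needed from an sf result like the one above, they can be extracted from the geometry. The following is a sketch on a two-point toy object, not on the real healthsites.io data. It assumes the sf package is installed; the output column names `long` and `lat` are arbitrary choices.

```r
library(sf)

# Toy stand-in for an sf result: two POINT geometries in WGS 84 (lon, lat order).
pts <- st_sf(
  name = c("site_a", "site_b"),
  geometry = st_sfc(
    st_point(c(18.84201, -33.97814)),
    st_point(c(28.15224, -26.16084)),
    crs = 4326
  )
)

# st_coordinates() returns a matrix with columns X (longitude) and Y (latitude).
xy <- st_coordinates(pts)

# st_drop_geometry() gives back a plain data frame without the geometry column.
flat <- st_drop_geometry(pts)
flat$long <- unname(xy[, "X"])
flat$lat  <- unname(xy[, "Y"])
```

`flat` is then an ordinary data frame that can be joined or row-bound with other tabular sources.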

anelda commented 4 years ago

Also it gets a bit tricky because other sources, e.g. healthsites.io from rhealthsites, are already sf.

I suppose that's because healthsites.io provide their data as a shapefile, which means they will only provide data that definitely have lat/long information?

anelda commented 4 years ago

This was fixed by implementing the option to import the data as a dataframe. We can probably close this issue.