Below we outline instructions for obtaining your data directly from your Go.Data instance's application programming interface (API) using R, for further data cleaning and analysis. This process will provide you with cleaned, flattened .csv files and sample dashboard outputs that you can adapt for your purposes. Although there are multiple ways to retrieve the data collections, including installing and connecting directly to the MongoDB database on your machine, this SOP outlines how to do this simply using the open-source software R, and advanced R skills are not required.
In order for the scripts to work, it is essential that you have the same folder hierarchy and contents. Your folder directory should include the items sketched below.
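Based on the files and folders referenced throughout this SOP, the hierarchy looks roughly like this (the top-level folder name is illustrative; check the GitHub repository for the exact, up-to-date contents):

godata/
├── godata.Rproj
├── report_sources/      # 00_set_credentials.R, 01_data_import_api.R, 02_clean_data_api.R, 03_daily_summary_dashboard.Rmd, ...
├── data/
│   ├── raw/             # .rds copies exactly as retrieved from the API
│   └── clean/           # cleaned .csv and .rds files
└── report_outputs/      # rendered dashboards and graphics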
Please check that your folder contents are up to date with what is on GitHub. If you are able to connect dynamically to GitHub from R via Git to pull the most recent version, this is preferred. If you do not feel comfortable with this, you can simply copy/paste the script contents into your local folder hierarchy. The rationale behind this folder hierarchy and set-up was borrowed from RECON's report factory templates repository (https://github.com/reconhub/report_factories_templates) and has been simplified for our purposes.
Open the R project by double-clicking on godata.Rproj.
Navigate to 00_set_credentials.R in the report_sources folder and click to open it in your R console.
Where indicated in the script, fill in the appropriate URL, your Go.Data username and password, and the outbreak_id of interest (see the illustrative sketch after the tip below).
TIP: To obtain your outbreak ID, navigate to View Outbreak in Go.Data and you will find it in the URL. You can only extract data from one outbreak at a time; before running, ensure this is your active outbreak in the platform.
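The exact contents of 00_set_credentials.R may differ slightly from version to version, but the values you fill in look roughly like this (all values shown here are placeholders):

# Placeholders only - replace with your own instance details
url         <- "https://your-godata-instance.com/"    # Go.Data instance URL
username    <- "myname@email.com"                     # Go.Data login (email)
password    <- "mypassword"                           # Go.Data password
outbreak_id <- "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # active outbreak ID, from the View Outbreak URL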
You then install godataR by running:

# Install package
devtools::install_github("WorldHealthOrganization/godataR")
The next script imports data from your Go.Data API into your R environment.
Navigate to 01_data_import_api.R in the report_sources folder and click to open it in your R console.
Run the script by clicking "Source".
Once the script has successfully completed, you should have created several data frames in your R global environment that will be used in the subsequent cleaning scripts.
NOTE: please switch your language to English in your Go.Data instance before running this API script, to ensure core data elements are all brought back in a consistent form.
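For orientation, the import script uses godataR's collection-specific functions; a minimal sketch is shown below (argument names may differ between godataR versions, so verify against the package help, e.g. ?get_cases, for your installed version):

# Minimal sketch of an API import with godataR (not the full 01_data_import_api.R);
# verify function and argument names against your installed godataR version
library(godataR)

cases    <- get_cases(url = url, username = username,
                      password = password, outbreak = outbreak_id)
contacts <- get_contacts(url = url, username = username,
                         password = password, outbreak = outbreak_id)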
The data frames as retrieved straight from the API can contain nested arrays stored as lists, for fields that can have multiple responses per case or contact (e.g. more than one address can be registered if a person has moved; repeat hospitalizations can be recorded; followUp history is stored). The cleaning script properly un-nests the relevant fields and does some basic data manipulation of these data frames before exporting to .csv (or prepping for additional analysis in R), as sketched below.
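As an illustration of the general approach (a minimal sketch, not the exact logic of 02_clean_data_api.R), a nested list field such as addresses can be flattened into its own long data frame and later joined back by the case id:

# Sketch: flatten the nested addresses list-column into one row per address per case
library(dplyr)
library(tidyr)

cases_addresses <- cases %>%
  select(id, addresses) %>%            # keep the uuid for joining back later
  unnest(addresses, names_sep = "_")   # e.g. addresses_typeId, addresses_locationId, ...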
Navigate to 02_clean_data_api.R (also in the report_sources folder).
Run the script by clicking "Source".
This will result in cleaned .csv files in the data "clean" folder, with filenames matching the pattern below, updated each time you run the script to contain the most recent data.
You will also have .rds files in the data "clean" folder (i.e. contacts_clean.rds, cases_clean.rds). This condensed format will be used for the subsequent R dashboard scripts, since it is more performant and preserves language characters better.
For good measure, you will also have .rds files in the data "raw" folder (i.e. contacts.rds, cases.rds) that mirror the data exactly as it was retrieved from the API, should you need it for verification or further use.
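For reference, the exports at the end of the cleaning script follow this general pattern (the paths and file name pattern shown here are illustrative; your folder names may differ):

# Sketch of the export step (illustrative paths and names)
library(readr)

write_csv(cases_clean, file.path("data", "clean", paste0("cases_clean_", Sys.Date(), ".csv")))
saveRDS(cases_clean, file.path("data", "clean", "cases_clean.rds"))
saveRDS(cases, file.path("data", "raw", "cases.rds"))   # raw copy, exactly as retrieved from the API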
NOTE: These cleaning scripts focus on the CORE data variables and not custom questionnaire variables, as questionnaires are configurable for each country or institution deploying Go.Data. No core data elements (those living outside of questionnaires) should need updating in terms of coding; however, if you would like to pull in additional questionnaire data elements, you may need to slightly modify this script to accommodate the extra fields. Additionally, your location hierarchy or team structure may vary in your deployment setting (e.g. a supervisor registered at a different admin level), so changes may need to be made to the location cleaning scripts. Please see the section Further tips on data extraction/cleaning from API at the bottom of this SOP for more details.
The cleaned datasets will now be much easier to use for additional analysis, whether inside or outside of R.
We have created some sample scripts to get you started with basic dashboard analyses (see, for example, 03_daily_summary_dashboard.Rmd for a ready-made HTML dashboard that provides statistics on a range of operational metrics to be monitored by supervisors and contact tracers on a daily basis).
Screenshots below show some of these graphics, such as contact follow-up status by a given admin level. These outputs will be written to the report_outputs folder.
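If you prefer to regenerate the dashboard from the console rather than clicking "Knit" in RStudio, something like the following works (the paths are assumptions based on the folder names in this SOP; adjust to your structure):

# Render the sample dashboard to the report_outputs folder (paths assumed)
rmarkdown::render("report_sources/03_daily_summary_dashboard.Rmd",
                  output_dir = "report_outputs")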
For this template starter cleaning script, questionnaireAnswers are removed from the data frames: we only un-nest and clean the list fields that every Go.Data project will have (like addresses or dateRanges) in separate data frames, and then join them to the cleaned case data frame.

cases_clean <- cases %>%
  filter(deleted == FALSE | is.na(deleted)) %>%   # Remove all deleted cases
  select_if(negate(is.list)) %>%                  # Take out all list fields, since un-nesting causes duplicate rows if >1 response per case/contact
  select(-contains("questionnaireAnswers"))       # Take out all non-core (questionnaire) variables, for the same reason: they would need un-nesting
However, we know projects will still want to extract their questionnaire data, and that is easy! To do so, specify exactly which variables you need and un-nest them, as shown below. NOTE: It is better to do this as a separate data frame and then join it to the core case variables, to avoid cases being duplicated or dropped from the clean case linelist, as shown below. When a questionnaire has not been filled in for a given case, that case does not have the variable in its JSON and thus will not appear in the questionnaire data frame.
### EXAMPLE - retrieving and unnesting Go.Data Questionnaire variables

coltoretrieve = colnames(cases)[grep('questionnaire', colnames(cases))]  # get the questionnaire columns
questionnaire.list = cases[coltoretrieve]

cases_questionnaire_unnest <- cases %>%
  select(id,                                  # get the uuid for the later join!
         all_of(coltoretrieve)) %>%           # questionnaire columns
  mutate_if(is.list, simplify_all) %>%
  unnest(all_of(colnames(questionnaire.list)),
         names_sep = ".")
# Cases that had no questionnaire filled in will not appear after un-nesting,
# which is why we keep the uuid: it lets us join the questionnaire fields you need back per case, ensuring no duplication
cases_clean <- cases %>%
  filter(deleted == FALSE | is.na(deleted)) %>%
  select_if(negate(is.list)) %>%                          # Remove all listed/nested fields from the overall cases data frame
  select(-contains("questionnaireAnswers")) %>%           # Remove all questionnaire vars from the overall cases data frame
  left_join(cases_questionnaire_unnest, by = "id") %>%    # Join back in the flattened, de-duplicated questionnaire fields from above, using the case id
  rename_at(vars(starts_with("questionnaireAnswers")),    # Rename so it is easier to read
            ~ str_replace(., "questionnaireAnswers", "Q"))
Say there is a variable signs_and_symptoms that is multi-select. When un-nested, it will be in list form. You can use unnest_wider to separate the elements (symptoms) of the list into separate columns, then pivot to get the data into a more workable format.
symptoms <- cases_questionnaire_unnest %>%
  select(id, questionnaireAnswers.signs_and_symptoms.value) %>%
  unnest_wider(questionnaireAnswers.signs_and_symptoms.value,
               names_sep = "_") %>%             # one column per selected symptom
  pivot_longer(-id,
               names_to = "reported",
               values_to = "symptom",
               values_drop_na = TRUE) %>%       # long format: one row per reported symptom per case
  mutate(reported = case_when(!is.na(symptom) ~ TRUE,
                              TRUE ~ FALSE)) %>%  # flag each retained symptom as reported
  pivot_wider(names_from = "symptom",
              values_from = "reported")         # wide format: one TRUE/NA column per symptom
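The resulting symptoms data frame has one row per case (id) and one logical column per symptom. As a quick usage example (column names will depend on your questionnaire's answer options), you can tally how often each symptom was reported:

# Count how many cases reported each symptom (NA = not reported)
library(dplyr)

symptoms %>%
  summarise(across(-id, ~ sum(.x, na.rm = TRUE)))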