IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Improve Import from ODK #3754

Open dannyparsons opened 7 years ago

dannyparsons commented 7 years ago

Moved from #1728. I have assigned this to Version 0.4.1, but it could move to 0.4.2 once we add that milestone.

Here is information from Sebastian on possible improvements. He will be with me at a (climatic) workshop in mid-September. But I have already been asked for help on ODK in R-Instat by Lesotho Met. So it could be good to include some enhancements by then if possible. It relates to Stats4SD interests.

Here is his message, sent on 24 July 2017: "Roger just told me you are working on adding functions to bring ODK collected data into R-Instat. I sent the attached functions to read and label ODK data into R to Dani around a year ago, but the attached version is more robust.

The main function is odkFormat. It sets the variable format based on the XLSForm type and creates a labeled version of the data set and a list with all codes and labels. Help functions: revarnames (brings the XLSForm into a format odkFormat can use) and readonadata (simply reads in the data).

At the moment, you need the XLSForm and the data in your working folder. The obvious next steps if you decide to use these functions would be:
• Pull the data and form from the ODK server (I understand this functionality is already there in R-Instat?)
• Use the codebook to label the data in R-Instat

These functions are probably not written or documented the way a programmer would do it, so feel free to ask me if you need any guidance using them. The big advantage is that I have used and tested them on a number of data sets, so they run pretty robustly."

https://github.com/africanmathsinitiative/R-Instat/files/1193510/readodk.zip

rdstern commented 6 years ago

I contacted Sebastian, who is leaving and has been working on this facility in R, though not, of course, in R-Instat. Here is his reply. It mentions the relatively new expss package, which might also be useful for us more generally: for labelled data, producing tables, and multiple responses.

Hi Roger,

Sure, I just uploaded the script to the Stats4SD github. I’ll try to add a short working example in the next two weeks.

Have a look at the comments and documentation at the beginning of the odkFormat function.

The main function is odkFormat(): It uses the XLSForm and the data frame (excel or csv) and adds variable and value labels to the data frame based on the XLSForm. The variable labels come from the name column in the survey sheet, and the value labels from the choices sheet. Output is a labelled data frame, and also the form elements and a codebook as a list if you want it.
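To make the labelling step concrete, here is a minimal base R sketch of the general idea, not the odkFormat() code itself. The toy survey and choices data frames, the helper name label_from_xlsform, and the attribute names "label" and "labels" are all assumptions made for illustration. It matches questions on the survey sheet's name column and, as a further assumption, takes the label text from a label column.

```r
# Toy stand-ins for the two XLSForm sheets (hypothetical content).
survey <- data.frame(
  type  = c("integer", "select_one gender_list"),
  name  = c("age", "gender"),
  label = c("Age in years", "Gender of respondent"),
  stringsAsFactors = FALSE
)
choices <- data.frame(
  list_name = c("gender_list", "gender_list"),
  name      = c(1, 2),
  label     = c("Male", "Female"),
  stringsAsFactors = FALSE
)
# Toy data export with coded answers (hypothetical).
dat <- data.frame(age = c(25, 41, 33), gender = c(1, 2, 1))

# Hypothetical helper, NOT the odkFormat() from the script.
label_from_xlsform <- function(dat, survey, choices) {
  for (i in seq_len(nrow(survey))) {
    var <- survey$name[i]
    if (!var %in% names(dat)) next
    # Variable label: stored as a plain "label" attribute.
    attr(dat[[var]], "label") <- survey$label[i]
    # Value labels: only for select_one questions.
    if (startsWith(survey$type[i], "select_one ")) {
      list_id <- sub("^select_one\\s+", "", survey$type[i])
      ch <- choices[choices$list_name == list_id, ]
      attr(dat[[var]], "labels") <- setNames(ch$name, ch$label)
    }
  }
  dat
}

labelled <- label_from_xlsform(dat, survey, choices)
# A simple codebook: one entry of attributes per variable.
codebook <- lapply(labelled, attributes)
str(codebook)
```

The sketch ignores select_multiple questions and group prefixes in column names, both of which the message says the real functions handle.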

I’ve tested and improved odkFormat and the helper functions over the last two years and think they are fairly robust by now; for example, they handle ODK exports with and without group names attached, as well as multiple choice variables.

The variable and value labels come from the expss package and are attributes of the variables. I use the function ft(), also in the script, to get the value labels when running analysis. The labelled output data frame should generally be worked on with tidyverse functions as opposed to base R functions, since the latter don’t handle the attributes too well. For example, use dplyr::filter(data, age > 30) instead of data[data$age > 30, ]. The nice thing about having the value labels as attributes is that you can use the codes to filter, as in dplyr::filter(data, gender == 1), but the labels to display: table(ft(data$gender)) will show the table with the names “Male” and “Female” instead of the codes. If the label attributes cause trouble in any bits of your code further on, you can always remove them with an unlab() call.
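The point about attributes can be seen in a small base R sketch. It uses a plain "labels" attribute rather than expss, and the helper as_labelled_factor is a hypothetical stand-in for ft(), whose actual implementation is not shown in this thread. The final step only imitates what an unlab() call is described as doing.

```r
# A coded variable with value labels stored as a plain attribute
# (attribute name "labels" is an assumption for this sketch).
gender <- c(1, 2, 1, 1)
attr(gender, "labels") <- c(Male = 1, Female = 2)

# Hypothetical stand-in for ft(): turn codes into a labelled factor.
as_labelled_factor <- function(x) {
  labs <- attr(x, "labels")
  factor(x, levels = unname(labs), labels = names(labs))
}

# Filter on the codes ...
is_male <- gender == 1

# ... but display with the labels.
counts <- table(as_labelled_factor(gender))
print(counts)

# Base subsetting with [ keeps only names, so the value labels are lost.
subset_gender <- gender[is_male]
print(is.null(attr(subset_gender, "labels")))

# Stripping every attribute, similar in spirit to an unlab() call.
plain <- gender
attributes(plain) <- NULL
```

Losing the attribute on subsetting is the behaviour that motivates the advice above to prefer functions that carry the labels along.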

The main piece of functionality that is lacking is pulling the data and form directly from kobo or ona. At the moment you have to download the form and data file and read them in locally. I’ve tried this a few times but failed to use the platform APIs in the right way. If you plan to do this, it might make sense to use the JSON or XML representation as opposed to the csv/excel files. Another nice-to-have would be to re-script it all into tidyverse syntax, although it might be easier to just leave it as it is. It’s not the most elegant code (there is a fairly clunky loop in the revarnames helper function), but then again I’ve tested it on at least a dozen forms and data sets, so it works quite well and I believe it does what it’s supposed to do.

Let me know in case you have any questions. Happy to explain in a call or in the office.

Cheers,

Sebastian