IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Analysing a survey data set that may have many variables. Data from Pakistan and Kenya #6259

Closed rdstern closed 3 years ago

rdstern commented 3 years ago

A dataset is available from a reasonably large survey. It is originally in SPSS and has 330 variables and about 53,000 cases. We should (eventually) be able to process the data easily in R-Instat. Currently it seems awkward. What can we do about it?

It is also useful to investigate a dataset from a similar type of survey from Kenya!

These are both examples of MICS surveys from UNICEF. The site is here.

You will need to register to download the data. Please register and say you want the data from Kakamega, Kenya, to be able to see how to repeat the SPSS analyses using R. Do this as soon as possible, because they say permission may take up to 3 days.

Then select Kenya and download the data from the 2014 Kakamega survey. Also download the results and reports. You can do that immediately.

Also download the survey from Pakistan - Punjab 2017/8.

The Punjab data are used in Pakistan (with SPSS) in practicals to engage students in the learning. I would like us to use these data for two purposes: possibly (later) to help in teaching, and to evaluate how well R-Instat can cope with a large dataset.

We will discuss particular tasks for the interns below. I assume the Kenya surveys could be useful for teaching and they import easily.

rdstern commented 3 years ago

We are working with a Professor of statistics from Pakistan. She uses data from a MICS survey to enrich her teaching of statistics. Similar data are available for many countries.

Perhaps, then, many teachers of statistics could make use of MICS surveys similarly? How should they do this? There are many components to the answer. In developing R-Instat we are very keen to promote better teaching of statistics. Statistics lecturers (in Africa) have a very heavy teaching load. So one idea is to make it as easy as possible to improve the teaching. To do this we need teachers to be able, very quickly, to gather the teaching materials and share them with their students.

The examples of data I suggest we use are the recent MICS survey from Punjab, Pakistan, as an example that can be discussed with their Professor and one (or more) of the recent surveys from Kenya. I consider the data from Kakamega as an example, but two other counties are alternatives.

What are the materials?
a) I was able to download the reports easily.
b) After registering I could download the data.
c) I found the data are in SPSS files. They read easily into R and hence R-Instat.
d) They consist of multiple files. I have initially looked at one called hh, which I presume is the household level file, though there is another file called hl?
e) I found that each variable has a (sometimes long) label, which is very helpful.
f) I have not yet found out about each file, nor about the design and data collection. I assume there is information about them on the site.
g) Among the reasons this looks like a good example is that the topic is of broad interest and easy to follow. There are examples from many countries, so lecturers have a good chance of choosing a local example. The details and data are easily available. The data look pretty "tidy", so this sets a good example.
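As a rough illustration of why the variable labels in (e) matter when a file has hundreds of variables, here is a minimal Python sketch of a keyword search over a name-to-label mapping. Everything in it is an assumption for illustration: the function `find_variables`, the variable names and the labels are made up and are not taken from the MICS files, and the step of reading the labels out of the SPSS file is not shown.

```python
def find_variables(labels, keyword):
    """Return (name, label) pairs whose name or label contains keyword.

    The match is case-insensitive. `labels` maps variable name -> label.
    """
    needle = keyword.lower()
    return [
        (name, label)
        for name, label in labels.items()
        if needle in name.lower() or needle in label.lower()
    ]


# Hypothetical names and labels, invented for this sketch only.
example_labels = {
    "v001": "Cluster number",
    "v002": "Household number",
    "v010": "Main source of drinking water",
    "v011": "Time to fetch drinking water",
    "v020": "Type of toilet facility",
}

# List every variable that mentions "water" in its name or label.
matches = find_variables(example_labels, "water")
for name, label in matches:
    print(f"{name}: {label}")
```

A search like this is one way a user might narrow 330 variables down to the handful relevant to a practical, before selecting them for tables or graphs.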

So your first deliverable is to make this case and then to explain how access to this information should be provided for an R-Instat user.

(Please note that in our promotion of R-Instat we are not trying to stop SPSS users. Our interest is in good statistical practices and good and effective teaching. R-Instat, like SPSS, is simply a tool to help. We believe in "options by context". So if people are happy with SPSS, then we would like to encourage them to be even happier. For some people, adding R might make them even happier, and others might like R-Instat.)

rdstern commented 3 years ago

Another use of these datasets is to test existing and new routines in R-Instat. In particular, the analysis of these data (almost all categorical) is via tables and graphs. That's the Describe menu in R-Instat. I use the Experimental Survey data set a lot (as a small, convenient set). It is good to have a much larger survey, and these are fine.
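To make "tables for categorical data" concrete, here is a minimal Python sketch of a one-way frequency table, a two-way table and row percentages. It is not how R-Instat produces its tables. The function names, the variables `area` and `water`, and the eight records are all invented for illustration, and the counts are unweighted, whereas a real survey analysis may need the survey weights.

```python
from collections import Counter


def one_way_table(records, var):
    """Count how often each category of `var` occurs."""
    return Counter(r[var] for r in records)


def two_way_table(records, row_var, col_var):
    """Count each (row category, column category) combination."""
    return Counter((r[row_var], r[col_var]) for r in records)


def row_percentages(table):
    """Convert two-way counts into percentages of each row total."""
    row_totals = Counter()
    for (row, _col), n in table.items():
        row_totals[row] += n
    return {
        (row, col): 100.0 * n / row_totals[row]
        for (row, col), n in table.items()
    }


# Made-up records for this sketch only; not real survey data.
records = [
    {"area": "Urban", "water": "Piped"},
    {"area": "Urban", "water": "Piped"},
    {"area": "Urban", "water": "Well"},
    {"area": "Rural", "water": "Well"},
    {"area": "Rural", "water": "Well"},
    {"area": "Rural", "water": "Piped"},
    {"area": "Rural", "water": "Well"},
    {"area": "Rural", "water": "Piped"},
]

area_counts = one_way_table(records, "area")
area_by_water = two_way_table(records, "area", "water")
area_by_water_pct = row_percentages(area_by_water)

for (area, water), pct in sorted(area_by_water_pct.items()):
    print(f"{area:5s} {water:5s} {pct:5.1f}%")
```

Checking a routine against small hand-countable tables like these, and then against the large survey files, is one way the two kinds of dataset could complement each other.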

maxwellfundi commented 3 years ago

@jkmusyoka what's the situation with this? @rdstern any tasks ready for the interns on this yet?