IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

R-Instat with very wide (or large) files from Kenya #4832

Closed (rdstern closed this issue 1 year ago)

rdstern commented 6 years ago

Kenya National Bureau of Statistics (KNBS) has some open data files that we may wish to use in a proposed project. One of these is the 2014 KDHS (Kenya Demographic and Health Survey). These data may well be available from an alternative site also.

From the KNBS site I was easily able to get permission and to download the files for these data. They are SPSS files. (Other open data are Stata files.) I get the impression that these files are only offered in one format.

I downloaded the 7 files from this survey. Some import into R-Instat easily, but for others I am told that I don't have sufficient memory on my computer. I don't yet know whether the problem is width or volume. I assume it is not length, because the file with the largest number of records is the household members file (KEPR), with 153,840 records and 383 variables, and this imports fine. The household file (KEHR) does not import (the message says there is not enough memory). It has 36,430 records and 2,439 variables. Neither does the individual file (KEIR), with 31,079 records and 4,769 variables.
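As a rough illustration of why width and length together (the total number of cells) might matter more than either alone, here is a back-of-envelope sketch in Python. It assumes every value is held as an 8-byte number, which is an assumption and not a measurement of what R or R-Instat actually does. Real usage during import can be higher, for example because of temporary copies. The record and variable counts are the ones quoted above, including KEKR from later in this comment.

```python
# Rough lower bound on in-memory size: records x variables x 8 bytes.
# ASSUMPTION: every value is stored as an 8-byte number. Labelled or
# character columns may take more or less than this.
BYTES_PER_VALUE = 8

# (records, variables) as reported in this comment.
files = {
    "KEPR": (153_840, 383),
    "KEHR": (36_430, 2_439),
    "KEIR": (31_079, 4_769),
    "KEKR": (20_964, 1_099),
}


def estimate_mib(records, variables, bytes_per_value=BYTES_PER_VALUE):
    """Return the estimated size in MiB of a records x variables table."""
    return records * variables * bytes_per_value / 2**20


estimates = {name: estimate_mib(r, v) for name, (r, v) in files.items()}

# Print the files from smallest to largest estimated size.
for name, mib in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{name}: about {mib:,.0f} MiB")
```

On this crude measure the two files that failed to import (KEHR and KEIR) have more cells than KEPR, even though KEPR has by far the most records.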

The set of 7 files is not trivial to download in Kenya. They total 500+ MB. I will check further in the UK, looking initially for machines with more memory. How we will process files with 2,500 variables is another matter!

I also noticed that one file that did import (KEKR), with 20,964 records and 1,099 variables, did seem to slow my use of R-Instat considerably. I was analysing other data frames (not this one), but refreshing the grid (or something similar) regularly seemed to take 20-30 seconds. David says this is a known problem?

This is just for the record at this stage.

dannyparsons commented 6 years ago

The first thing to check for the memory issue is importing into RStudio and seeing if you get the same memory problems.

rdstern commented 6 years ago

I tried with one example and get the same message in RStudio.

dannyparsons commented 6 years ago

Then that is just an issue of memory usage in R. There are ways you can increase the limit in R but of course it's always still limited by your computer.

rdstern commented 3 years ago

I have now looked at these files again, with the recent 64-bit installation of R-Instat. The wide data are each from SPSS files and (now) read easily. It is impressive that they import with variable and value labels. I first imported the household file (KEHR), which has 36,430 records and 2,439 variables. It is close to 100 MB in SPSS. Then I got greedy and added the second file (KEIR), with 31,079 records and 4,769 variables. This is 160 MB. I have now saved them together as an rds file of about 30+ MB, which takes just over 30 seconds to read into R-Instat.

These will be severe tests for our dialogues. I tried with the smaller file and it was a reasonably acceptable 4-5 seconds to populate the data selector.

I also looked (now) in more detail at the files. In each case the reason they are so wide (so many variables) is (I think) that they are actually at 2 levels, but stuck together. So KEHR has many variables in blocks of 23, which may be the maximum number of individuals per household. There are about 220 variables at the household level, and then a lot of these blocks. I suggest getting these files into a better shape is another useful aspect of a statistician's need for data wrangling!

Similarly KEIR has a lot of blocks of 20 and also many of size 6.
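To make "2 levels, but stuck together" concrete, here is a minimal Python sketch of splitting one wide record into a household-level record and one row per member. The variable names (`hh_id`, `region`, `age_01`, ...) and the `_NN` suffix convention are hypothetical, chosen only for illustration. The real names in KEHR and KEIR would need to be checked before anything like this is applied.

```python
import re

# HYPOTHETICAL wide household record: household-level variables plus
# member-level variables repeated once per member slot (_01, _02, ...).
wide_row = {
    "hh_id": 101,
    "region": "Nairobi",
    "age_01": 34, "sex_01": "F",
    "age_02": 36, "sex_02": "M",
    "age_03": None, "sex_03": None,  # unused member slot
}

# ASSUMPTION: repeated variables end in an underscore and a 2-digit slot.
SLOT = re.compile(r"^(?P<stem>.+)_(?P<slot>\d{2})$")


def split_levels(row, id_var="hh_id"):
    """Split one wide row into (household dict, list of member dicts)."""
    household = {}
    members = {}
    for name, value in row.items():
        match = SLOT.match(name)
        if match is None:
            # No slot suffix, so treat it as a household-level variable.
            household[name] = value
        else:
            slot = int(match.group("slot"))
            members.setdefault(slot, {})[match.group("stem")] = value
    # Keep only slots that hold at least one non-missing value.
    member_rows = [
        {id_var: row[id_var], "slot": slot, **values}
        for slot, values in sorted(members.items())
        if any(v is not None for v in values.values())
    ]
    return household, member_rows


household, member_rows = split_levels(wide_row)
print(household)
for member in member_rows:
    print(member)
```

The result is two tidy tables linked by the household identifier: one row per household, and one row per household member.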

In each case the naming is logical, so select, and also various ways of being able to select subsets of variables in the selectors, will be very useful.
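As a sketch of what "selecting subsets of variables" by name could look like, the Python below counts how many times each variable stem repeats (to find the block sizes) and selects variables by a name pattern. The variable names are invented for illustration, and the `_NN` suffix convention is an assumption. This is not how the R-Instat selectors are implemented.

```python
import re
from collections import Counter

# HYPOTHETICAL variable names: 2 household-level variables, two stems
# repeated in blocks of 23, and one stem repeated in a block of 6.
names = (
    ["hh_id", "region"]
    + [f"{stem}_{i:02d}" for stem in ("age", "sex") for i in range(1, 24)]
    + [f"birth_{i:02d}" for i in range(1, 7)]
)

# ASSUMPTION: repeated variables end in an underscore and a 2-digit slot.
SLOT = re.compile(r"^(.+)_(\d{2})$")


def block_sizes(variable_names):
    """Count how many slots each repeated stem has."""
    counts = Counter()
    for name in variable_names:
        match = SLOT.match(name)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


def select(variable_names, pattern):
    """Return the names that match a regular expression."""
    rx = re.compile(pattern)
    return [name for name in variable_names if rx.search(name)]


sizes = block_sizes(names)                                # block size per stem
unblocked = [n for n in names if not SLOT.match(n)]       # top-level variables
first_slot = select(names, r"_01$")                       # first slot only

print(sizes)
print(unblocked)
print(first_slot)
```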

If these are needed I can supply them. I am not sure they should be publicly available on GitHub though.

Just for reference, if we want to experiment with a much simpler, but fairly wide data set, then in the library the DAAG package includes a file called rockArt that has 641 variables and is a more modest size to start with: just 103 rows.

rdstern commented 3 years ago

I suggest these data (and perhaps others from KNBS) be used by our documentation team as sensible case-study materials. They should report on what they are able to do with these open data. The first step is to be able to locate and then download the data. Then they will need to be "wrangled" so that they are in a tidy shape. How could this be done?

They also need to be understood first and we should identify some problems to solve, so the data are tidied in ways that make this an easy task.

I hope that, like the UNICEF MICS data, this work might be in collaboration with the statisticians in a county office, or (alternatively) with some in a University, preferably both. So we combine the work with a few more people using the data.

I hope we could have a discussion on this work, perhaps next week (from 5 July 2021).

rdstern commented 3 years ago

@Wycklife please can you reply here that you have received this message? Later perhaps you can say whether there are now better examples of data to use?

Wycklife commented 3 years ago

@rdstern I received this message and I have been looking at Kenya survey data sets (Kakamega, Bungoma, and Turkana). Once I am done getting them into the right shape, I will communicate. Thank you.

rdstern commented 3 years ago

@Wycklife please note those are different to the data sets mentioned here. You mentioned you would like to multitask, so I thought you might like to look at the above examples as well.

I am also not quite sure what you mean by getting the MICS datasets "into the right shape". I assume they are SPSS files and so are reasonable when downloaded. I look forward to your description of actions. Please also remember you should be using this work both to examine the data and also to examine R-Instat.

Wycklife commented 3 years ago

@rdstern I am happy with this task. What I meant by "getting the data sets in the right format" was checking whether the datasets read into R-Instat with the labels, just as I have been doing with the Pakistan and other data sets from MICS. These data sets are also a bit large, and I thought including them with the ones mentioned above would be nice.

I am going through the task and will give my feedback later.