IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Adding IPUMS data into R-Instat partly for Zambia conference? #5617

Open rdstern opened 4 years ago

rdstern commented 4 years ago

Here is the IPUMS home page: image

They are at many ISI conferences, promoting the use of "their" data. In preparation for our possible attendance and papers at this conference, I downloaded the latest Zambia census, which is from 2010. It could be interesting to analyse in 2020, because that is presumably when they will do their next census, so showing some analyses of 2010 could start users discussing change.

Here is the download page for me: image

So, I downloaded 118 variables from the 10% sample and found it has 1.3 million records. Our "ordinary" File > Open Data file actually works on the downloaded dat file, but loads it into a single variable!
Alongside the dat file, I also downloaded the R file, which is very short, and the XML DDI file.
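The single variable is what you would expect if the dat file is fixed-width, with no delimiters between fields, which is how IPUMS extracts are normally laid out. A minimal base R sketch of the effect, using a made-up three-column layout (the widths and the names YEAR, URBAN and AGE are assumptions for illustration, not taken from the Zambia extract):

```r
# Hypothetical fixed-width records: YEAR (4 chars), URBAN (1 char), AGE (3 chars)
tmp <- tempfile(fileext = ".dat")
writeLines(c("20101034", "20102007", "20101061"), tmp)

# A generic reader finds no separators, so each record lands in one column
one_col <- read.table(tmp)

# Supplying the column widths (which the DDI provides for a real extract)
# splits each record into its fields
fwf <- read.fwf(tmp, widths = c(4, 1, 3), col.names = c("YEAR", "URBAN", "AGE"))
```

For a real extract the widths come from the DDI, which is why the ipumsr route needs both files.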
Here is my adaptation of their R file for R-Instat:

# NOTE: To load data, you must download both the extract's data and the DDI
# and also set the working directory to the folder with these files (or change the path below).

if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")

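# ipums_example() looks up example files bundled with ipumsr, so `zambia` is not used below.
zambia <- ipumsr::ipums_example("C:/Users/RogerStern/Dropbox (SSD)/R-Instat Library Datasets/IPUMS")
# Read the DDI codebook, then the data file it describes.
ddi <- ipumsr::read_ipums_ddi("C:/Users/RogerStern/Dropbox (SSD)/R-Instat Library Datasets/IPUMS/ipumsi_00007.xml")
data <- ipumsr::read_ipums_micro(ddi)
# Import the resulting data frame into R-Instat.
data_book$import_data(data_tables = list(data = data))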

I installed the ipumsr package (from within R-Instat) and ran this in the R-Instat script window. It worked - rather to my surprise.
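The script hard-codes one extract. A more reusable version might look something like the sketch below. The function name `read_ipums_extract` and its arguments are hypothetical, not part of ipumsr or R-Instat, and it assumes the dat file sits in the same folder as the DDI, as in the download above.

```r
# Sketch only: `read_ipums_extract` is a hypothetical helper name.
read_ipums_extract <- function(folder, ddi_name, n_max = Inf) {
  ddi_path <- file.path(folder, ddi_name)
  # Fail early with a clear message before touching ipumsr
  if (!file.exists(ddi_path)) {
    stop("DDI file not found: ", ddi_path)
  }
  if (!requireNamespace("ipumsr", quietly = TRUE)) {
    stop("The ipumsr package is required: install.packages('ipumsr')")
  }
  ddi <- ipumsr::read_ipums_ddi(ddi_path)
  # n_max limits the rows read, useful for a quick look at a very large file
  ipumsr::read_ipums_micro(ddi, n_max = n_max)
}

# Usage (not run here):
# data <- read_ipums_extract("path/to/IPUMS", "ipumsi_00007.xml", n_max = 10000)
# data_book$import_data(data_tables = list(data = data))
```

Reading only the first few thousand rows would be one way to check an extract before importing all 1.3 million records.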

It actually works very well in that the data all come in as numeric, but they are labelled and so the labels attach automatically when you make them into factors.
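To show what "the labels attach automatically" amounts to, here is a base R sketch of the idea. The column, its codes and the helper `labelled_to_factor` are made up for illustration; the sketch assumes value labels are stored as a named `labels` attribute on a numeric vector, which is how labelled columns from haven and ipumsr are usually represented. In practice `as_factor()` from those packages does this job.

```r
# Hypothetical labelled column: numeric codes plus a named "labels" attribute
urban <- c(1, 2, 2, 1, 9)
attr(urban, "labels") <- c(Urban = 1, Rural = 2, Unknown = 9)

# Hypothetical helper: use the codes as factor levels and the names as labels
labelled_to_factor <- function(x) {
  labs <- attr(x, "labels")
  factor(as.vector(x), levels = unname(labs), labels = names(labs))
}

urban_factor <- labelled_to_factor(urban)
```

The numeric codes stay in the original column, so nothing is lost by converting only when a factor is needed.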

So what should we do?

a) Have a simple Open from IPUMS menu item and dialogue. (I am sure we can also hide it as an option in our File > Open from File dialogue, but this would also be a public way of supporting IPUMS data.) It should be an easy dialogue, though we may want to see what else in the ipumsr package might be useful at the same time.

b) Check whether improvements are needed in the column metadata window. This would be a separate issue, and possibly useful in general. Here is a picture of part of the file: image

There is a final, sometimes very long, variable description field (in addition to the variable name and label). Notice there is no left or right scrolling, so here I can't move left to see the variable name, nor right to read the full description. Perhaps any extra fields should have a default width and allow word wrap and also (possibly) editing? The labelling is automatic here. If we make it a factor and edit the labels, does this change the labelling in the metadata? By the way, the dat file is about 350 MB, though the zipped file is under 30 MB. The R data file is 33 MB.
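On the word-wrap idea for long description fields, a rough base R sketch of wrapping to a default width is below. The metadata rows, the 40-character width and the helper `wrap_desc` are all made up for illustration; the column names follow the name / label / description layout that `ipumsr::ipums_var_info()` reports for a DDI. How R-Instat's metadata grid would actually display this is a separate question.

```r
# Hypothetical metadata rows; the descriptions are invented, not IPUMS text
meta <- data.frame(
  var_name = c("URBAN", "AGE"),
  var_label = c("Urban-rural status", "Age"),
  var_desc = c(
    "This variable indicates whether the household was located in a place designated as urban or as rural at the time of the census",
    "This variable gives the age of the person in years as of the last birthday before the census date"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wrap each description so every line is shorter than `width`
wrap_desc <- function(x, width = 40) {
  vapply(
    x,
    function(s) paste(strwrap(s, width = width), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
}

meta$var_desc_wrapped <- wrap_desc(meta$var_desc)
cat(meta$var_desc_wrapped[1], "\n")
```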

rdstern commented 4 years ago

I have now found the situation in the column metadata is both better and worse than I thought.

a) What's better is that with the view shown above you can double-click in the name of the final field and then the scrolling is OK.

b) But you then lose the name of the field.

c) And I tried to edit the field, because it was messing up the structure. After clicking a bit it said:

image

And then, more scarily, it showed our really threatening message:

image