IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Multiple missing values #3040

Open rdstern opened 7 years ago

rdstern commented 7 years ago

The work with sjmisc is motivated by trying to help users who transfer from SPSS. The first important component is the set of issues around factor (category) columns, which we are close to dealing with very well.

The second is the facility for R to cope with multiple (different) missing value codes. These can occur in any type of column, but are most easily related to factor columns. For example, you may have a question with the following alternative answers: 0) Yes 1) No 95) Missing 96) Refused to answer 97) Not at home

In the alternatives above, the codes 95, 96 and 97 are all considered missing (i.e. NA in R), but they are different. The important point is that we hope missing values are "uninformative", i.e. we have no reason to think that, had they not been missing, they would more likely have been Yes or more likely No. Sometimes this is not the case: perhaps in the example above those who refused would have been more likely to answer No. (We sometimes discover this sort of evidence by adding a qualitative component to the study.)

In climatic data we might have missing values in the record, so the date of the start of the rains would then be missing. Alternatively we might find no starting date in the period at all. This is sort of missing, but very different. It is more like censored data, in that we know the start was later than the period we allowed for. One way to cope with this is to have an alternative missing value code.

The importing package we used copes with this in R by (apparently) having an extra byte added to the missing value code. This allows an extra symbol to be attached to NA, so we could also have NAa, NAb, etc. These are called tagged NAs.

In my understanding this is a really neat trick, because ordinary R commands just consider these all to be NA and so are unaffected. But some commands can take account of the tags in ways that are useful.
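A minimal sketch of how tagged NAs behave, assuming the haven package (which provides `tagged_na()`, `na_tag()` and `is_tagged_na()`) is the mechanism in use. The tag letters "a", "b" and "c" are arbitrary choices for this illustration, standing in for the codes 95, 96 and 97 from the example above.

```r
library(haven)

# Answers to a Yes/No question: 0 = Yes, 1 = No, plus three kinds of missing.
# The tags "a", "b", "c" are illustrative only.
x <- c(0, 1, tagged_na("a"), 1, tagged_na("b"), tagged_na("c"), NA)

# Ordinary R functions treat every tagged value as a plain NA.
n_missing <- sum(is.na(x))
mean_answered <- mean(x, na.rm = TRUE)

# Tag-aware functions can still tell the kinds of missing apart.
tags <- na_tag(x)                   # NA for real values and for untagged NA
refused <- is_tagged_na(x, "b")     # TRUE only where the tag is "b"

print(n_missing)
print(tags)
```

Printing `x` itself shows only `NA` for the tagged values, which is why code that is unaware of the tags is unaffected.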

So what do we want?
1) Importing is (I think) already OK.
2) Simple tables to find the frequencies of the different NAs should be OK, because the sjPlot commands can take account of these codes. This may happen already, or we may add an extra option to those dialogues.
3) These frequency tables are one reason why the first column could be anything (i.e. numeric or text, and not just a factor).
4) We should be able to get information on the tagged NAs and also be able to turn codes into tagged NAs in a column. I think this could (simply?) use the get_na and set_na functions in sjmisc.
5) Sometimes we will do this "behind the scenes", for example in our start of the rains dialogue.
6) In general I am not so clear on which general dialogues would be useful here. This needs some discussion. Earlier (even with just one missing value) I was keen on a simple dialogue to set or unset a missing value. This is now all part of the Prepare > Data Frame > Replace Values dialogue, so that could be extended.
7) But I wonder if the subject is now sufficiently important that we (also) have a new special dialogue? If so, then we need to think about its structure.
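Points 2 and 4 can be sketched as follows. This uses haven and base R rather than the get_na and set_na functions mentioned above, so it is an illustration of the idea and not the proposed implementation. The mapping from codes to tag letters (`code_tags`) is hypothetical.

```r
library(haven)

# Raw codes as they might arrive: 0 = Yes, 1 = No, 95/96/97 = kinds of missing.
raw <- c(0, 1, 95, 1, 96, 97, 97, 0, NA)

# Hypothetical mapping from each missing-value code to a one-letter tag.
code_tags <- c("95" = "a", "96" = "b", "97" = "c")

# Point 4: turn the codes into tagged NAs.
x <- raw
for (code in names(code_tags)) {
  x[which(raw == as.numeric(code))] <- tagged_na(code_tags[[code]])
}

# Point 2: frequencies of the different kinds of missing value.
# An ordinary (untagged) NA appears in the table under <NA>.
na_counts <- table(na_tag(x)[is.na(x)], useNA = "ifany")
print(na_counts)
```

Because the tags are single letters, a dialogue would presumably also need to keep the mapping back to the original codes and their labels.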

rdstern commented 7 years ago

The new sjlabelled package contains the facilities for tagged missing values. If they could be added in 4.1 they would also be useful in the climatic analyses. It would be a really good (and I hope simple) feature to add. One place is the start of the rains dialogue, where missing data would give a missing start, while no possible start could now give a tagged missing value, which is different and could be reported as such.
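The distinction for the start of the rains could look something like the sketch below. The function `start_of_rains()`, its `threshold` argument and the tag "n" are all hypothetical and much simpler than the real dialogue's definition. It uses haven's `tagged_na()` and returns an ordinary NA whenever any rainfall value is missing.

```r
library(haven)

# Hypothetical helper: index of the first day with rainfall >= threshold.
start_of_rains <- function(rain, threshold = 20) {
  if (anyNA(rain)) {
    return(NA_real_)            # data missing, so the start is unknown
  }
  wet_days <- which(rain >= threshold)
  if (length(wet_days) == 0) {
    return(tagged_na("n"))      # data complete, but no start in the period
  }
  as.numeric(wet_days[1])
}

normal_year <- start_of_rains(c(0, 2, 25, 30))
gap_year    <- start_of_rains(c(0, NA, 25, 1))
dry_year    <- start_of_rains(c(0, 2, 5, 1))

# Both are NA to ordinary R code, but only the dry year carries the tag.
print(c(is.na(gap_year), is.na(dry_year)))
print(na_tag(dry_year))
```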

rdstern commented 2 years ago

I now have data for analysis that (I think) needs this facility. I assume it will affect the data sheet and hence the data book structure, but I hope in a minor way. I think it is needed to import data from SPSS better (and from Stata or SAS, which also allow multiple missing value codes). I hope it could easily be added in a minimal way, because I suggest that's all we need. An example data frame is hh (household) from the MICS data. I give some example variables here. hc3 is a labelled numeric variable for the number of rooms for sleeping in the house. This is numeric from 1 upwards, with a single labelled value of 99 for NO RESPONSE. (There are just 6 of these, and also 2180 ordinary missing values.) Another numeric variable is HH26B, where the single labelled value is 97, meaning INCONSISTENT. Many factor variables also have this; for example HC7B (have a radio) is Yes/No with 9 = NO RESPONSE.

It is not possible (reliably) to specify 99 or 97 as missing on transfer, because there isn't yet any facility for that on input from SPSS and anyway those values might be ok for some other variables.
So I am assuming that if we have multiple missing value codes (via tagged values) then they will automatically be defined as missing. I'm not (yet) interested in analysing those codes, so would just like them treated as missing.
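One way this could work per variable is sketched below, assuming haven's `labelled_spss()` class, which stores missing-value codes for each column separately (so 99 can be missing in one variable and valid in another). The data are invented stand-ins for hc3, not the MICS values.

```r
library(haven)

# Invented stand-in for hc3: rooms for sleeping, with 99 = NO RESPONSE
# declared as a missing-value code for this variable only.
hc3 <- labelled_spss(
  c(1, 2, 99, 3, NA, 2),
  labels = c("NO RESPONSE" = 99),
  na_values = 99
)

# is.na() already reports the declared code as missing.
declared_missing <- is.na(hc3)

# For analysis, turn declared codes into ordinary NA and drop the labels.
rooms <- zap_labels(zap_missing(hc3))
mean_rooms <- mean(rooms, na.rm = TRUE)

print(declared_missing)
print(mean_rooms)
```

When reading an SPSS file directly, `haven::read_sav()` has a `user_na` argument controlling whether codes declared as missing in the file are kept in this form or converted to NA on import. Codes that are only labelled, and not declared missing in the file, would still need a step like the one above.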