HertieDataScience / SyllabusAndLectures

Hertie School of Governance Introduction to Collaborative Social Science Data Analysis
MIT License

Multiple imputation of missing values #101


Diegotab commented 8 years ago

Hi!

I have been trying for a while to use the package "missForest" to impute the missing values in our dataset. However, I keep getting the same error. I also tried the package "mice", but it didn't work either. Please find attached a screenshot of the error.

[screenshot 2016-04-29 18 32 13]

mberneaud commented 8 years ago

What do you mean by find NAs? Are you just trying to find the position of the NAs in your entire data frame? Or do the NAs in your data set have some weird value which has to be coded NA for the further use of the data set?

Your question piqued my interest, so please provide some more context so we can try to solve it.

sloloris commented 8 years ago

@mberneaud so basically, we want to run some linear models for different years of our panel dataset and compare them, but certain key variables have missing values for certain years, namely the Gini coefficient, which is often only reported every couple of years for each country. Since it also tends not to change much, we were hoping to simply carry the available values across the other years. We realize that a more statistically sound method would be to actually impute new values, but for so many of the countries only one or two values are available out of the 14 years we're covering.

Let us know if you have any concept of how to do that! It would be super helpful. Otherwise we might have to drop the variable.

mberneaud commented 8 years ago

I could've helped you with any kind of finding and eliminating missing values, but I've never done imputation to replace them, so unfortunately I can't share any past experience on the matter.

From the look of it, there are NAs in the variables you specified as X or Y, which are used for the imputation. I'd suggest checking whether, and how many, NAs exist in the variables used to train the algorithm. I generally do this with the following code:

any(is.na(dataframe$variable))  
# Returns TRUE if there are any NAs in the variable vector

sum(is.na(dataframe$variable))
# returns the number of NAs in the vector
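The same check can be run over a whole data frame at once rather than one variable at a time. The following is only a minimal sketch using base R; the data frame `df` and its columns are hypothetical stand-ins, not the data set from the question.

```r
# Hypothetical toy data frame standing in for the real panel data
df <- data.frame(
    country = c("A", "A", "B", "B"),
    year    = c(2000, 2001, 2000, 2001),
    gini    = c(30.1, NA, NA, NA),
    gdp     = c(1.2, 1.3, NA, 2.1)
)

# Number of NAs in every column
na_counts <- colSums(is.na(df))

# Share of missing values in every column
na_share <- colMeans(is.na(df))

# Indices of rows that contain at least one NA
incomplete_rows <- which(!complete.cases(df))
```

Printing `na_counts` shows which variables would be affected before any imputation is attempted.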

Another thing: what are you using the prodNA function for if your data already contains missing values? As I understand the package documentation, it is only meant for testing: you take a complete matrix, introduce some random NAs, and then see how well missForest does at imputing the correct values. Have you tried running missForest on your "real" data without introducing any NAs manually?
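To illustrate that testing workflow, here is a rough sketch in base R only. It does not call missForest or prodNA: the NAs are introduced by hand and a plain mean imputation stands in for the imputation method, so every name below is a hypothetical example rather than the package's API.

```r
set.seed(1)

# A complete (no NAs) toy data set
complete <- data.frame(x = rnorm(50), y = rnorm(50))

# Knock out 10 randomly chosen values of x to simulate missingness
with_na <- complete
holes <- sample(nrow(with_na), 10)
with_na$x[holes] <- NA

# Stand-in imputation: replace each NA with the mean of the observed values
imputed <- with_na
imputed$x[is.na(imputed$x)] <- mean(with_na$x, na.rm = TRUE)

# Compare the imputed values with the true values that were removed
rmse <- sqrt(mean((imputed$x[holes] - complete$x[holes])^2))
```

This kind of comparison is only possible because the true values are known, which is why artificially introducing NAs makes sense for benchmarking but not for data that is already incomplete.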

christophergandrud commented 8 years ago

My main experience imputing missing data is with Amelia. Generally works pretty well.

I think in your case, if I remember correctly, you can do a much simpler imputation. For Gini, just assume that the value in missing years is the same as in the most recent observed year. Gini generally doesn't change that dramatically over time spans of a few years, so my prior is that it's unlikely a more complicated imputation model would predict missing Gini values better.

sloloris commented 8 years ago

@christophergandrud what would be the correct command for this? Everything in the Amelia package seems to want to construct a model to impute the values, which probably isn't really possible anyways since for many of the countries we only have the Gini coefficient for one or two of the 12 years. I'm assuming we can write some sort of loop that we can then apply to all of the countries, but I've tried googling and I can't even really figure out what it would be called to pull over the values like that and I can't seem to find any tutorials anywhere.

christophergandrud commented 8 years ago

If you just want to fill in a variable with the previous values, create a loop like this (assuming the variable is already in time order):

Data <- data.frame(first = c(rep('A', 5), rep('B', 5)),
                   second = c(1, NA, NA, NA, 3, NA, 1, NA, 3, 3))

# Start at row 2: row 1 has no previous row to copy from
for (i in 2:nrow(Data)) {
    if (is.na(Data[i, 'second'])) Data[i, 'second'] <- Data[i-1, 'second']
}

Note that I think you have grouped data so you can turn the loop into a function and use dplyr to apply it to each group separately. Something like:

library(dplyr)

# Note that this version of the function has been modified to work on the variable (a vector) directly
fill_down <- function(x) {
    for (i in seq_along(x)) {
        # Skip the first element: it has no previous value to copy
        if (i != 1 && is.na(x[i])) x[i] <- x[i-1]
    }
    return(x)
}

Data <- Data %>% group_by(first) %>%
            mutate(second_imputed = fill_down(second))

You might even want a function that fills missing values not just with previous values but also with future values when there are no previous values (this might make sense for Gini over short time spans):

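fill_down_up <- function(x) {
    # Forward pass: fill each NA with the previous value
    for (i in seq_along(x)) {
        if (i != 1 && is.na(x[i])) x[i] <- x[i-1]
    }
    # Backward pass: fill any remaining leading NAs with the next value
    for (i in rev(seq_along(x))) {
        if (i != length(x) && is.na(x[i])) x[i] <- x[i+1]
    }
    return(x)
}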

Data <- Data %>% group_by(first) %>%
            mutate(second_imputed_2 = fill_down_up(second))
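As a possible alternative to the hand-written loops, tidyr's `fill()` can do the same down-then-up fill within groups. This is a sketch assuming a tidyr version that supports `.direction = "downup"`; the `Filled` name is just an illustrative choice, and the toy data is the same as above.

```r
library(dplyr)
library(tidyr)

Data <- data.frame(first = c(rep('A', 5), rep('B', 5)),
                   second = c(1, NA, NA, NA, 3, NA, 1, NA, 3, 3))

Filled <- Data %>%
    group_by(first) %>%
    # Copy the variable so the original column stays untouched
    mutate(second_imputed_2 = second) %>%
    # Fill down within each group, then fill any leading NAs upwards
    fill(second_imputed_2, .direction = "downup") %>%
    ungroup()
```

With the toy data, group B starts with an NA that has no previous value, so it is filled from the next observation instead.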

Hope that helps.