FarrellDay / miceRanger

miceRanger: Fast Imputation with Random Forests in R

There is no reproducibility #10

Closed statunizaga closed 3 years ago

statunizaga commented 3 years ago

There is an issue here. For example, say I have two data sets that share one identical observation. When I apply the impute function to each, I expect to get the same imputed value for that observation, but that is not the case: the imputed value is different. Help!

samFarrellDay commented 3 years ago

You can get reproducible results by using the set.seed() function. This will ensure you get the same results if the same script is called multiple times. It will not, however, ensure the same results if you call the function the same way multiple times in the same session (unless you call set.seed() before each call).
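As a minimal sketch of that seeding behavior, the base R snippet below uses sample() as a stand-in for any function that draws random numbers (it does not call miceRanger, and the variable names are only illustrative). Setting the seed once fixes the starting state of the random number generator, but each draw advances that state, so a second call in the same session continues from wherever the first one stopped.

```r
# Base R only: sample() stands in for any call that consumes random numbers.
set.seed(1)
first  <- sample(1:100, 5)
second <- sample(1:100, 5)    # RNG state has advanced, so this draw generally differs

set.seed(1)
reseeded <- sample(1:100, 5)  # same seed, same starting state as `first`

identical(first, reseeded)    # TRUE
```

Resetting the seed immediately before each call is what makes two calls in one session comparable.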

There is randomness involved when imputing new data. This comes from the mean matching. In the mean matching process, 1 final value is chosen at random from the N closest predictions from the model. So, you shouldn't expect impute to always be deterministic, unless you have valueSelector = 'value'.
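The selection step described above can be sketched roughly as follows. This is an illustration in base R, not miceRanger's internal code: the function pick_value() and its arguments (pred, cand_pred, cand_obs, n, selector) are hypothetical names chosen for the example, and the candidate data is made up.

```r
# Illustrative only: a toy version of "pick 1 of the N closest", not miceRanger internals.
pick_value <- function(pred, cand_pred, cand_obs, n = 5, selector = "meanMatch") {
  # With selector = "value", the model prediction is used directly (deterministic).
  if (selector == "value") return(pred)
  n <- min(n, length(cand_pred))
  # Indices of the n candidates whose predictions are closest to `pred`
  closest <- order(abs(cand_pred - pred))[seq_len(n)]
  # Randomly choose one of them and return its observed value
  cand_obs[closest[sample.int(n, 1)]]
}

cand_pred <- c(4.9, 5.1, 5.0, 6.3, 5.2, 7.0)  # model predictions for observed rows
cand_obs  <- c(5.0, 5.2, 4.8, 6.4, 5.1, 7.1)  # the values actually observed there

matched <- pick_value(5.04, cand_pred, cand_obs, n = 3)           # varies between calls
direct  <- pick_value(5.04, cand_pred, cand_obs, selector = "value")  # always 5.04
```

In this toy setup the random draw is the only source of variation, which is why skipping it (the "value" branch) gives the same answer every time.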

statunizaga commented 3 years ago

Yeah, I used set.seed() and I got two different imputed values for the same observation. I'm going to try valueSelector = "value". I'll keep you posted.

samFarrellDay commented 3 years ago

The following shows what I am talking about:

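library(miceRanger)

# Setup: ampute 25% of iris, then fit the imputation models
data(iris)
ampIris <- amputeData(iris, perc = 0.25)

# returnModels = TRUE keeps the fitted models so impute() can reuse them
miceObj <- miceRanger(ampIris, verbose = FALSE, returnModels = TRUE)

# Reset the seed immediately before each impute() call
set.seed(1)
i1 <- impute(ampIris, miceObj)

set.seed(1)
i2 <- impute(ampIris, miceObj)

# Every element of the comparison is TRUE
i1$imputedData$Dataset_1 == i2$imputedData$Dataset_1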

This shows that calling set.seed() with the same seed before each impute() call results in the same imputed values.

statunizaga commented 3 years ago

I got it. Problem solved with valueSelector = "value". Finally, thanks for creating this package. It's amazing.