SteffenMoritz / imputeR

CRAN R package: Impute missing values based on automated variable selection
GNU General Public License v3.0

Should there be a convergence tolerance parameter for `impute`? #3

Open travis-leith opened 3 years ago

travis-leith commented 3 years ago

I know there is a parameter for the maximum number of iterations, but what about one to control when convergence has been reached? What is the convergence criterion anyway? I have been testing this on some synthetic data using `pcrR`, and it doesn't seem to stop even though "Difference" is less than 8e-16.

SteffenMoritz commented 3 years ago

Hey Travis, thanks for opening an issue.

Yes, I think this is true: processing only stops when maxiter is reached.

The package was originally built by Lingbing during his PhD research (but he has had no more time to maintain it). It works as intended, but you are totally right, it could be more user-friendly and streamlined in some parts. I have already added several updates in this direction, but in recent years I have not found time for more extensive updates.

About the convergence criterion (I had to look this up myself): a comment in the code says that it is derived from MissForest's convergence criterion.

Here is the imputeR code for it:

    if (t.type == "numeric") {
        convNew[t.co2] <- sum((ximp[, t.ind] - ximp.old[, t.ind])^2)/sum(ximp[, t.ind]^2)
    } else {
        dist <- sum(as.character(as.matrix(ximp[, t.ind])) !=
            as.character(as.matrix(ximp.old[, t.ind])))
        convNew[t.co2] <- dist/(n * sum(Type == "character"))
    }

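So for numeric data it seems to be: the sum of squared differences divided by the sum of squared current values, where the differences are (values in the current iteration) - (values in the previous iteration). For non-numeric data it is the share of entries that differ between the two iterations.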

In my opinion this is a reasonable convergence criterion (there are other formulas that work in the same manner). The main takeaway here is: the smaller the differences in imputed values between subsequent iterations, the smaller the convergence value will be. If there is no difference between iterations, the convergence value will be 0.
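To make the numeric formula concrete, here is a minimal sketch on toy data. The helper `conv_numeric` and the two small matrices are made up for illustration and are not part of imputeR; the expression inside the helper just mirrors the numeric branch quoted above.

```r
# Hypothetical helper (not part of imputeR): relative squared change
# between the values of two successive iterations.
conv_numeric <- function(ximp, ximp_old) {
  sum((ximp - ximp_old)^2) / sum(ximp^2)
}

# Toy data: only the last cell differs between the two iterations.
ximp_old <- matrix(c(1, 2, 3, 4), nrow = 2)
ximp_new <- matrix(c(1, 2, 3, 5), nrow = 2)

conv_numeric(ximp_new, ximp_old)  # 1 / (1 + 4 + 9 + 25) = 1/39
conv_numeric(ximp_new, ximp_new)  # 0, nothing changed between iterations
```

Because the change is divided by the sum of squared current values, the measure is relative to the scale of the data.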

So it basically tracks whether the values still change between iterations (convergence). It may be worth mentioning that it does not try to assess how well the imputations fit the data.

It seems like a very reasonable idea to add an option to stop when a certain convergence value is reached, or at least to stop automatically when it is 0. I'll add this to the ToDo list.
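A rough sketch of what such a stopping rule could look like is below. It is not imputeR code: the `tol` parameter, `iterate_until_converged`, and the toy `update_step` (a stand-in for one imputation pass) are all assumptions made for illustration.

```r
# Toy stand-in for one imputation pass (hypothetical, not imputeR code).
# Repeated application converges to the fixed point 2.
update_step <- function(x) 0.5 * x + 1

# Hypothetical loop with an early stop: `tol` is an assumed parameter name.
iterate_until_converged <- function(x, max_iter = 100, tol = 1e-8) {
  iter <- 0L
  conv <- NA_real_
  for (iter in seq_len(max_iter)) {
    x_old <- x
    x <- update_step(x_old)
    # Same relative squared change as in the numeric branch above.
    conv <- sum((x - x_old)^2) / sum(x^2)
    if (conv < tol) break  # stop early once the change is small enough
  }
  list(x = x, iterations = iter, conv = conv)
}

res <- iterate_until_converged(c(10, -3, 0))
res$iterations  # far fewer than max_iter
res$conv        # below tol
```

Comparing against a small tolerance rather than exactly 0 is probably the safer choice, since with floating-point arithmetic the difference can stay at a tiny non-zero value such as the 8e-16 reported above.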

Just in case you are interested - we are always open to pull requests and further contributors to the package, should you like to add something.