Dear Polkas, thanks for your suggestion.
It would be wonderful if we could speed up mice.impute.pmm(). For datasets with over 100,000 rows, mice.impute.pmm() is time-consuming, spending most of its time in the partial sort. Our current advice is to break up larger datasets into chunks of 100,000, but that solution is not ideal.
A faster method should not affect any of the statistical properties. Could your code be considered a drop-in replacement for match.cpp?
The current C code called by mice.impute.pmm() performs the following steps:
For each of the `n0` elements in `mis`
1) calculate the difference with `obs`
2) add small noise to break ties
3) find the `k` indices of the `k` closest predictors
4) randomly draw one index
Return the vector of `n0` matched positions.
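For illustration, here is a rough R sketch of these four steps (not the actual C code; `pmm_match_sketch` is an invented name):

```r
# Sketch only: assumes length(obs) >= k
pmm_match_sketch <- function(obs, mis, k = 5L) {
  vapply(mis, function(m) {
    d    <- abs(obs - m)                     # 1) difference with obs
    d    <- d + runif(length(d), 0, 1e-10)   # 2) small noise to break ties
    kset <- order(d)[seq_len(k)]             # 3) indices of the k closest
    kset[sample.int(k, 1L)]                  # 4) randomly draw one index
  }, integer(1))
}
```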
Is each of these done in your solution? If so, I suggest that you open a pull request that adds mice.impute.pmmfast(), so that we can rerun our validation code on your improvements.
Hi, there is no need to calculate all the differences and choose the smallest distance; that step is avoided by finding the closest match directly for every missing value. The only thing I did not implement is adding small noise to break ties, although it is not clear to me that this step is needed: the method is recommended for continuous variables. I could still add a random vector of the same length as obs to the obs vector, but that forces us to sort again for every new missing value. Yes, I could open a pull request if you are satisfied with the new solution.
It is important to break ties explicitly. If we don't do that, we end up with the same set of matches for cases that have the same predicted value. The random generator does not need to be fancy, and may be fast and cheap.
The catch is that you may need to re-sort for every case. Perhaps there's a smarter way to break ties?
OK, I will think about how to solve it, as this is an important step. It might be easy to implement; we will see.
My solution seems superior in terms of time efficiency, but it returns values and does not take ties into account, like most solutions (I do not want to change this behaviour at the moment). My friend @bbaranow https://github.com/bbaranow recommended another solution to you: https://cran.r-project.org/web/packages/RANN/RANN.pdf. nn2() gives much better performance and has an eps option (randomness for ties); be careful with the default k. I will stick with my solution and implement it in my package, as it is at least a few times faster than the other best solutions. It is mainly much faster than nn2() as k increases, and k will be much bigger for longer datasets. In my opinion, k should be at least, and close to, sqrt(nobs). PS: another nice solution is knnx.index from https://cran.r-project.org/web/packages/FNN/FNN.pdf. If I can help you in any way, write to me without hesitation.
Some benchmarks: roughly x500 or x50 speedups; even x50 seems like a huge improvement.
> vals = rnorm(100000)
>
> ss = rnorm(10000)
>
> neibo(vals,ss,200)[1:10]
[1] 0.81831921 0.08398758 -1.51465258 0.63023247 -1.49508517 -0.44044937 0.23820550 -0.35262505 -0.05485046 -0.07017419
>
> vals[mice:::matcher(vals,ss,200)][1:10]
[1] 0.82230085 0.08458459 -1.50485773 0.63139331 -1.50042102 -0.44243047 0.23895351 -0.35322907 -0.05334500 -0.07054163
>
> tt <- t(rmultinom(10000, 1L, rep(1L, 200)))
>
> vals[rowSums(nn2(vals,ss,200)$nn.idx*tt)][1:10]
[1] 0.82414926 0.08634572 -1.50921362 0.62780761 -1.49780378 -0.43951530 0.24146793 -0.35572542 -0.05485046 -0.07178673
>
> vals[knnx.index(vals,ss,200)][1:10]
[1] 0.82116125 0.08442159 -1.50936181 0.63020304 -1.50124389 -0.44015239 0.23931200 -0.35350583 -0.05536845 -0.07103484
>
> microbenchmark::microbenchmark(neibo(vals,ss,200),
+ mice:::matcher(vals,ss,200),
+ rowSums(nn2(vals,ss,200)$nn.idx*tt),
+ rowSums(nn2(vals,ss,200,searchtype='radius',eps=0.0001)$nn.idx*tt),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[1]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[2]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[3]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[4]),
+ times=3)
Unit: milliseconds
expr min lq mean median uq max neval
neibo(vals, ss, 200) 28.21266 29.02797 29.52209 29.84328 30.1768 30.51031 3
mice:::matcher(vals, ss, 200) 15075.24839 15351.60940 15701.57446 15627.97041 16014.7375 16401.50457 3
rowSums(nn2(vals, ss, 200)$nn.idx * tt) 248.59326 260.20204 266.20996 271.81081 275.0183 278.22580 3
rowSums(nn2(vals, ss, 200, searchtype = "radius", eps = 1e-04)$nn.idx * tt) 91.31625 96.96789 132.15978 102.61954 152.5815 202.54354 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[1]) 241.95104 247.53542 254.26468 253.11980 260.4215 267.72321 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[2]) 1155.50126 1157.54041 1185.72831 1159.57957 1200.8418 1242.10411 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[3]) 6060.24203 6177.12370 6224.06726 6294.00536 6305.9799 6317.95439 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[4]) 4616.75115 4633.58245 4642.22515 4650.41375 4654.9621 4659.51054 3
Thanks for your comparisons.
I had a look at your suggestions.
nn2(): my understanding is that the eps argument defines the range of admitted matches. As far as I can see, it does not break ties.
I updated the previous benchmarks. As I understand it, eps could break ties if the next closest value is bigger/smaller by at least eps. However, I see that you are interested in the case where the vector is an integer vector and some values are repeated; in such cases PMM should mostly not be used. PMM, as I understand it, profits from the variability around a data point. You want some variability in terms of indices, but I think that is not so useful; we only need the values. What is the difference to the user between getting the value 10 from index 10 or the value 10 from index 98?
Thanks for adding the knnx.index method. If that method breaks ties, then it could be of interest for practical application in mice.
Predictive mean matching is a versatile imputation method, and - when properly done - works for continuous and discrete outcomes, as well as for continuous and discrete predictors. I have written a section on its strengths and limitations in my book: https://stefvanbuuren.name/fimd/sec-pmm.html.
Dear all,
as a side note I would like to add that the idea of sorting also appeared here for a while:
https://github.com/alexanderrobitzsch/miceadds/blob/master/src/miceadds_rcpp_pmm6.cpp
Even a pure R version can be made relatively fast.
Best, Alexander
Nice to meet you. Thanks for your response. I read this code, and the R version too. However, this code contains a lot of copy operations. What could be an error is looking across the -k,+k closest values rather than the k closest. I understand that this wider window is an effect of the algorithm (we do not know in which direction to go), so an approximation was accepted. Another major point is that the data are sorted by y_hat (although y values are imputed): y_hat (for obs) is compared with y_hat (for miss), not y (what we observed) with y_hat (for miss). This loses the information about what we observed, which is very unexpected. I found that it can easily be changed to use yobs in the code. If I replace everything after the regression with these 5 lines, I get what I/we want:
// indices of rows with a missing y
arma::uvec ind_mis = arma::find(ry01A==0);
// predictor rows for the missing cases
arma::mat xmis=xA.rows(ind_mis);
// predicted means for the missing cases
Rcpp::NumericVector ypred_mis = Rcpp::as<Rcpp::NumericVector>(Rcpp::wrap(xmis * coef));
// observed y values, sorted once so the matching can search an ordered vector
Rcpp::NumericVector y_full = Rcpp::as<Rcpp::NumericVector>(Rcpp::wrap(arma::sort(yobs)));
// match every predicted mean to one of the kk nearest observed values
Rcpp::NumericVector yimp = neibo(y_full,ypred_mis,kk);
neibo is the function presented in this issue, i.e. the replacement for the matcher function.
A few adjustments were made to the previous message.
Thank you for the response. I do not know why one should not sort y_hat, because we are matching y_hat and not the observed y. Agreed that there is a lot of copying, but I doubt that this increases computation time substantially. I also agree that one does not know the direction of searching, but I think that I handled it somehow (I wrote the function 3 or 4 years ago, so I would have to dive into the code again).
I do not know why one should not sort y_hat because we are matching y_hat and not the observed y.
You estimated y_hat for the whole dataset, even for observations that are not missing. We have the vector c(y_hat_miss, y_hat_obs), and then you sort it. When the vector is sorted, you make one pass over it, using a -k,+k window and sampling from it for every case. What I expected was to sort c(y_hat_miss, y_obs (not hat)) and then use this window, even though this method is only an approximation. I spent an hour on this code; it is very clever and tricky.
Alexander, thanks for the pointer to your approach.
Do you know whether your method makes any special provisions for handling ties in y_hat?
@stefvanbuuren: Yes, ties are broken. I have to say that my intention was not to translate the original matching function into faster code. First, I sort the vector (y_hat(obs), y_hat(mis)). I then look for nearest neighbors of y_hat(mis),i for a particular case i. There can be up to k nearest neighbors in y_hat(obs) that are smaller than y_hat(mis),i and up to k that are larger (if that many exist). I am aware that this is not the original pmm approach. For "typical" datasets, I do not expect dramatic differences between the implemented approach and the mice approach. But, in any case, presorting the data in advance and then conducting the case-wise matching will decrease computation time.
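A rough R-level sketch of this windowed matching (an illustration only, not the actual miceadds Rcpp code; the helper name is invented):

```r
# Sketch only: assumes length(yhat_obs) >= k
window_match_sketch <- function(yhat_obs, yhat_mis, k = 5L) {
  ord    <- order(yhat_obs)                       # presort the donors once
  sorted <- yhat_obs[ord]
  n      <- length(sorted)
  vapply(yhat_mis, function(p) {
    pos <- findInterval(p, sorted)                # position in the sorted donors
    win <- max(1L, pos - k + 1L):min(n, pos + k)  # up to k neighbours on each side
    ord[win[sample.int(length(win), 1L)]]         # draw one donor index from the window
  }, integer(1))
}
```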
@Polkas: I still do not see why I should sort y_obs, because the y_hat values are matched. You correctly described the implemented approach, which uses a window with smaller and larger y_hat values of the neighbors. When I originally wrote the R code, the aim was to avoid the loop over the cases with missing observations.
hey @alexanderrobitzsch if you hilite @stef you hilite the wrong guy, just like if i'd hilite @alex i'd piss off someone unrelated....
@stef: Agreed. I recognized it too late.
Here's a little contrived example just to make the problem with ties more explicit.
Suppose we have the following donors (i.e., cases with y observed):
data <- data.frame(
yhat = c(9, rep(10, 10), 11),
y = c(8, 6:15, 12))
> data
yhat y
1 9 8
2 10 6
3 10 7
4 10 8
5 10 9
6 10 10
7 10 11
8 10 12
9 10 13
10 10 14
11 10 15
12 11 12
Also, suppose we want to find k = 5 matches for a target case that has yhat_1 = 9.8. The closest match at yhat = 10 has 10 ties. Ideally, we would like a random draw from the values 6:15 (which all have the same closest yhat).
However, if we're coming from below, we will select the five donors with y-values 6:10. For the next target case, with - say - yhat_2 = 9.7, we will select the same cases, and so on, leading to a systematic bias in the imputes.
Randomly breaking the ties would solve it.
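As an illustrative sketch (not mice code; match_k is a made-up helper), jittering yhat with noise far smaller than any real difference lets each target draw a random subset of the tied donors:

```r
match_k <- function(target, donors, k = 5) {
  noisy <- donors$yhat + runif(nrow(donors), 0, 1e-8)  # tiny jitter breaks ties
  idx   <- order(abs(noisy - target))[seq_len(k)]      # k closest after jitter
  donors$y[idx]
}
match_k(9.8, data)  # a random five of the ten tied donors
match_k(9.7, data)  # generally a different five
```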
Maybe I would like to add a suggestion. If one has an appropriate matching procedure available that does not break ties, one could simply handle it by reshuffling the input data y, ry, and x. Of course, the reshuffling has to be applied in every iteration.
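A minimal sketch of the reshuffling idea (illustration only; the helper name is invented):

```r
# Permute the cases before matching, so ties are met in a different order each iteration.
# x is assumed to be a matrix (or data frame) of predictors.
reshuffle <- function(y, ry, x) {
  perm <- sample.int(length(y))  # new random permutation in every iteration
  list(y = y[perm], ry = ry[perm], x = x[perm, , drop = FALSE])
}
```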
@alexanderrobitzsch Thanks, I think reshuffling will help to get an unsorted y, and it is quite fast, so it helps to average out the bias. Targets having the same yhat will still obtain the same set of potential donors from which we draw one.
Perhaps it depends on the nature of the data whether the "same set" would be a problem. Can we trust reshuffling if we have a simple imputation model with just one categorical predictor? The answer is probably yes, but I will need to think about it a little more.
Notwithstanding my rusty knowledge of C++, I was able to create a speedy function that incorporates many of the suggestions and enhancements noted above.
The matchindex() function:
- uses the base::sample() random generator, so that we can keep track of .Random.seed;
- uses std::lower_bound() to search the ordered vector;
- finds the k closest matches;
- pre-draws the donor number h, so that we need only to find donor h <= k.
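As a rough R-level sketch of this idea (an illustration only, not the actual C++ code; the helper name is invented):

```r
# Sketch only: assumes length(obs) >= k
match_index_sketch <- function(obs, mis, k = 5L) {
  ord    <- order(obs)                                  # sort the donors once
  sorted <- obs[ord]
  n      <- length(sorted)
  h      <- sample.int(k, length(mis), replace = TRUE)  # pre-drawn donor ranks
  vapply(seq_along(mis), function(i) {
    pos  <- findInterval(mis[i], sorted)                # plays the role of std::lower_bound()
    cand <- max(1L, pos - k + 1L):min(n, pos + k)       # at most k donors on either side
    near <- cand[order(abs(sorted[cand] - mis[i]))][h[i]]
    ord[near]                                           # index of the h[i]-th closest donor
  }, integer(1))
}
```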
All in all, matchindex() and neibo() are now on par in terms of speed. Here are some timing results:
> set.seed(1)
> obs <- rnorm(1000)
> mis <- rnorm(1000)
> microbenchmark::microbenchmark(mice:::neibo(obs, mis, 2),
mice:::matcher(obs, mis, 2),
matchindex(obs, mis, 2))
Unit: microseconds
expr min lq mean median uq max neval
mice:::neibo(obs, mis, 2) 340 356 380 368 386 702 100
mice:::matcher(obs, mis, 2) 14082 14726 16083 14880 16228 24999 100
matchindex(obs, mis, 2) 254 269 295 289 315 432 100
> microbenchmark::microbenchmark(mice:::neibo(obs, mis, 5),
mice:::matcher(obs, mis, 5),
matchindex(obs, mis, 5))
Unit: microseconds
expr min lq mean median uq max neval
mice:::neibo(obs, mis, 5) 359 375 400 386 414 609 100
mice:::matcher(obs, mis, 5) 14363 14566 15878 14993 15744 24416 100
matchindex(obs, mis, 5) 288 305 338 335 355 504 100
> set.seed(1)
> obs <- rnorm(100000)
> mis <- rnorm(10000)
> microbenchmark::microbenchmark(mice:::neibo(obs, mis, 200),
mice:::matcher(obs, mis, 200),
matchindex(obs, mis, 200),
times = 3)
Unit: milliseconds
expr min lq mean median uq max neval
mice:::neibo(obs, mis, 200) 24.9 25.2 27.2 25.5 28.3 31.1 3
mice:::matcher(obs, mis, 200) 15614.2 15621.2 15784.9 15628.2 15870.3 16112.4 3
matchindex(obs, mis, 200) 28.3 28.8 30.1 29.3 31.0 32.8 3
Use the pmmc branch to replicate. A new dev argument to mice allows you to replace matcher() by matchindex().
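For example, one hypothetical way to try it (the repository path and branch name below are assumptions based on the description above):

```r
# install the development branch with the new matcher (path/branch assumed)
# remotes::install_github("stefvanbuuren/mice", ref = "pmmc")
library(mice)
imp <- mice(nhanes, method = "pmm", dev = TRUE, print = FALSE)  # dev = TRUE switches to matchindex()
```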
For datasets under 1,000 rows, you won't notice a big difference in speed; gains are good, e.g. by a factor of 2-3. However, for datasets with over 5,000 records, the speed gains can be spectacular. For example, when overimputing a dataset with 10,000 rows, predictive mean matching with matchindex() is about 150 times faster than matcher(). Predictive mean matching now approaches the speed of imputation under the normal model.
> library(mice)
> set.seed(1)
> walk <- data.frame(lapply(walking, as.numeric))[, 1:4]
> dim(walk)
[1] 890 4
> microbenchmark::microbenchmark(mice(walk, method = "pmm", print = FALSE),
mice(walk, method = "pmm", print = FALSE, dev = TRUE),
mice(walk, method = "norm", print = FALSE),
times = 20)
Unit: milliseconds
expr min lq mean median uq max neval
mice(walk, method = "pmm", print = FALSE) 239.6 245.1 251.4 248 255 281 20
mice(walk, method = "pmm", print = FALSE, dev = TRUE) 102.3 104.8 106.7 106 107 114 20
mice(walk, method = "norm", print = FALSE) 91.8 94.5 98.7 96 100 115 20
> data <- fdgs[, c(3, 5:6)]
> dim(data)
[1] 10030 3
> where <- make.where(data, "all")
> microbenchmark::microbenchmark(mice(data, where = where, print = FALSE),
mice(data, where = where, print = FALSE, dev = TRUE),
mice(data, where = where, method = "norm", print = FALSE),
times = 3)
Unit: milliseconds
expr min lq mean median uq max neval
mice(data, where = where, print = FALSE) 106890 107200 107352 107510 107583 107656 3
mice(data, where = where, print = FALSE, dev = TRUE) 713 743 765 774 792 809 3
mice(data, where = where, method = "norm", print = FALSE) 489 493 500 497 505 513 3
Thanks @Polkas for bringing these large gains to my attention. If you have any suggestions for further improvements, please let me know.
I expect that the new algorithm will have the same statistical properties as the old one, but we still need to look into this and confirm proper behaviour.
Great to see this improvement. This puts mice's PMM on another level. I will try to test it and provide some feedback shortly.
Awesome! Will do some testing.
I calculated part of Table 3.3 from FIMD for the new matchindex routine. The results are as follows:
| Method | d | Bias | % Bias | Coverage | CI Width | RMSE |
|---|---|---|---|---|---|---|
| Missing y, n = 50 | | | | | | |
| pmm | 1 | 0.016 | 5.4 | 0.884 | 0.252 | 0.071 |
| pmm-dev | 1 | 0.016 | 5.7 | 0.892 | 0.254 | 0.073 |
| pmm | 3 | 0.028 | 9.7 | 0.890 | 0.242 | 0.070 |
| pmm-dev | 3 | 0.026 | 8.9 | 0.894 | 0.244 | 0.071 |
| pmm | 5 | 0.039 | 13.6 | 0.876 | 0.241 | 0.075 |
| pmm-dev | 5 | 0.037 | 12.8 | 0.884 | 0.243 | 0.074 |
| pmm | 10 | 0.065 | 22.4 | 0.806 | 0.245 | 0.089 |
| pmm-dev | 10 | 0.063 | 21.6 | 0.807 | 0.243 | 0.087 |
| Missing x | | | | | | |
| pmm | 1 | -0.002 | 0.8 | 0.916 | 0.223 | 0.063 |
| pmm-dev | 1 | -0.001 | 0.3 | 0.916 | 0.224 | 0.064 |
| pmm | 3 | 0.002 | 0.9 | 0.931 | 0.228 | 0.061 |
| pmm-dev | 3 | 0.004 | 1.3 | 0.930 | 0.224 | 0.063 |
| pmm | 5 | 0.008 | 2.8 | 0.938 | 0.237 | 0.062 |
| pmm-dev | 5 | 0.010 | 3.4 | 0.941 | 0.238 | 0.064 |
| pmm | 10 | 0.028 | 9.6 | 0.946 | 0.261 | 0.067 |
| pmm-dev | 10 | 0.028 | 9.6 | 0.935 | 0.254 | 0.068 |
| Listwise deletion | | 0.000 | 0.0 | 0.946 | 0.251 | 0.063 |
| Missing y, n = 1000 | | | | | | |
| pmm | 1 | 0.001 | 0.2 | 0.929 | 0.056 | 0.014 |
| pmm-dev | 1 | 0.001 | 0.3 | 0.960 | 0.068 | 0.014 |
| pmm | 3 | 0.001 | 0.4 | 0.950 | 0.056 | 0.013 |
| pmm-dev | 3 | 0.001 | 0.4 | 0.950 | 0.056 | 0.013 |
| pmm | 5 | 0.002 | 0.6 | 0.951 | 0.055 | 0.013 |
| pmm-dev | 5 | 0.002 | 0.7 | 0.944 | 0.056 | 0.013 |
| pmm | 10 | 0.003 | 1.2 | 0.932 | 0.054 | 0.013 |
| pmm-dev | 10 | 0.003 | 1.2 | 0.937 | 0.055 | 0.013 |
| Missing x | | | | | | |
| pmm | 1 | 0.000 | 0.2 | 0.926 | 0.041 | 0.011 |
| pmm-dev | 1 | -0.002 | 0.6 | 0.988 | 0.085 | 0.013 |
| pmm | 3 | 0.000 | 0.1 | 0.933 | 0.041 | 0.011 |
| pmm-dev | 3 | -0.001 | 0.2 | 0.961 | 0.056 | 0.012 |
| pmm | 5 | 0.000 | 0.1 | 0.937 | 0.042 | 0.011 |
| pmm-dev | 5 | -0.000 | 0.1 | 0.952 | 0.049 | 0.012 |
| pmm | 10 | 0.000 | 0.1 | 0.928 | 0.042 | 0.011 |
| pmm-dev | 10 | -0.000 | 0.0 | 0.935 | 0.044 | 0.011 |
| Listwise deletion | | 0.000 | 0.1 | 0.955 | 0.050 | 0.012 |
For small samples, pmm (the current default) and pmm-dev (the new method) perform quite similarly. For large samples, pmm-dev seems to be even a little better calibrated than pmm, mostly erring on the conservative side. Thus, apart from a boost in speed, these results suggest that pmm-dev is also superior in terms of statistical quality.
Commit 4df1f3d incorporates the pmm speedup in mice 3.11.8.
I am the creator of the miceFast package. Currently I am developing a faster PMM solution, more precisely a faster version of your Rcpp matcher function. Here you can find test code; it is even 1000 times faster or more:
https://github.com/Polkas/miceFast/issues/10
As your package is much more popular than mine and this function is very important in your project, it is crucial to take a potential improvement into account.