Dear Polkas, thanks for your suggestion.
It would be wonderful if we could speed up mice.impute.pmm(). For datasets with over 100,000 rows, mice.impute.pmm() is time-consuming, spending most of its time in the partial sort. Our current advice is to break up larger datasets into chunks of 100,000, but that solution is not ideal.
A faster method should not affect any of the statistical properties. Could your code be considered a drop-in replacement for match.cpp?
The current C code called by mice.impute.pmm() performs the following steps:
For each of the `n0` elements in `mis`
1) calculate the difference with `obs`
2) add small noise to break ties
3) find the `k` indices of the `k` closest predictors
4) randomly draw one index
Return the vector of `n0` matched positions.
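For illustration, here is a rough R sketch of these four steps (not the actual C code; `pmm_match_sketch` is an invented name):

```r
# Sketch only: assumes length(obs) >= k
pmm_match_sketch <- function(obs, mis, k = 5L) {
  vapply(mis, function(m) {
    d    <- abs(obs - m)                     # 1) difference with obs
    d    <- d + runif(length(d), 0, 1e-10)   # 2) small noise to break ties
    kset <- order(d)[seq_len(k)]             # 3) indices of the k closest
    kset[sample.int(k, 1L)]                  # 4) randomly draw one index
  }, integer(1))
}
```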
Is each of these done in your solution? If so, I suggest that you open a pull request that adds mice.impute.pmmfast(), so that we can rerun our validation code on your improvements.
Hi, there is no need to calculate all the differences and choose the smallest distance; that step is avoided by finding the closest match directly for every missing value. The only thing I did not implement is adding small noise to break ties, although it is not clear to me that this step is needed: the method is recommended for continuous variables. I could still add a random vector of the same length as obs to the obs vector, but that forces us to sort again for every new missing value. Yes, I could open a pull request if you are satisfied with the new solution.
It is important to break ties explicitly. If we don't do that, we end up with the same set of matches for cases that have the same predicted value. The random generator does not need to be fancy, and may be fast and cheap.
The catch is that you may need to re-sort for every case. Perhaps there's a smarter way to break ties?
OK, I will think about how to solve it, as this is an important step. It might be easy to implement; we will see.
My solution seems superior in terms of time efficiency, but it returns values and does not take ties into account, like most solutions (I do not want to change this behaviour at the moment). My friend @bbaranow https://github.com/bbaranow recommended another solution to you: https://cran.r-project.org/web/packages/RANN/RANN.pdf. nn2() gives much better performance and has an eps option (randomness for ties); be careful with the default k. I will stick with my solution and implement it in my package, as it is at least a few times faster than the other best solutions. It is mainly much faster than nn2() as k increases, and k will be much bigger for longer datasets. In my opinion, k should be at least, and close to, sqrt(nobs). PS: another nice solution is knnx.index from https://cran.r-project.org/web/packages/FNN/FNN.pdf. If I can help you in any way, write to me without hesitation.
Some benchmarks: roughly x500 or x50 speedups; even x50 seems like a huge improvement.
> vals = rnorm(100000)
>
> ss = rnorm(10000)
>
> neibo(vals,ss,200)[1:10]
[1] 0.81831921 0.08398758 -1.51465258 0.63023247 -1.49508517 -0.44044937 0.23820550 -0.35262505 -0.05485046 -0.07017419
>
> vals[mice:::matcher(vals,ss,200)][1:10]
[1] 0.82230085 0.08458459 -1.50485773 0.63139331 -1.50042102 -0.44243047 0.23895351 -0.35322907 -0.05334500 -0.07054163
>
> tt <- t(rmultinom(10000, 1L, rep(1L, 200)))
>
> vals[rowSums(nn2(vals,ss,200)$nn.idx*tt)][1:10]
[1] 0.82414926 0.08634572 -1.50921362 0.62780761 -1.49780378 -0.43951530 0.24146793 -0.35572542 -0.05485046 -0.07178673
>
> vals[knnx.index(vals,ss,200)][1:10]
[1] 0.82116125 0.08442159 -1.50936181 0.63020304 -1.50124389 -0.44015239 0.23931200 -0.35350583 -0.05536845 -0.07103484
>
> microbenchmark::microbenchmark(neibo(vals,ss,200),
+ mice:::matcher(vals,ss,200),
+ rowSums(nn2(vals,ss,200)$nn.idx*tt),
+ rowSums(nn2(vals,ss,200,searchtype='radius',eps=0.0001)$nn.idx*tt),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[1]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[2]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[3]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[4]),
+ times=3)
Unit: milliseconds
expr min lq mean median uq max neval
neibo(vals, ss, 200) 28.21266 29.02797 29.52209 29.84328 30.1768 30.51031 3
mice:::matcher(vals, ss, 200) 15075.24839 15351.60940 15701.57446 15627.97041 16014.7375 16401.50457 3
rowSums(nn2(vals, ss, 200)$nn.idx * tt) 248.59326 260.20204 266.20996 271.81081 275.0183 278.22580 3
rowSums(nn2(vals, ss, 200, searchtype = "radius", eps = 1e-04)$nn.idx * tt) 91.31625 96.96789 132.15978 102.61954 152.5815 202.54354 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[1]) 241.95104 247.53542 254.26468 253.11980 260.4215 267.72321 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[2]) 1155.50126 1157.54041 1185.72831 1159.57957 1200.8418 1242.10411 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[3]) 6060.24203 6177.12370 6224.06726 6294.00536 6305.9799 6317.95439 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[4]) 4616.75115 4633.58245 4642.22515 4650.41375 4654.9621 4659.51054 3
Thanks for your comparisons.
I had a look at your suggestions.
nn2(): my understanding is that the eps argument defines the range of admitted matches. As far as I can see, it does not break ties.
I updated the previous benchmarks. As I understand it, eps could break ties if the next closest value is bigger/smaller by at least eps. However, I see that you are interested in the case where the vector is an integer vector and some values are repeated; in such cases PMM should mostly not be used. PMM, as I understand it, profits from the variability around a data point. You want some variability in terms of indices, but I think that is not so useful; we only need the values. What is the difference to the user between getting the value 10 from index 10 or the value 10 from index 98?
Thanks for adding the knnx.index method. If that method breaks ties, then it could be of interest for practical application in mice.
Predictive mean matching is a versatile imputation method, and - when properly done - works for continuous and discrete outcomes, as well as for continuous and discrete predictors. I have written a section on its strengths and limitations in my book: https://stefvanbuuren.name/fimd/sec-pmm.html.
Dear all,
as a side note I would like to add that the idea of sorting also appeared here for a while:
https://github.com/alexanderrobitzsch/miceadds/blob/master/src/miceadds_rcpp_pmm6.cpp
Even a pure R version can be made relatively fast.
Best, Alexander
Nice to meet you. Thanks for your response. I read this code, and the R version too. However, this code contains a lot of copy operations. What could be an error is looking across the -k,+k closest values rather than the k closest. I understand that this wider window is an effect of the algorithm (we do not know in which direction to go), so an approximation was accepted. Another major point is that the data are sorted by y_hat (although y values are imputed): y_hat (for obs) is compared with y_hat (for miss), not y (what we observed) with y_hat (for miss). This loses the information about what we observed, which is very unexpected. I found that it can easily be changed to use yobs in the code. If I replace everything after the regression with these 5 lines, I get what I/we want:
// indices of rows with a missing y
arma::uvec ind_mis = arma::find(ry01A==0);
// predictor rows for the missing cases
arma::mat xmis=xA.rows(ind_mis);
// predicted means for the missing cases
Rcpp::NumericVector ypred_mis = Rcpp::as<Rcpp::NumericVector>(Rcpp::wrap(xmis * coef));
// observed y values, sorted once so the matching can search an ordered vector
Rcpp::NumericVector y_full = Rcpp::as<Rcpp::NumericVector>(Rcpp::wrap(arma::sort(yobs)));
// match every predicted mean to one of the kk nearest observed values
Rcpp::NumericVector yimp = neibo(y_full,ypred_mis,kk);
neibo is the function presented in this issue, i.e. the replacement for the matcher function.
A few adjustments were made to the previous message.
Thank you for the response. I do not know why one should not sort y_hat, because we are matching y_hat and not the observed y. Agreed that there is a lot of copying, but I doubt that this increases computation time substantially. I also agree that one does not know the direction of searching, but I think that I handled it somehow (I wrote the function 3 or 4 years ago, so I would have to dive into the code again).
I do not know why one should not sort y_hat because we are matching y_hat and not the observed y.
You estimated y_hat for the whole dataset, even for observations that are not missing. We have the vector c(y_hat_miss, y_hat_obs), and then you sort it. When the vector is sorted, you make one pass over it, using a -k,+k window and sampling from it for every case. What I expected was to sort c(y_hat_miss, y_obs (not hat)) and then use this window, even though this method is only an approximation. I spent an hour on this code; it is very clever and tricky.
Alexander, thanks for the pointer to your approach.
Do you know whether your method makes any special provisions for handling ties in y_hat?
@stefvanbuuren: Yes, ties are broken. I have to say that my intention was not to translate the original matching function into faster code. First, I sort the vector (y_hat(obs), y_hat(mis)). I then look for nearest neighbors of y_hat(mis),i for a particular case i. There can be up to k nearest neighbors in y_hat(obs) that are smaller than y_hat(mis),i and up to k that are larger (if that many exist). I am aware that this is not the original pmm approach. For "typical" datasets, I do not expect dramatic differences between the implemented approach and the mice approach. But, in any case, presorting the data in advance and then conducting the case-wise matching will decrease computation time.
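A rough R-level sketch of this windowed matching (an illustration only, not the actual miceadds Rcpp code; the helper name is invented):

```r
# Sketch only: assumes length(yhat_obs) >= k
window_match_sketch <- function(yhat_obs, yhat_mis, k = 5L) {
  ord    <- order(yhat_obs)                       # presort the donors once
  sorted <- yhat_obs[ord]
  n      <- length(sorted)
  vapply(yhat_mis, function(p) {
    pos <- findInterval(p, sorted)                # position in the sorted donors
    win <- max(1L, pos - k + 1L):min(n, pos + k)  # up to k neighbours on each side
    ord[win[sample.int(length(win), 1L)]]         # draw one donor index from the window
  }, integer(1))
}
```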
@Polkas: I still do not see why I should sort y_obs, because the y_hat values are matched. You correctly described the implemented approach, which uses a window with smaller and larger y_hat values of the neighbors. When I originally wrote the R code, the aim was to avoid the loop over the cases with missing observations.
hey @alexanderrobitzsch if you hilite @stef you hilite the wrong guy, just like if i'd hilite @alex i'd piss off someone unrelated....
@stef: Agreed. I recognized it too late.
Here's a little contrived example just to make the problem with ties more explicit.
Suppose we have the following donors (i.e., cases with y observed):
data <- data.frame(
yhat = c(9, rep(10, 10), 11),
y = c(8, 6:15, 12))
> data
yhat y
1 9 8
2 10 6
3 10 7
4 10 8
5 10 9
6 10 10
7 10 11
8 10 12
9 10 13
10 10 14
11 10 15
12 11 12
Also, suppose we want to find k = 5 matches for a target case that has yhat_1 = 9.8. The closest match at yhat = 10 has 10 ties. Ideally, we would like a random draw from the values 6:15 (which all have the same closest yhat).
However, if we're coming from below, we will select the five donors with y-values 6:10. For the next target case, with - say - yhat_2 = 9.7, we will select the same cases, and so on, leading to a systematic bias in the imputes.
Randomly breaking the ties would solve it.
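As an illustrative sketch (not mice code; match_k is a made-up helper), jittering yhat with noise far smaller than any real difference lets each target draw a random subset of the tied donors:

```r
match_k <- function(target, donors, k = 5) {
  noisy <- donors$yhat + runif(nrow(donors), 0, 1e-8)  # tiny jitter breaks ties
  idx   <- order(abs(noisy - target))[seq_len(k)]      # k closest after jitter
  donors$y[idx]
}
match_k(9.8, data)  # a random five of the ten tied donors
match_k(9.7, data)  # generally a different five
```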
Maybe I would like to add a suggestion. If one has an appropriate matching procedure available that does not break ties, one could simply handle it by reshuffling the input data y, ry, and x. Of course, the reshuffling has to be applied in every iteration.
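A minimal sketch of the reshuffling idea (illustration only; the helper name is invented):

```r
# Permute the cases before matching, so ties are met in a different order each iteration.
# x is assumed to be a matrix (or data frame) of predictors.
reshuffle <- function(y, ry, x) {
  perm <- sample.int(length(y))  # new random permutation in every iteration
  list(y = y[perm], ry = ry[perm], x = x[perm, , drop = FALSE])
}
```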
@alexanderrobitzsch Thanks, I think reshuffling will help to get an unsorted y, and it is quite fast, so it helps to average out the bias. Targets having the same yhat will still obtain the same set of potential donors from which we draw one.
Perhaps it depends on the nature of the data whether the "same set" would be a problem. Can we trust reshuffling if we have a simple imputation model with just one categorical predictor? The answer is probably yes, but I will need to think about it a little more.
Notwithstanding my rusty knowledge of C++, I was able to create a speedy function that incorporates many of the suggestions and enhancements noted above.
The matchindex() function:
- uses the base::sample() random generator, so that we can keep track of .Random.seed;
- uses std::lower_bound() to search the ordered vector;
- finds the k closest matches;
- pre-draws the donor number h, so that we need only to find donor h <= k.
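As a rough R-level sketch of this idea (an illustration only, not the actual C++ code; the helper name is invented):

```r
# Sketch only: assumes length(obs) >= k
match_index_sketch <- function(obs, mis, k = 5L) {
  ord    <- order(obs)                                  # sort the donors once
  sorted <- obs[ord]
  n      <- length(sorted)
  h      <- sample.int(k, length(mis), replace = TRUE)  # pre-drawn donor ranks
  vapply(seq_along(mis), function(i) {
    pos  <- findInterval(mis[i], sorted)                # plays the role of std::lower_bound()
    cand <- max(1L, pos - k + 1L):min(n, pos + k)       # at most k donors on either side
    near <- cand[order(abs(sorted[cand] - mis[i]))][h[i]]
    ord[near]                                           # index of the h[i]-th closest donor
  }, integer(1))
}
```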
All in all, matchindex() and neibo() are now on par in terms of speed. Here are some timing results:
> set.seed(1)
> obs <- rnorm(1000)
> mis <- rnorm(1000)
> microbenchmark::microbenchmark(mice:::neibo(obs, mis, 2),
mice:::matcher(obs, mis, 2),
matchindex(obs, mis, 2))
Unit: microseconds
expr min lq mean median uq max neval
mice:::neibo(obs, mis, 2) 340 356 380 368 386 702 100
mice:::matcher(obs, mis, 2) 14082 14726 16083 14880 16228 24999 100
matchindex(obs, mis, 2) 254 269 295 289 315 432 100
> microbenchmark::microbenchmark(mice:::neibo(obs, mis, 5),
mice:::matcher(obs, mis, 5),
matchindex(obs, mis, 5))
Unit: microseconds
expr min lq mean median uq max neval
mice:::neibo(obs, mis, 5) 359 375 400 386 414 609 100
mice:::matcher(obs, mis, 5) 14363 14566 15878 14993 15744 24416 100
matchindex(obs, mis, 5) 288 305 338 335 355 504 100
> set.seed(1)
> obs <- rnorm(100000)
> mis <- rnorm(10000)
> microbenchmark::microbenchmark(mice:::neibo(obs, mis, 200),
mice:::matcher(obs, mis, 200),
matchindex(obs, mis, 200),
times = 3)
Unit: milliseconds
expr min lq mean median uq max neval
mice:::neibo(obs, mis, 200) 24.9 25.2 27.2 25.5 28.3 31.1 3
mice:::matcher(obs, mis, 200) 15614.2 15621.2 15784.9 15628.2 15870.3 16112.4 3
matchindex(obs, mis, 200) 28.3 28.8 30.1 29.3 31.0 32.8 3
Use the pmmc branch to replicate. A new dev argument to mice allows you to replace matcher() by matchindex().
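For example, one hypothetical way to try it (the repository path and branch name below are assumptions based on the description above):

```r
# install the development branch with the new matcher (path/branch assumed)
# remotes::install_github("stefvanbuuren/mice", ref = "pmmc")
library(mice)
imp <- mice(nhanes, method = "pmm", dev = TRUE, print = FALSE)  # dev = TRUE switches to matchindex()
```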
For datasets under 1,000 rows, you won't notice a big difference in speed; gains are good, e.g. by a factor of 2-3. However, for datasets with over 5,000 records, the speed gains can be spectacular. For example, when overimputing a dataset with 10,000 rows, predictive mean matching with matchindex() is about 150 times faster than matcher(). Predictive mean matching now approaches the speed of imputation under the normal model.
> library(mice)
> set.seed(1)
> walk <- data.frame(lapply(walking, as.numeric))[, 1:4]
> dim(walk)
[1] 890 4
> microbenchmark::microbenchmark(mice(walk, method = "pmm", print = FALSE),
mice(walk, method = "pmm", print = FALSE, dev = TRUE),
mice(walk, method = "norm", print = FALSE),
times = 20)
Unit: milliseconds
expr min lq mean median uq max neval
mice(walk, method = "pmm", print = FALSE) 239.6 245.1 251.4 248 255 281 20
mice(walk, method = "pmm", print = FALSE, dev = TRUE) 102.3 104.8 106.7 106 107 114 20
mice(walk, method = "norm", print = FALSE) 91.8 94.5 98.7 96 100 115 20
> data <- fdgs[, c(3, 5:6)]
> dim(data)
[1] 10030 3
> where <- make.where(data, "all")
> microbenchmark::microbenchmark(mice(data, where = where, print = FALSE),
mice(data, where = where, print = FALSE, dev = TRUE),
mice(data, where = where, method = "norm", print = FALSE),
times = 3)
Unit: milliseconds
expr min lq mean median uq max neval
mice(data, where = where, print = FALSE) 106890 107200 107352 107510 107583 107656 3
mice(data, where = where, print = FALSE, dev = TRUE) 713 743 765 774 792 809 3
mice(data, where = where, method = "norm", print = FALSE) 489 493 500 497 505 513 3
Thanks @Polkas for bringing these large gains to my attention. If you have any suggestions for further improvements, please let me know.
I expect that the new algorithm will have the same statistical properties as the old one, but we still need to look into this and confirm proper behaviour.
Great to see this improvement. This puts mice's PMM on another level. I will try to test it and provide some feedback shortly.
Awesome! Will do some testing.
I calculated part of Table 3.3 from FIMD for the new matchindex routine. The results are as follows:
| Method | d | Bias | % Bias | Coverage | CI Width | RMSE |
|---|---|---|---|---|---|---|
| Missing y, n = 50 | | | | | | |
| pmm | 1 | 0.016 | 5.4 | 0.884 | 0.252 | 0.071 |
| pmm-dev | 1 | 0.016 | 5.7 | 0.892 | 0.254 | 0.073 |
| pmm | 3 | 0.028 | 9.7 | 0.890 | 0.242 | 0.070 |
| pmm-dev | 3 | 0.026 | 8.9 | 0.894 | 0.244 | 0.071 |
| pmm | 5 | 0.039 | 13.6 | 0.876 | 0.241 | 0.075 |
| pmm-dev | 5 | 0.037 | 12.8 | 0.884 | 0.243 | 0.074 |
| pmm | 10 | 0.065 | 22.4 | 0.806 | 0.245 | 0.089 |
| pmm-dev | 10 | 0.063 | 21.6 | 0.807 | 0.243 | 0.087 |
| Missing x | | | | | | |
| pmm | 1 | -0.002 | 0.8 | 0.916 | 0.223 | 0.063 |
| pmm-dev | 1 | -0.001 | 0.3 | 0.916 | 0.224 | 0.064 |
| pmm | 3 | 0.002 | 0.9 | 0.931 | 0.228 | 0.061 |
| pmm-dev | 3 | 0.004 | 1.3 | 0.930 | 0.224 | 0.063 |
| pmm | 5 | 0.008 | 2.8 | 0.938 | 0.237 | 0.062 |
| pmm-dev | 5 | 0.010 | 3.4 | 0.941 | 0.238 | 0.064 |
| pmm | 10 | 0.028 | 9.6 | 0.946 | 0.261 | 0.067 |
| pmm-dev | 10 | 0.028 | 9.6 | 0.935 | 0.254 | 0.068 |
| Listwise deletion | | 0.000 | 0.0 | 0.946 | 0.251 | 0.063 |
| Missing y, n = 1000 | | | | | | |
| pmm | 1 | 0.001 | 0.2 | 0.929 | 0.056 | 0.014 |
| pmm-dev | 1 | 0.001 | 0.3 | 0.960 | 0.068 | 0.014 |
| pmm | 3 | 0.001 | 0.4 | 0.950 | 0.056 | 0.013 |
| pmm-dev | 3 | 0.001 | 0.4 | 0.950 | 0.056 | 0.013 |
| pmm | 5 | 0.002 | 0.6 | 0.951 | 0.055 | 0.013 |
| pmm-dev | 5 | 0.002 | 0.7 | 0.944 | 0.056 | 0.013 |
| pmm | 10 | 0.003 | 1.2 | 0.932 | 0.054 | 0.013 |
| pmm-dev | 10 | 0.003 | 1.2 | 0.937 | 0.055 | 0.013 |
| Missing x | | | | | | |
| pmm | 1 | 0.000 | 0.2 | 0.926 | 0.041 | 0.011 |
| pmm-dev | 1 | -0.002 | 0.6 | 0.988 | 0.085 | 0.013 |
| pmm | 3 | 0.000 | 0.1 | 0.933 | 0.041 | 0.011 |
| pmm-dev | 3 | -0.001 | 0.2 | 0.961 | 0.056 | 0.012 |
| pmm | 5 | 0.000 | 0.1 | 0.937 | 0.042 | 0.011 |
| pmm-dev | 5 | -0.000 | 0.1 | 0.952 | 0.049 | 0.012 |
| pmm | 10 | 0.000 | 0.1 | 0.928 | 0.042 | 0.011 |
| pmm-dev | 10 | -0.000 | 0.0 | 0.935 | 0.044 | 0.011 |
| Listwise deletion | | 0.000 | 0.1 | 0.955 | 0.050 | 0.012 |
For small samples, pmm (the current default) and pmm-dev (the new method) perform quite similarly. For large samples, pmm-dev seems to be even a little better calibrated than pmm, mostly erring on the conservative side. Thus, apart from a boost in speed, these results suggest that pmm-dev is also superior in terms of statistical quality.
Commit 4df1f3d incorporates the pmm speedup in mice 3.11.8.
I am the creator of the miceFast package. Currently I am developing a faster PMM solution, more precisely a faster version of your Rcpp matcher function. Here you can find test code; it is even 1000 times faster or more:
https://github.com/Polkas/miceFast/issues/10
As your package is much more popular than mine and this function is very important in your project, it is crucial to take a potential improvement into account.