Hi Thomas,
thanks for taking the time and digging into the issue :)
However, I took a brief look at the diff (https://github.com/const-ae/proDA/pull/8/files) and it is very hard to tell what has actually changed, because the indentation seems off. This makes it basically impossible to review the PR, because I can't tell which changes are the relevant ones. Would it be possible to make a new PR where you only change the relevant lines?
Best, Constantin
Hi again,
sorry for the confusion with the coding style - hope it's more readable now. I mainly changed the functions so that reused objects are not recalculated; with this we get slightly better performance. My idea is that when pd_lm.fit is run more than 1000 times this adds up to a noticeable speed-up; otherwise the proposed changes are negligible.
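To illustrate the kind of change, here is a minimal example of the pattern (taken from grad_fnc; not the full diff): a transpose that is used several times is computed once and cached.

# before: t(Xo) is recomputed for every use
dbeta_o <- -(t(Xo) %*% Xo %*% beta - t(Xo) %*% yo) / sigma2
# after: compute the transpose once and reuse it
Xo_t <- t(Xo)
dbeta_o <- -(Xo_t %*% Xo %*% beta - Xo_t %*% yo) / sigma2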
benchmark for grad_fnc
microbenchmark(
fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance),
fct2 = grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance)
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## fct1 34.6 36.2 142.750 36.7 37.75 10505.2 100 a
## fct2 34.4 35.4 36.992 36.2 37.20 70.6 100 a
benchmark for dt.scaled
microbenchmark(
sc1 = proDA:::dt.scaled(x, df),
sc2 = dt.scaled(x, df))
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sc1 4.7 4.8 5.661 4.85 5.1 40.5 100 a
## sc2 4.4 4.5 4.788 4.60 4.8 11.4 100 a
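For context, dt.scaled() is the location/scale generalisation of dt(); conceptually it is the one-liner below, though proDA's internal implementation may differ.

# density of a shifted/scaled t distribution (sketch, not proDA's code)
dt_scaled_sketch <- function(x, df, mean = 0, sd = 1) {
  stats::dt((x - mean) / sd, df) / sd
}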
benchmark for objective_fnc
microbenchmark(
obj1 = proDA:::objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance),
obj2 = objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance)
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## obj1 11.8 12.55 97.466 13.90 16.95 8189.4 100 a
## obj2 11.7 12.20 15.389 12.75 14.75 41.1 100 a
benchmark for hess_fnc
(note: here the version already implemented in proDA is actually faster; why?)
microbenchmark(
hess2 = hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance, beta_sel, p),
hess1 = proDA:::hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance, beta_sel, p),
times = 10000
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## hess2 87.4 104.1 129.0845 110.6 138.2 16117.3 10000 b
## hess1 83.0 97.5 119.2408 103.5 127.4 4272.9 10000 a
microbenchmark for pd_lm.fit
(this started out with a difference of +0.07, but now it is faster; I didn't use proDA:::pd_lm.fit directly here but compared against a local copy created via pd_lm.fit <- proDA:::pd_lm.fit; strangely, these two give slightly different times, see the note after the timings below)
microbenchmark(
fit1 = pd_lm.fit_old(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE),
fit2 = pd_lm.fit_new(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE)
)
##Unit: milliseconds
## expr min lq mean median uq max neval cld
## fit1 1.3728 1.58320 1.815087 1.68075 1.82435 4.2162 100 a
## fit2 1.3714 1.56735 1.801352 1.67975 1.90655 5.1906 100 a
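A possible explanation for that small difference, stated as an assumption: proDA:::pd_lm.fit re-runs the ::: namespace lookup on every call, whereas a local copy binds the function once. The lookup cost itself can be measured in isolation:

microbenchmark(
  lookup = proDA:::pd_lm.fit,  # evaluates the `:::` lookup each iteration
  local  = pd_lm.fit_new       # pre-bound local copy, no lookup
)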
This is great. Thanks, I will give more detailed feedback on the code later.
I compiled a quick overview table of the performance improvements:
function | Median (old / new) | Mean (old / new)
---|---|---
objective_fnc() | 13.9 / 12.8 µs | 97.5 / 15.4 µs
grad_fnc() | 36.7 / 36.2 µs | 142.8 / 37.0 µs
hess_fnc() | 103.5 / 110.6 µs | 119.2 / 129.1 µs
pd_lm.fit() | 1.68 / 1.68 ms | 1.81 / 1.80 ms
dt.scaled() | 4.9 / 4.6 µs | 5.6 / 4.8 µs
1) Do you know where the big difference between the median and mean timings for the old code comes from? Could it be related to garbage collection? Do you see the same pattern with the bench::mark() function?
2) How come the performance improvements for objective_fnc and grad_fnc do not translate into a faster pd_lm.fit()? Are those actually the bottlenecks for execution speed if you look at the flamegraph with profvis?
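For question 2, a minimal profvis sketch (the toy loop merely stands in for the actual proDA fit one would profile):

library(profvis)
profvis({
  # replace this toy body with the proDA fit of interest
  for (i in 1:200) lm(rnorm(50) ~ rnorm(50))
})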
I don't fully understand the output of bench::mark, but I posted it underneath. It seems there is also something going on with memory allocation (if I understand correctly), but here I don't know if this is due to assigning objects within the functions or some other effects. If I understand correctly the output of bench::mark, all new functions (including hess_fnc this time) are faster.
Please see the results here:
bench::mark for objective_fnc
bench::mark(
obj1 = proDA:::objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance),
obj2 = objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance))
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 obj1 10.9us 13.9us 59740. 0B 5.97 9999 1 167ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~
2 obj2 10.6us 12.6us 70563. 78.6KB 7.06 9999 1 142ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~
bench::mark for grad_fnc
bench::mark(
fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance),
fct2 = grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance)
)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 fct1 21.3us 25.9us 34504. 0B 3.45 9999 1 290ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~
2 fct2 18us 21.3us 41541. 350KB 4.15 9999 1 241ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~
bench::mark for hess_fnc
bench::mark(
hess1 = proDA:::hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance, beta_sel, p),
hess2 = hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance, beta_sel, p)
)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 hess1 31.9us 38.6us 24452. 0B 2.45 9999 1 409ms <dbl[,3] [~ <Rprofmem[,3]~ <bch:tm [1~ <tibble [10,~
2 hess2 29.5us 33.2us 28004. 0B 2.80 9999 1 357ms <dbl[,3] [~ <Rprofmem[,3]~ <bch:tm [1~ <tibble [10,~
bench::mark for pd_lm.fit
bench::mark(
fit1 = pd_lm.fit_old(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE),
fit2 = pd_lm.fit(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE)
)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 fit1 707us 838us 1134. 0B 4.99 455 2 401ms <named lis~ <Rprofmem[,3] [0~ <bch:tm [~ <tibble [45~
2 fit2 691us 789us 1168. 1.53MB 2.06 568 1 486ms <named lis~ <Rprofmem[,3] [3~ <bch:tm [~ <tibble [56~
bench::mark for dt.scaled
bench::mark(
sc1 = proDA:::dt.scaled(x, df),
sc2 = dt.scaled(x, df))
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sc1 3us 3.8us 246649. 0B 0 10000 0 40.5ms <dbl [11~ <Rprofmem[,3] ~ <bch:tm [10~ <tibble [10,0~
2 sc2 2.6us 3.2us 266230. 0B 0 10000 0 37.6ms <dbl [11~ <Rprofmem[,3] ~ <bch:tm [10~ <tibble [10,0~
There is actually some time spent on these functions (see here for the profvis output: https://github.com/tnaake/proDA_tmp/blob/main/profile.html). It seems, however, that most of the time is spent on cmp, h and the like. I didn't look into those functions, but I assume that they are optimized already.
Thank you so much for the detailed report :)
If I understand correctly the output of bench::mark, all new functions (including hess_fnc this time) are faster.
Yes, they really are faster, which is great :)
It seems there is also something going on with memory allocation (if I understand correctly), but here I don't know if this is due to assigning objects within the functions or some other effects.
The memory allocation is interesting and I would assume it is due to the explicit creation of objects like here and here, but it's a bit difficult to tell due to the byte code optimization that R applies (see below).
There is actually some time spent on these functions (see here for the profvis output: https://github.com/tnaake/proDA_tmp/blob/main/profile.html). It seems, however, that most of the time is spent on cmp, h and the like. I didn't look into those functions, but I assume that they are optimized already.
This is very useful. Thanks for posting.
The profvis output is quite illuminating, because for some reason R spends 50% of the time compiling functions before actually using them. There is a good explanation of compiler::tryCmpfun in the profvis FAQ (https://rstudio.github.io/profvis/faq.html#what-does-cmpfun-mean).
That probably also explains why the mean estimates from microbenchmark are biased and why it is more helpful to look at the median.
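A minimal sketch of how that bias could be ruled out, assuming grad_fnc is defined in the session: byte-compile the function up front with compiler::cmpfun() and benchmark the compiled copy.

library(compiler)
grad_fnc_cmp <- cmpfun(grad_fnc)  # force byte-code compilation once, up front
# benchmarking grad_fnc_cmp then measures only the compiled function,
# without the one-off compilation cost inflating the first iterations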
For me this looks quite convincing now and the code is also in a much better state (thanks again for cleaning that up). I will go through the code in detail tomorrow and then will merge the PR.
Great! I am, however, not sure whether hess_fnc should be included in its current state, as it showed a longer running time with microbenchmark::microbenchmark (but a shorter running time with bench::mark).
Another idea of mine was to parallelize pd_lm.fit within the fit_parameters_loop function. Each pd_lm.fit run is quite fast, but since this function is run thousands of times for the different features (plus the several convergence rounds), the times add up when run sequentially. I have not yet found a solution, since the overhead costs of parallelization seem to outweigh the gains. Do you know of a good way to tackle this, or have any ideas? I also tried chunking (see the sketch below), but the same effects seem to apply, resulting in longer running times.
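A hedged sketch of the chunking idea (fit_one and feature_list are placeholder names, not proDA code; note that parallel::mclapply() forks and therefore does not run in parallel on Windows):

chunks <- split(feature_list, cut(seq_along(feature_list), breaks = 4))
res <- parallel::mclapply(chunks, function(chunk) {
  lapply(chunk, fit_one)  # several fits per worker call amortise the start-up cost
}, mc.cores = 4)
res <- unlist(res, recursive = FALSE)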
Hi again,
I found another way of speeding up the *_fnc functions that use matrix multiplication: what about using the functions crossprod and tcrossprod when you multiply (transposed) matrices? Check out, for example, the function grad_fnc.
The first version uses the %*% operator for the matrix calculations:
grad_fnc_wo_crossprod <- function (y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance) {
imr <- inv_mills_ratio(Xm %*% beta, rho, zetastar)
if (moderate_location) {
Xbmu <- X %*% beta - mu0
dbeta_p <- -(location_prior_df + 1) * t(X) %*% (Xbmu / (location_prior_df * sigma20 + Xbmu^2))
} else {
dbeta_p <- 0
}
Xo_t <- t(Xo)
dbeta_o <- -(Xo_t %*% Xo %*% beta - Xo_t %*% yo) / sigma2
dbeta_m <- t(Xm) %*% imr
if (moderate_variance) {
dsig2_p <- -(1 + df0/2)/sigma2 + df0 * tau20/(2 * sigma2^2) +
1/sigma2
} else {
dsig2_p <- 0
}
dsig2_o <- sum(((Xo %*% beta - yo)^2 - sigma2)/(2 * sigma2^2))
dsig2_m <- -sum((Xm %*% beta - rho)/(2 * zetastar^2) * imr)
c(dbeta_p + dbeta_o + dbeta_m, dsig2_p + dsig2_o + dsig2_m)
}
This version uses the crossprod and tcrossprod functions (the lines with the comments are the changed ones):
grad_fnc <- function (y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0,
sigma20, df0, tau20, location_prior_df, moderate_location,
moderate_variance) {
imr <- inv_mills_ratio(Xm %*% beta, rho, zetastar)
if (moderate_location) {
Xbmu <- X %*% beta - mu0
dbeta_p <- -(location_prior_df + 1) * crossprod(X, Xbmu / (location_prior_df * sigma20 + Xbmu^2))## t(X) %*% (Xbmu / (location_prior_df * sigma20 + Xbmu^2))
} else {
dbeta_p <- 0
}
#Xo_t <- t(Xo)
dbeta_o <- -(crossprod(Xo, Xo) %*% beta - crossprod(Xo, yo)) / sigma2 ## -(Xo_t %*% Xo %*% beta - Xo_t %*% yo) / sigma2
dbeta_m <- crossprod(Xm, imr) ## t(Xm) %*% imr
if (moderate_variance) {
dsig2_p <- -(1 + df0/2)/sigma2 + df0 * tau20/(2 * sigma2^2) +
1/sigma2
} else {
dsig2_p <- 0
}
dsig2_o <- sum(((Xo %*% beta - yo)^2 - sigma2)/(2 * sigma2^2))
dsig2_m <- -sum((Xm %*% beta - rho)/(2 * zetastar^2) * imr)
c(dbeta_p + dbeta_o + dbeta_m, dsig2_p + dsig2_o + dsig2_m)
}
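For reference, crossprod(X, Y) computes t(X) %*% Y and tcrossprod(X, Y) computes X %*% t(Y) without forming the transpose explicitly; a quick toy check:

X <- matrix(rnorm(20), nrow = 5)
y <- rnorm(5)
all.equal(crossprod(X, y), t(X) %*% y)  # TRUE
all.equal(tcrossprod(X), X %*% t(X))    # TRUE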
Benchmarking with bench::mark and microbenchmark:
bench::mark(
fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance),
fct2 = grad_fnc_wo_crossprod(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance),
fct3 = grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance)
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 fct1 20.3us 25.5us 35843. 0B 3.58 9999 1 279ms <dbl [~ <Rprofme~ <bch:t~ <tibble~
2 fct2 17.6us 20.8us 45802. 0B 0 10000 0 218ms <dbl [~ <Rprofme~ <bch:t~ <tibble~
3 fct3 15.5us 19.5us 48733. 0B 4.87 9999 1 205ms <dbl [~ <Rprofme~ <bch:t~ <tibble~
microbenchmark(
fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance),
fct2 = grad_fnc_wo_crossprod(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance),
fct3 = grad_fnc(y, yo, X, Xm, Xo,
beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
location_prior_df, moderate_location, moderate_variance)
)
Unit: microseconds
expr min lq mean median uq max neval cld
fct1 21.6 22.40 23.427 22.9 23.7 39.6 100 a
fct2 19.4 19.95 20.837 20.3 20.9 38.2 100 a
fct3 17.1 17.95 258.250 18.3 19.0 23949.5 100 a
I didn't push this yet since I wanted to discuss it with you first. Arguably, the readability decreases slightly, but we could for instance keep the old %*% notation as a comment.
I found another way of speeding up the *_fnc functions that use matrix multiplication: what about using the functions crossprod and tcrossprod when you multiply (transposed) matrices. [...] I didn't push this yet since I wanted to discuss it with you first. Arguably, the readability decreases slightly, but we could for instance keep the old %*% notation as a comment.
Yes, good call. Indeed tcrossprod() is faster than %*%, but I would not include this, because I find the function very difficult to read and the speed benefit (~10% faster) not large enough to be worth it.
Another idea of me was to parallelize pd_lm.fit within the fit_parameters_loop function. [...] Do you know of any good idea to tackle this or have any ideas on this? I also tried chunking but it seems the same effects apply here, resulting in longer running times.
I assume you mean replacing, for example, the lapply() in https://github.com/const-ae/proDA/blob/master/R/proDA.R#L369 with an mclapply() or bplapply()?
I think this could be quite powerful, but I am no expert on this either. I know that Bioconductor favors the use of the BiocParallel package, which comes with extensive documentation. One important thing to keep in mind is that the performance and start-up cost of parallelized code can differ significantly between operating systems (e.g. https://github.com/Bioconductor/BiocParallel/issues/98). Introducing this would therefore need testing across Mac, Linux and Windows.
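A hedged sketch of what such a drop-in replacement could look like (the toy function below stands in for the per-protein pd_lm.fit() call; SnowParam() spawns worker processes and also runs on Windows, while MulticoreParam() forks and is restricted to Mac/Linux):

library(BiocParallel)
param <- SnowParam(workers = 4)            # portable backend, works on Windows
res <- bplapply(1:8, function(i) sqrt(i),  # bplapply() mirrors lapply()
                BPPARAM = param)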
I have now run the unit tests for pd_lm and the functions therein. Everything passes without complaints. I had to add the c() in grad_fnc and added another c() around some expression, even though it should be fine without; the first one was needed, so thanks for spotting it!
Could you have a look at the changes? I guess the parallelization could go into another PR (I think finding an efficient way to parallelize on Windows will be challenging, but I will take another dive into this).
Sorry for the delay. I just looked at the code and it all seems fine. But I get a WARNING when I run the unit tests that wasn't there previously. Specifically, this line (https://github.com/const-ae/proDA/blob/master/tests/testthat/test-test_diff.R#L120) generates a warning in the pd_lm.fit() function. Could you investigate whether your changes might have altered the fitting procedure, or where the warning is coming from?
Hi @const-ae,
when I run the mentioned lines I receive the following output (the message) for the line (https://github.com/const-ae/proDA/blob/master/tests/testthat/test-test_diff.R#L120):
F test: comparing the reduced model with the original model.
Setting verbose in test_diff to FALSE does not print the message.
Is this the 'warning' you mentioned? Or could you post here the warning that is generated under your OS/system?
Hm, that's weird. I get the following:
✓ | 44 1 | test-test_diff [21.9 s]
────────────────────────────────────────────────────────────────
Warning (test-test_diff.R:119:3): F test works with missing data
NA/NaN function evaluation
Backtrace:
1. proDA::proDA(se, ~cond + num, moderate_location = TRUE, moderate_variance = TRUE) test-test_diff.R:119:2
2. proDA::fit_parameters_loop(...) /Users/ahlmanne/prog/r_packages/proDA/R/proDA.R:248:2
3. base::lapply(...) /Users/ahlmanne/prog/r_packages/proDA/R/proDA.R:369:4
4. proDA:::FUN(X[[i]], ...)
5. proDA::pd_lm.fit(...) /Users/ahlmanne/prog/r_packages/proDA/R/proDA.R:370:6
6. stats::nlminb(...) /Users/ahlmanne/prog/r_packages/proDA/R/pd_lm.R:287:4
It is possible that those errors, especially the ones where random NAs suddenly occur, happen only on some operating systems, but of course that makes them much harder to debug. Often the same problems occur on Mac and Linux, but as you are on Windows, I will look into the problem.
I have debugged the problem: it comes from nlminb() calling the function with sigma2 = 0 (here). The reason the bug didn't occur for you is that it only appears sometimes (with seed(1) there is no warning, with seed(2) the warning is triggered). However, the bug didn't just affect your code; it was also present in my original code. The only difference was the seed that triggered the issue.
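For illustration, one common way to keep nlminb() away from sigma2 = 0 is a lower bound on the variance parameter (a toy sketch with a made-up likelihood, not necessarily how proDA handles it):

# toy negative log-likelihood where sigma2 = 0 would produce NaN (log(0))
neg_ll <- function(par) {
  mu <- par[1]; sigma2 <- par[2]
  x <- c(1.2, 0.8, 1.5)
  0.5 * sum(log(sigma2) + (x - mu)^2 / sigma2)
}
nlminb(c(mu = 0, sigma2 = 1), neg_ll, lower = c(-Inf, 1e-8))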
I have now merged your code :) Thanks again a lot for your help :)
Hi @const-ae,
I had a look at the pd_lm.fit function, since it seems this causes the speed limitations (it is called many times in the fit_parameters_loop function, especially in the lapply). By itself, the function is very fast and I don't think we can speed it up much more (especially the optimization functions, which are already written in C).
I found some minor issues that make it slightly faster; maybe in sum it will then run a bit faster (at least my microbenchmark checks showed some improvement in computation time).
I was also thinking of parallelizing the lapply that uses pd_lm.fit, but so far it seems that setting up the cluster takes more time (since every fit by pd_lm.fit is quite fast); maybe chunking will help, but I will have a look at this.