const-ae / proDA

Protein Differential Abundance for Label-Free Mass Spectrometry https://const-ae.github.io/proDA/

speed up lm_pd.fit #8

Closed tnaake closed 3 years ago

tnaake commented 3 years ago

Hi @const-ae

I had a look at the pd_lm.fit function, since it seems to cause the speed limitations (it is called many times in the function fit_parameters_loop, especially inside lapply). By itself, the function is very fast, and I don't think we can speed it up much further (especially the optimization routines, which are already written in C).

I found some minor issues that make it slightly faster; in sum this may add up to a noticeably shorter run time (at least my microbenchmark checks showed some improvement in computation time).

I was also thinking of parallelizing the lapply that calls pd_lm.fit, but so far it seems that setting up the cluster takes more time than it saves (since every single pd_lm.fit call is quite fast). Maybe chunking will help; I will look into this.
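To illustrate the overhead problem with a toy stand-in (fit_one is hypothetical; each real pd_lm.fit call is similarly cheap on its own):

```r
library(parallel)

# Hypothetical stand-in for one pd_lm.fit call: individually very cheap.
fit_one <- function(i) sum(rnorm(50))

xs <- seq_len(1000)

# Sequential: no setup cost.
t_seq <- system.time(res_seq <- lapply(xs, fit_one))["elapsed"]

# Parallel: makeCluster() alone can cost more than all 1000 cheap fits
# combined, so the parallel version can come out slower overall.
t_par <- system.time({
    cl <- makeCluster(2)
    res_par <- parLapply(cl, xs, fit_one)
    stopCluster(cl)
})["elapsed"]
```

On a typical machine t_par exceeds t_seq here, which is exactly the effect described above.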

const-ae commented 3 years ago

Hi Thomas,

thanks for taking the time and digging into the issue :)

However, I took a brief look at the diff (https://github.com/const-ae/proDA/pull/8/files), and it is very hard to tell what has actually changed, because the indentation seems off. This makes it basically impossible to review the PR, because I can't tell which changes are relevant. Would it be possible to make a new PR where you only change the relevant lines?

Best, Constantin

tnaake commented 3 years ago

Hi again,

sorry for the confusion with the coding style; I hope it's more readable now. I mainly changed the functions so that reused objects are not recalculated, which gives slightly better performance. My idea is that when pd_lm.fit is run more than 1000 times we get a noticeable speed-up; otherwise the proposed changes are negligible.
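A minimal illustration of the kind of change (toy matrices; the real case is, e.g., the repeated t(Xo) inside grad_fnc):

```r
set.seed(1)
Xo <- matrix(rnorm(40), nrow = 10, ncol = 4)
yo <- rnorm(10)
beta <- rnorm(4)

# Before: t(Xo) is computed twice in the same expression.
v_old <- -(t(Xo) %*% Xo %*% beta - t(Xo) %*% yo)

# After: transpose once, reuse the result.
Xo_t <- t(Xo)
v_new <- -(Xo_t %*% Xo %*% beta - Xo_t %*% yo)

stopifnot(all.equal(v_old, v_new))  # identical results, one transpose saved
```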

benchmark for grad_fnc

microbenchmark(
    fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
                 beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                 location_prior_df, moderate_location, moderate_variance),
    fct2 = grad_fnc(y, yo, X, Xm, Xo,
          beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
          location_prior_df, moderate_location, moderate_variance)
)

## Unit: microseconds
## expr  min   lq    mean median    uq     max neval cld
## fct1 34.6 36.2 142.750   36.7 37.75 10505.2   100   a
## fct2 34.4 35.4  36.992   36.2 37.20    70.6   100   a

benchmark for dt.scaled

microbenchmark(
    sc1 = proDA:::dt.scaled(x, df), 
    sc2 = dt.scaled(x, df))

## Unit: microseconds
## expr min  lq  mean median  uq  max neval cld
## sc1 4.7 4.8 5.661   4.85 5.1 40.5   100   a
## sc2 4.4 4.5 4.788   4.60 4.8 11.4   100   a

benchmark for objective_fnc

microbenchmark(
    obj1 = proDA:::objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                             sigma20, df0, tau20, location_prior_df, moderate_location, 
                             moderate_variance),
    obj2 = objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                     sigma20, df0, tau20, location_prior_df, moderate_location, 
                     moderate_variance)
)

## Unit: microseconds
## expr  min    lq   mean median    uq    max neval cld
## obj1 11.8 12.55 97.466  13.90 16.95 8189.4   100   a
## obj2 11.7 12.20 15.389  12.75 14.75   41.1   100   a
tnaake commented 3 years ago

benchmark for hess_fnc (here the currently implemented version is actually faster; why?)

microbenchmark(
    hess2 = hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                 sigma20, df0, tau20, location_prior_df, moderate_location, 
                 moderate_variance, beta_sel, p),
    hess1 = proDA:::hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                 sigma20, df0, tau20, location_prior_df, moderate_location, 
                 moderate_variance, beta_sel, p),
     times = 10000
)

## Unit: microseconds
## expr  min    lq     mean median    uq     max neval cld
## hess2 87.4 104.1 129.0845  110.6 138.2 16117.3 10000   b
## hess1 83.0  97.5 119.2408  103.5 127.4  4272.9 10000  a 

microbenchmark for pd_lm.fit (it started out with a difference of +0.07; now it is faster. I didn't use proDA:::pd_lm.fit here but compared against a local copy, pd_lm.fit <- proDA:::pd_lm.fit; strangely, these give slightly different times)

microbenchmark(
    fit1 = pd_lm.fit_old(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
        variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE),
    fit2 = pd_lm.fit_new(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
                     variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE)
)

## Unit: milliseconds
## expr    min      lq     mean  median      uq    max neval cld
## fit1 1.3728 1.58320 1.815087 1.68075 1.82435 4.2162   100   a
## fit2 1.3714 1.56735 1.801352 1.67975 1.90655 5.1906   100   a
const-ae commented 3 years ago

This is great. Thanks, I will give more detailed feedback on the code later.

I compiled a quick overview table of the performance improvements:

function          median (old / new)    mean (old / new)
objective_fnc()   13.9 / 12.8 µs        97.5 / 15.4 µs
grad_fnc()        36.7 / 36.2 µs        142.8 / 37.0 µs
hess_fnc()        103.5 / 110.6 µs      119.2 / 129.1 µs
pd_lm.fit()       1.68 / 1.68 ms        1.81 / 1.80 ms
dt.scaled()       4.9 / 4.6 µs          5.7 / 4.8 µs

1) Do you know where the big difference between the median and mean timings for the old code comes from? Could it be related to garbage collection? Do you see the same pattern with the bench::mark() function?

2) How come the performance improvements for objective_fnc and grad_fnc do not translate into a faster pd_lm.fit()? Are those functions actually the bottlenecks for execution speed, if you look at the flame graph with profvis?

tnaake commented 3 years ago
  1. No clear idea. It seems there are some test runs that are quite distorted, which heavily inflate the mean values. I don't know whether this is due to gc. Also, I cannot fully interpret the output of bench::mark, but I posted it below. It seems there is also something going on with memory allocation (if I understand correctly), but I don't know whether this is due to assigning objects within the functions or to some other effect.

If I understand the output of bench::mark correctly, all new functions (including hess_fnc this time) are faster.

Please see the results here:

bench::mark for objective_fnc

bench::mark(
    obj1 = proDA:::objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                             sigma20, df0, tau20, location_prior_df, moderate_location, 
                             moderate_variance),
    obj2 = objective_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                     sigma20, df0, tau20, location_prior_df, moderate_location, 
                     moderate_variance))

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result  memory           time         gc            
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>  <list>           <list>       <list>        
1 obj1         10.9us   13.9us    59740.        0B     5.97  9999     1      167ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~
2 obj2         10.6us   12.6us    70563.    78.6KB     7.06  9999     1      142ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~

bench::mark for grad_fnc

bench::mark(
fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
                        beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                        location_prior_df, moderate_location, moderate_variance),
fct2 = grad_fnc(y, yo, X, Xm, Xo,
                beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                location_prior_df, moderate_location, moderate_variance)
)

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result  memory           time         gc            
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>  <list>           <list>       <list>        
1 fct1         21.3us   25.9us    34504.        0B     3.45  9999     1      290ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~
2 fct2           18us   21.3us    41541.     350KB     4.15  9999     1      241ms <dbl [~ <Rprofmem[,3] [~ <bch:tm [10~ <tibble [10,0~

bench::mark for hess_fnc

bench::mark(
    hess1 = proDA:::hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                         sigma20, df0, tau20, location_prior_df, moderate_location, 
                         moderate_variance, beta_sel, p),
    hess2 = hess_fnc(y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                         sigma20, df0, tau20, location_prior_df, moderate_location, 
                        moderate_variance, beta_sel, p)
)

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result      memory         time        gc           
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>      <list>         <list>      <list>       
1 hess1        31.9us   38.6us    24452.        0B     2.45  9999     1      409ms <dbl[,3] [~ <Rprofmem[,3]~ <bch:tm [1~ <tibble [10,~
2 hess2        29.5us   33.2us    28004.        0B     2.80  9999     1      357ms <dbl[,3] [~ <Rprofmem[,3]~ <bch:tm [1~ <tibble [10,~

bench::mark for pd_lm.fit

bench::mark(
     fit1 = pd_lm.fit_old(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
                       variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE),
     fit2 = pd_lm.fit(y, X, dropout_curve_position, dropout_curve_scale, location_prior_mean, location_prior_scale,
                     variance_prior_scale, variance_prior_df, location_prior_df, method = "analytic_hessian", verbose = FALSE)
)
# A tibble: 2 x 13
  expression     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result      memory            time       gc          
  <bch:expr> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>      <list>            <list>     <list>      
1 fit1         707us    838us     1134.        0B     4.99   455     2      401ms <named lis~ <Rprofmem[,3] [0~ <bch:tm [~ <tibble [45~
2 fit2         691us    789us     1168.    1.53MB     2.06   568     1      486ms <named lis~ <Rprofmem[,3] [3~ <bch:tm [~ <tibble [56~

bench::mark for dt.scaled

bench::mark(
    sc1 = proDA:::dt.scaled(x, df), 
    sc2 = dt.scaled(x, df))

# A tibble: 2 x 13
  expression     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory          time         gc            
  <bch:expr> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>          <list>       <list>        
1 sc1            3us    3.8us   246649.        0B        0 10000     0     40.5ms <dbl [11~ <Rprofmem[,3] ~ <bch:tm [10~ <tibble [10,0~
2 sc2          2.6us    3.2us   266230.        0B        0 10000     0     37.6ms <dbl [11~ <Rprofmem[,3] ~ <bch:tm [10~ <tibble [10,0~
  2. Yes and no. There is actually some time spent in these functions (see the profvis output here: https://github.com/tnaake/proDA_tmp/blob/main/profile.html). It seems, however, that most of the time is spent in cmp, h, and the like. I didn't look into those functions, but I assume they are already optimized.
const-ae commented 3 years ago

Thank you so much for the detailed report :)

If I understand the output of bench::mark correctly, all new functions (including hess_fnc this time) are faster.

Yes, they really are faster, which is great :)

It seems there is also something going on with memory allocation (if I understand correctly), but I don't know whether this is due to assigning objects within the functions or to some other effect.

The memory allocation is interesting, and I would assume it is due to the explicit creation of objects like here and here, but it is a bit difficult to tell because of the byte-code optimization that R applies (see below).

There is actually some time spent in these functions (see the profvis output here: https://github.com/tnaake/proDA_tmp/blob/main/profile.html). It seems, however, that most of the time is spent in cmp, h, and the like. I didn't look into those functions, but I assume they are already optimized.

This is very useful. Thanks for posting. The profvis output is quite illuminating, because for some reason R spends 50% of the time compiling functions before actually using them. There is a good explanation in the FAQ of profvis on compiler::tryCmpfun (https://rstudio.github.io/profvis/faq.html#what-does-cmpfun-mean). That probably also explains why the mean estimates from the microbenchmark are biased and it is more helpful to look at the median.
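If one wants to take that compilation cost out of a measurement, one option (a sketch, not something proDA necessarily needs to adopt) is to byte-compile the function explicitly before benchmarking:

```r
library(compiler)

f <- function(x) {
    s <- 0
    for (i in seq_along(x)) s <- s + x[i]
    s
}

# Byte-compile up front, so the first benchmarked call does not pay the
# lazy compiler::tryCmpfun cost that shows up in the profvis output.
f_cmp <- cmpfun(f)

f_cmp(1:100)  # 5050, identical to f(1:100)
```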

For me this looks quite convincing now and the code is also in a much better state (thanks again for cleaning that up). I will go through the code in detail tomorrow and then will merge the PR.

tnaake commented 3 years ago

Great! I am, however, not sure whether the function hess_fnc should be included in its current state, as it showed a longer running time with microbenchmark::microbenchmark (but a shorter running time with bench::mark).

Another idea of mine was to parallelize pd_lm.fit within the fit_parameters_loop function. Each pd_lm.fit run is quite fast, but since it is run thousands of times for the different features (plus the several convergence rounds), the times add up when run sequentially. I have not yet found a solution, since the overhead costs of parallelization seem to outweigh the gains substantially. Do you have any ideas on how to tackle this? I also tried chunking, but the same effect seems to apply, resulting in longer running times.
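One sketch of the chunking idea (fit_one is a hypothetical stand-in for a single fit; note that parLapply already splits its input into one chunk per node by default, so explicit chunking mainly serves to control the chunk size):

```r
library(parallel)

# Hypothetical stand-in for one pd_lm.fit call
fit_one <- function(i) sum(rnorm(50))

n <- 1000
# Group the 1000 features into 4 chunks of 250
chunks <- split(seq_len(n), cut(seq_len(n), 4, labels = FALSE))

cl <- makeCluster(2)
clusterExport(cl, "fit_one")
# One task per chunk instead of one per feature: the per-task
# communication overhead is paid 4 times instead of 1000 times.
res <- parLapply(cl, chunks, function(idx) lapply(idx, fit_one))
stopCluster(cl)

res <- unlist(res, recursive = FALSE)  # back to one result per feature
```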

tnaake commented 3 years ago

Hi again,

I found another way of speeding up the *_fnc functions that use matrix multiplication: what about using the functions crossprod and tcrossprod when multiplying (transposed) matrices? Check out, for example, the function grad_fnc:

The first version uses the %*% operator for matrix multiplication:

grad_fnc_wo_crossprod <- function (y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
                      sigma20, df0, tau20, location_prior_df, moderate_location, 
                      moderate_variance) {
    imr <- inv_mills_ratio(Xm %*% beta, rho, zetastar)

    if (moderate_location) {
        Xbmu <- X %*% beta - mu0
        dbeta_p <- -(location_prior_df + 1) *  t(X) %*% (Xbmu / (location_prior_df * sigma20 + Xbmu^2))
    } else {
        dbeta_p <- 0
    }

    Xo_t <- t(Xo)
    dbeta_o <- -(Xo_t %*% Xo %*% beta - Xo_t %*% yo) / sigma2
    dbeta_m <- t(Xm) %*% imr

    if (moderate_variance) {
        dsig2_p <- -(1 + df0/2)/sigma2 + df0 * tau20/(2 * sigma2^2) + 
            1/sigma2
    } else {
        dsig2_p <- 0
    }

    dsig2_o <- sum(((Xo %*% beta - yo)^2 - sigma2)/(2 * sigma2^2))
    dsig2_m <- -sum((Xm %*% beta - rho)/(2 * zetastar^2) * imr)
    c(dbeta_p + dbeta_o + dbeta_m, dsig2_p + dsig2_o + dsig2_m)
}

This version uses the crossprod and tcrossprod functions (the lines with comments are the ones that changed):

grad_fnc <- function (y, yo, X, Xm, Xo, beta, sigma2, rho, zetastar, mu0, 
    sigma20, df0, tau20, location_prior_df, moderate_location, 
    moderate_variance) {
    imr <- inv_mills_ratio(Xm %*% beta, rho, zetastar)

    if (moderate_location) {
        Xbmu <- X %*% beta - mu0
        dbeta_p <- -(location_prior_df + 1) * crossprod(X,  Xbmu / (location_prior_df * sigma20 + Xbmu^2))## t(X) %*% (Xbmu / (location_prior_df * sigma20 + Xbmu^2))
    } else {
        dbeta_p <- 0
    }

    ## Xo_t <- t(Xo)
    dbeta_o <- -(crossprod(Xo, Xo) %*% beta - crossprod(Xo, yo)) / sigma2 ## -(Xo_t %*% Xo %*% beta - Xo_t %*% yo) / sigma2
    dbeta_m <- crossprod(Xm, imr) ## t(Xm) %*% imr

    if (moderate_variance) {
        dsig2_p <- -(1 + df0/2)/sigma2 + df0 * tau20/(2 * sigma2^2) + 
            1/sigma2
    } else {
        dsig2_p <- 0
    }

    dsig2_o <- sum(((Xo %*% beta - yo)^2 - sigma2)/(2 * sigma2^2))
    dsig2_m <- -sum((Xm %*% beta - rho)/(2 * zetastar^2) * imr)
    c(dbeta_p + dbeta_o + dbeta_m, dsig2_p + dsig2_o + dsig2_m)
}
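A quick sanity check (toy matrices) that the two formulations agree numerically:

```r
set.seed(42)
X <- matrix(rnorm(30), nrow = 6)
Y <- matrix(rnorm(18), nrow = 6)

# crossprod(X, Y) computes t(X) %*% Y without forming t(X) explicitly;
# tcrossprod(X) computes X %*% t(X).
stopifnot(all.equal(crossprod(X, Y), t(X) %*% Y))
stopifnot(all.equal(tcrossprod(X), X %*% t(X)))
```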

Benchmarking with bench::mark and microbenchmark:

bench::mark(
    fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
                            beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                            location_prior_df, moderate_location, moderate_variance),
    fct2 = grad_fnc_wo_crossprod(y, yo, X, Xm, Xo,
                    beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                    location_prior_df, moderate_location, moderate_variance),
    fct3 = grad_fnc(y, yo, X, Xm, Xo,
                    beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                    location_prior_df, moderate_location, moderate_variance)
)
# A tibble: 3 x 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result  memory    time    gc      
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>  <list>    <list>  <list>  
1 fct1        20.3us  25.5us    35843.        0B     3.58  9999     1      279ms <dbl [~ <Rprofme~ <bch:t~ <tibble~
2 fct2        17.6us  20.8us    45802.        0B     0    10000     0      218ms <dbl [~ <Rprofme~ <bch:t~ <tibble~
3 fct3        15.5us  19.5us    48733.        0B     4.87  9999     1      205ms <dbl [~ <Rprofme~ <bch:t~ <tibble~

microbenchmark(
    fct1 = proDA:::grad_fnc(y, yo, X, Xm, Xo,
                     beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                     location_prior_df, moderate_location, moderate_variance),
    fct2 = grad_fnc_wo_crossprod(y, yo, X, Xm, Xo,
              beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
              location_prior_df, moderate_location, moderate_variance),
    fct3 = grad_fnc(y, yo, X, Xm, Xo,
                                 beta, sigma2, rho, zetastar, mu0, sigma20, df0, tau20,
                                 location_prior_df, moderate_location, moderate_variance)
)
Unit: microseconds
 expr  min    lq    mean median   uq     max neval cld
 fct1 21.6 22.40  23.427   22.9 23.7    39.6   100   a
 fct2 19.4 19.95  20.837   20.3 20.9    38.2   100   a
 fct3 17.1 17.95 258.250   18.3 19.0 23949.5   100   a

I didn't push this yet since I wanted to discuss it with you first. Arguably, the readability decreases slightly, but we could, for instance, keep the old %*% notation as a comment.

const-ae commented 3 years ago

I found another way of speeding up the *_fnc functions that use matrix multiplication: what about using the functions crossprod and tcrossprod when multiplying (transposed) matrices. [...] I didn't push this yet since I wanted to discuss it with you first. Arguably, the readability decreases slightly, but we could, for instance, keep the old %*% notation as a comment.

Yes, good call. Indeed, tcrossprod() is faster than %*%, but I would not include this, because I find the resulting function much more difficult to read and the speed benefit (~10% faster) not large enough to be worth it.

Another idea of mine was to parallelize pd_lm.fit within the fit_parameters_loop function. [...] Do you have any ideas on how to tackle this? I also tried chunking, but the same effect seems to apply, resulting in longer running times.

I assume you mean replacing for example the lapply() in https://github.com/const-ae/proDA/blob/master/R/proDA.R#L369 with for example a mclapply() or bplapply()?

I think this could be quite powerful, but I am no expert on this either. I know that Bioconductor favors the use of the BiocParallel package, which comes with extensive documentation. One important thing to keep in mind is that the performance and start-up cost of parallelized code can differ significantly between operating systems (e.g. https://github.com/Bioconductor/BiocParallel/issues/98). Introducing this would therefore need testing across Mac, Linux, and Windows.
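A sketch of what that could look like with BiocParallel (fit_one is a hypothetical stand-in for one pd_lm.fit call; MulticoreParam, SnowParam, and bplapply are the real BiocParallel API):

```r
library(BiocParallel)

# Hypothetical stand-in for one pd_lm.fit call
fit_one <- function(i) sum(seq_len(i))

# MulticoreParam (fork-based) is cheap to start on Linux/Mac but not
# available on Windows, where a SnowParam socket cluster is used instead.
param <- if (.Platform$OS.type == "windows") SnowParam(2) else MulticoreParam(2)

res <- bplapply(seq_len(100), fit_one, BPPARAM = param)
```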

tnaake commented 3 years ago

I have now run the unit tests for pd_lm and the functions therein. Everything passes without complaints. (I had to add the c() in grad_fnc, and I added another c() around an expression even though it should be fine without it; the first one was needed, so thanks for spotting it!)

Could you have a look at the changes? I guess the parallelization could go into another PR (I think finding an efficient way of parallelizing on Windows will be challenging, but I will have another dive into this).

const-ae commented 3 years ago

Sorry for the delay. I just looked at the code, and it all seems fine. However, I get a WARNING when I run the unit tests that wasn't there previously. Specifically, this line (https://github.com/const-ae/proDA/blob/master/tests/testthat/test-test_diff.R#L120) generates a warning in the pd_lm.fit() function. Could you investigate whether your changes might have altered the fitting procedure, or where the warning is coming from?

tnaake commented 3 years ago

Hi @const-ae

when I run the mentioned lines, I receive the following message for the line (https://github.com/const-ae/proDA/blob/master/tests/testthat/test-test_diff.R#L120): "F test: comparing the reduced model with the original model."

Setting verbose in test_diff to FALSE does not print the message.

Is this the 'warning' you mentioned? Or could you post here the warning that is generated on your OS/system?

const-ae commented 3 years ago

Hm, that's weird. I get the following:

✓ |  44   1   | test-test_diff [21.9 s]                         
────────────────────────────────────────────────────────────────
Warning (test-test_diff.R:119:3): F test works with missing data
NA/NaN function evaluation
Backtrace:
 1. proDA::proDA(se, ~cond + num, moderate_location = TRUE, moderate_variance = TRUE) test-test_diff.R:119:2
 2. proDA::fit_parameters_loop(...) /Users/ahlmanne/prog/r_packages/proDA/R/proDA.R:248:2
 3. base::lapply(...) /Users/ahlmanne/prog/r_packages/proDA/R/proDA.R:369:4
 4. proDA:::FUN(X[[i]], ...)
 5. proDA::pd_lm.fit(...) /Users/ahlmanne/prog/r_packages/proDA/R/proDA.R:370:6
 6. stats::nlminb(...) /Users/ahlmanne/prog/r_packages/proDA/R/pd_lm.R:287:4

It is possible that those errors, especially the ones where random NAs suddenly occur, happen only on some operating systems, which of course makes them much harder to debug. Often the same problems occur on Mac and Linux, but as you are on Windows, I will look into the problem.

const-ae commented 3 years ago

I have debugged the problem, and it comes from nlminb() calling the function with sigma2 = 0 (here). The reason the bug didn't occur for you is that it only appears sometimes (with seed(1) there is no warning; with seed(2) the warning is triggered). However, the bug didn't just affect your code; it was also present in my original code. The only difference was the seed that triggered the issue.
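For reference, a toy reproduction of the failure mode and the kind of guard that avoids it (a positive lower bound on sigma2 in nlminb; this is an illustration, not necessarily the fix applied in proDA):

```r
# Toy objective with the same failure mode: log(sigma2) is NaN at sigma2 = 0.
y <- c(1.2, 0.8, 1.1)
obj <- function(par) {
    mu <- par[1]; sigma2 <- par[2]
    sum(log(sigma2) + (y - mu)^2 / sigma2)
}

# A strictly positive lower bound keeps the optimizer away from sigma2 = 0,
# avoiding the "NA/NaN function evaluation" warning.
fit <- nlminb(start = c(0, 1), objective = obj, lower = c(-Inf, 1e-8))
```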

const-ae commented 3 years ago

I have now merged your code :) Thanks again a lot for your help :)