const-ae / proDA

Protein Differential Abundance for Label-Free Mass Spectrometry https://const-ae.github.io/proDA/
18 stars 8 forks source link

improve speed of function fit_parameters_loop #9

Open tnaake opened 3 years ago

tnaake commented 3 years ago

Hi @const-ae

following the same logic of pull request #8 I had another look at the function fit_parameters_loop and improved slightly the performance of the function. It is now slightly faster than the original version (see below, I hope it is worth checking).

However, there is a problem, which doesn't allow to compare directly if the two versions yield the same result, see:

 bench::mark(
    x1 = proDA:::fit_parameters_loop(Y = sub_sample_mat, model_matrix = model_matrix, 
                         location_prior_df = 3, moderate_location = TRUE,
                         moderate_variance = TRUE, max_iter = 20, epsilon = 1e-03),
    x2 = proDA:::fit_parameters_loop(Y = sub_sample_mat, model_matrix = model_matrix, 
                          location_prior_df = 3, moderate_location = TRUE,
                          moderate_variance = TRUE, max_iter = 20, epsilon = 1e-03)

Error: Each result must equal the first result:
`x1` does not equal `x2`

I guess this is doe to some random number generation!

For now, see bench::mark with check=FALSE :

 bench::mark(
    x1 = proDA:::fit_parameters_loop(Y = sub_sample_mat, model_matrix = model_matrix, 
                                 location_prior_df = 3, moderate_location = TRUE,
                                 moderate_variance = TRUE, max_iter = 20, epsilon = 1e-03), 
    x2 = fit_parameters_loop(Y = sub_sample_mat, model_matrix = model_matrix, 
                                location_prior_df = 3, moderate_location = TRUE,
                                moderate_variance = TRUE, max_iter = 20, epsilon = 1e-03), 
    max_iterations = 10, check = FALSE)

  # A tibble: 2 x 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory          time    gc       
   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>          <list>  <list>   
 1 x1             48s     48s    0.0209     954MB     3.00     1   144        48s <NULL> <Rprofmem[,3] ~ <bch:t~ <tibble ~
 2 x2           46.5s   46.5s    0.0215     950MB     2.21     1   103      46.5s <NULL> <Rprofmem[,3] ~ <bch:t~ <tibble ~

Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

Short script to generate the input variables:

library(proDA)
data <- generate_synthetic_data(n_proteins = 3000, n_conditions = 2)
design <- data$groups
data <- data$Y
col_data <- NULL
reference_level <- NULL
data_is_log_transformed <- TRUE
n_subsample = nrow(data)

if (!is.matrix(data) && !is(data, "SummarizedExperiment")) {
    if (canCoerce(data, "SummarizedExperiment")) {
        data <- as(data, "SummarizedExperiment")
    } else {
        stop("Cannot handle data of class", class(data), 
             ". It must be of type", "matrix or it should be possible to cast it to SummarizedExperiment")
    }
}
n_samples <- ncol(data)
n_rows <- nrow(data)
if (is.matrix(design)) {
    model_matrix <- design
    design_formula <- NULL
} else if ((is.vector(design) || is.factor(design))) {
    if (length(design) != n_samples) {
        stop(paste0("The specified design vector length (", 
                    length(design), ") does not match ", "the number of samples: ", 
                    n_samples))
    }
    model_matrix <- proDA:::convert_chr_vec_to_model_matrix(design, 
                                                    reference_level)
    design_formula <- NULL
} else if (inherits(design, "formula")) {
    if (design == formula(~1) && is.null(col_data)) {
        col_data <- as.data.frame(matrix(numeric(0), nrow = n_samples))
    }
    compl_col_data <- if (is(data, "SummarizedExperiment")) {
        if (is.null(col_data)) {
            colData(data)
        } else cbind(col_data, colData(data))
    } else {
        col_data
    }
    model_matrix <- convert_formula_to_model_matrix(design, 
                                                    compl_col_data, reference_level)
    design_formula <- design
} else {
    stop(paste0("design argment of class ", class(design), 
                " is not supported. Please ", "specify a `model_matrix`, a `character vector`, or a `formula`."))
}
rownames(model_matrix) <- colnames(data)

if (is.matrix(data)) {
    if (any(!is.na(data) & data == 0)) {
        if (any(is.na(data))) {
            warning(paste0("The data contains a mix of ", 
                           sum(!is.na(data) & data == 0), " exact zeros ", 
                           "and ", sum(is.na(data)), " NA's. Will treat the zeros as valid input and not replace them with NA's."))
        } else {
            warning(paste0("The data contains ", sum(!is.na(data) & 
                                                         data == 0), " exact zeros and no NA's.", 
                           "Replacing all exact zeros with NA's."))
            data[!is.na(data) & data == 0] <- NA
        }
    }
    if (!data_is_log_transformed) {
        data <- log2(data)
    }
    data_mat <- data
} else if (is(data, "SummarizedExperiment")) {
    if (any(!is.na(assay(data)) & assay(data) == 0)) {
        if (any(is.na(assay(data)))) {
            warning(paste0("The data contains a mix of ", 
                           sum(!is.na(assay(data)) & assay(data) == 0), 
                           " exact zeros ", "and ", sum(is.na(assay(data))), 
                           " NA's. Will treat the zeros as valid input and not replace them with NA's."))
        } else {
            warning(paste0("The data contains ", sum(!is.na(assay(data)) & 
                                                         assay(data) == 0), " exact zeros and no NA's.", 
                           "Replacing all exact zeros with NA's."))
            assay(data)[!is.na(assay(data)) & assay(data) == 
                            0] <- NA
        }
    }
    if (!data_is_log_transformed) {
        assay(data) <- log2(assay(data))
    }
    data_mat <- assay(data)
    assays(data)[seq_len(length(assays(data)) - 1) + 1] <- NULL
} else {
    stop("data of tye ", class(data), " is not supported.")
}

n_subsample <- min(nrow(data_mat), n_subsample)
sub_sample_mat <- data_mat[seq_len(n_subsample), , drop = FALSE]

Surprisingly, these updates (I guess most comes from the lapply changes result in the considerable speed improvements.

const-ae commented 3 years ago

Hi Thomas,

thanks. I'll take a look right away.


I am a bit confused, because first you say

I had another look at the function fit_parameters_loop and improved slightly the performance of the function

but in the end

these updates [...] result in the considerable speed improvements.

Both times, you talk about the benchmark (2 seconds / 5% speed-up), right?


However, there is a problem, which doesn't allow to compare directly if the two versions yield the same result, see [...] I guess this is doe to some random number generation!

I am a bit concerned about this, can you investigate this a bit more and quantify how much the results change?


I'll also make a few inline comments about some specific things.

Best, Constantin

tnaake commented 3 years ago

I have found another bunch of code that could be slightly optimised:

res_init <- lapply(seq_len(nrow(Y)), function(i){
     pd_lm.fit(Y_compl[i, ], model_matrix,
               dropout_curve_position = rep(NA, n_samples),
               dropout_curve_scale =rep(NA, n_samples),
               verbose=verbose)
})

could become:

rep_NA <- rep(NA, n_samples)
res_init <- lapply(seq_len(nrow(Y)), function(i){
      pd_lm.fit(Y_compl[i, ], model_matrix,
                dropout_curve_position = rep_NA,
                dropout_curve_scale =rep_NA,
                verbose=verbose)
 })

microbenchmark the change results in slight decrease:

microbenchmark::microbenchmark(

   res_init = {
      rep_NA = rep(NA, n_samples);
      lapply(split_Y_compl, function(i){
      pd_lm.fit(i, model_matrix,
              dropout_curve_position = rep_NA,
              dropout_curve_scale = rep_NA,
              verbose=verbose)
     })},

   res_init2 = lapply(split_Y_compl, function(i){
       pd_lm.fit(i, model_matrix,
              dropout_curve_position = rep(NA, n_samples),
              dropout_curve_scale = rep(NA, n_samples),
              verbose=verbose)
 })
)

Unit: seconds
  expr      min       lq     mean   median       uq      max neval cld
 res_init 2.490822 2.743049 2.980568 2.896272 3.209700 4.115103   100  a 
 res_init2 2.519873 2.766047 3.080926 2.958154 3.381294 4.317780   100   b