MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)

Runtime for p-value correction across neighbourhoods #27

Closed Biomiha closed 1 year ago

Biomiha commented 1 year ago

Hi,

First of all, thank you so much for the package, the preprint and the tweetorial. I think it's a great idea and I am very excited to try it. One thing I've noticed is that the `Performing p-value correction across neighbourhoods..` step takes a very long time to complete, especially compared to the other steps of the workflow. On a dataset of ~100k cells it took almost 15,000 s (just over 4 hrs) using 8 threads in parallel. The neighbourhood assignment and the DE testing within each neighbourhood completed in a few minutes, whereas the p-value correction step took hours, which I found surprising. Is there a particular reason that the p-value correction takes the longest, and is there a way of assigning an intermediate variable before running the p-value correction, so that the dataset can be QC'd before the most time-consuming step?

Many thanks in advance, Miha

amissarova commented 1 year ago

Hey @Biomiha, apologies for the delayed response. Yeah, (unfortunately) `de_test_neighbourhoods` is a costly function, and you are right - the correction of p-values is the most time-consuming step. In this part of the function we need to correct p-values (for each gene, across neighbourhoods) in a spatially-aware manner, and for this we need to compute ~density weights for each neighbourhood (essentially, how strongly it overlaps with the other neighbourhoods). Importantly, different neighbourhoods end up being tested for different genes, so we cannot calculate the weights once for all neighbourhoods - the weight calculation is specific to the set of neighbourhoods actually tested, and is therefore done separately for each gene. The algorithm therefore scales with the number of genes, which is usually on the order of a few tens of thousands.
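
For intuition: the correction itself is a weighted Benjamini-Hochberg procedure, where each neighbourhood's weight reflects how strongly it overlaps the other neighbourhoods tested for that gene. Below is a minimal sketch of the idea - this is not the exact miloDE implementation (the inverse-overlap weights are a simplification, and NA p-values from untested neighbourhoods are not handled):

    # Sketch of a spatially-aware (weighted) BH correction for a single gene.
    # NOT the exact miloDE code: the inverse-overlap weights are a simplification,
    # and NA p-values (untested neighbourhoods) are not handled here.
    # nhoods_x: binary cells x nhoods matrix; pvalues: one p-value per nhood.
    weighted_bh = function(pvalues, nhoods_x){
      overlap = Matrix::crossprod(nhoods_x)    # nhood x nhood shared-cell counts
      weights = 1 / Matrix::colSums(overlap)   # strongly overlapping nhoods are down-weighted
      o = order(pvalues)
      p = pvalues[o]
      w = weights[o]
      adjp = numeric(length(p))
      adjp[o] = rev(cummin(rev(sum(w) * p / cumsum(w))))   # weighted BH step-up
      pmin(adjp, 1)
    }

This is also why the step is expensive: the overlap/weight computation has to be redone for each gene, on that gene's own set of tested neighbourhoods.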

I realise that the long computing time is quite annoying (and I will check more thoroughly whether there is any part I can speed up), but I guess the comforting part is that this function is not expected to be run many times - rather once, maybe a few times at most. Even so, I agree that it would be good to minimise the running time. So here are a few ideas:

  1. (You already do this.) Parallelise. For big datasets (> 80k cells) I usually run this on a cluster and submit it as a job with 16 or so cores. On a local machine I don't think you can go above 8 though.
  2. You can decrease the number of genes you test a priori. You probably wouldn't want to do HVG selection, since some genes that are DE might not be among the HVGs, but a reasonable filter is to keep only expressed genes (under the hood I apply `edgeR::filterByExpr` within each neighbourhood). In the miloDE algorithm, before the spatial p-value correction, I already discard genes that did not undergo testing in any neighbourhood. Within each neighbourhood, the parameter that controls the minimum expression threshold is `min_count`; the default is 3, which is quite permissive. If you increase it (say to 5-10), fewer genes will likely be tested -> faster spatial correction. Importantly though, I would recommend against manually selecting a few interesting genes and running miloDE only on them, since this is likely to strongly skew the variance estimates.
  3. You can also decrease the number of neighbourhoods, either by increasing `k` or by decreasing the number of cells you test on. For example, I personally like to run miloDE as a more downstream part of the analysis, once I have already detected some cell types of interest for further investigation. In that case, if I also have reason to believe that these cell types exhibit some intra-cell-type heterogeneity, I run miloDE directly on the cells from those cell types. Of note, this will also reduce the number of expressed genes (see the point above). A short sketch illustrating this and the previous point follows the patched code below.
  4. Finally, as you pointed out, the first part of the algorithm runs much quicker, and it would be good to discard some genes before the second part, which takes a while. This is not an immediate part of the algorithm, but you could patch something like this: run DE testing on all genes, then discard genes for which all p-values (i.e. across all neighbourhoods) are above a certain threshold (right now I only discard untested genes), and run the spatial correction on the remaining genes (you can call `miloDE::spatial_pval_adjustment` directly). Below I have patched code to do so - please bear in mind that this is essentially `de_test_neighbourhoods` slightly torn up and re-used, and a few checks are missing, so treat it with caution (it might also be buggy, I didn't properly test it):
    
    library(miloDE)
    library(miloR)
    library(SingleCellExperiment)
    library(S4Vectors)
    library(BiocParallel)
    ncores = 4
    mcparam = MulticoreParam(workers = ncores)
    register(mcparam)

    # define my variables
    x = sce_milo
    nhoods_x = nhoods(sce_milo)
    subset_nhoods = NULL

    # update subset_nhoods based on the input
    if (is.null(subset_nhoods)){
      subset_nhoods = c(1:ncol(nhoods_x))
    } else {
      if (is.logical(subset_nhoods)){
        subset_nhoods = which(subset_nhoods)
      }
    }
    subset_nhoods = sort(subset_nhoods)

    # run DE testing for each neighbourhood
    de_stat = bplapply(seq(length(subset_nhoods)), function(i){
      hood_id = subset_nhoods[i]
      out = de_test_single_neighbourhood(x, nhoods_x = nhoods_x, hood_id = hood_id,
                                         sample_id = "sample", design = ~tomato,
                                         covariates = c("tomato"), contrasts = NULL,
                                         min_n_cells_per_sample = 3, min_count = 5,
                                         run_separately = TRUE)
      return(out)
    }, BPPARAM = mcparam)

    # put it together in SCE format
    de_stat_sce = SingleCellExperiment(list(
      logFC = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods)),
      pval = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods)),
      pval_corrected_across_genes = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods))
    ))
    rownames(de_stat_sce) = rownames(x)
    for (i in seq(length(de_stat))){
      current.de_stat = de_stat[[i]]
      assay(de_stat_sce, "logFC")[current.de_stat$gene, i] = current.de_stat$logFC
      assay(de_stat_sce, "pval")[current.de_stat$gene, i] = current.de_stat$pval
      assay(de_stat_sce, "pval_corrected_across_genes")[current.de_stat$gene, i] = current.de_stat$pval_corrected_across_genes
    }

    # add colData on nhoods
    meta_nhoods = lapply(seq(length(de_stat)), function(i){
      current.de_stat = de_stat[[i]]
      current.meta = unique(current.de_stat[, c("Nhood", "Nhood_center", "test_performed")])
      return(current.meta)
    })
    meta_nhoods = do.call(rbind, meta_nhoods)
    colData(de_stat_sce) = DataFrame(meta_nhoods)
    colnames(de_stat_sce) = meta_nhoods$Nhood

    # in the original algorithm, this filter discards genes that are not
    # tested in any neighbourhood:
    idx_original = which(rowMeans(is.na(assay(de_stat_sce, "pval"))) < 1)
    message(paste0("Total number of genes = ", nrow(de_stat_sce)))
    #> Total number of genes = 29452
    message(paste0("After original filtering, remaining number of genes = ", length(idx_original)))
    #> After original filtering, remaining number of genes = 12565

    # you can push this further and discard genes that are either untested or
    # not significant in any tested neighbourhood:
    pval.thresh = 0.01
    idx_updated = which(rowSums(assay(de_stat_sce, "pval") < pval.thresh, na.rm = TRUE) > 0)
    message(paste0("After updated filtering, remaining number of genes = ", length(idx_updated)))
    #> After updated filtering, remaining number of genes = 5413

    # we got rid of more genes - ~5.5k instead of ~12.5k - so the spatial
    # correction should now take roughly half the time

    # let's perform the p-value correction across neighbourhoods
    de_stat_sce = de_stat_sce[idx_updated, ]
    pval_corrected = bplapply(rownames(de_stat_sce), function(gene){
      pvals = assay(de_stat_sce, "pval")[gene, ]
      out = spatial_pval_adjustment(nhoods_x = as.matrix(nhoods_x[, subset_nhoods]), pvalues = pvals)
      return(out)
    }, BPPARAM = mcparam)

    # add the assay into the final de-stat
    assay_pval_corrected_across_nhoods = do.call(rbind, pval_corrected)
    rownames(assay_pval_corrected_across_nhoods) = rownames(de_stat_sce)
    colnames(assay_pval_corrected_across_nhoods) = colnames(de_stat_sce)
    assay(de_stat_sce, "pval_corrected_across_nhoods") = assay_pval_corrected_across_nhoods
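
For points 2 and 3 above, the knobs are just arguments to the standard workflow. Here is a minimal sketch of what that looks like - note that the `celltype` column and the reduced-dimension name are assumptions for illustration, so adapt them to your object (and double-check the exact signatures in the function docs):

    # Points 2 & 3: fewer genes and fewer neighbourhoods via standard arguments.
    # Sketch only: `sce`, its `celltype` column and the reducedDim name are assumed.
    sce_subset = sce[, sce$celltype == "celltype_of_interest"]   # point 3: test fewer cells
    sce_milo_2 = assign_neighbourhoods(sce_subset, k = 30, order = 2,
                                       reducedDim_name = "pca.corrected")   # larger k -> fewer nhoods
    de_stat_2 = de_test_neighbourhoods(sce_milo_2,
                                       sample_id = "sample",
                                       design = ~tomato,
                                       covariates = c("tomato"),
                                       min_count = 10,   # point 2: stricter expression filter
                                       BPPARAM = mcparam)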



Please let me know if this response is up to your satisfaction, and apologies again about the long running time. Let me know if you have more questions or if something is unclear.

Based on this discussion, I'm now considering introducing this p-value-threshold filtering inside `de_test_neighbourhoods` - I might do it for the next release. Thanks for bringing this up.
amissarova commented 1 year ago

Hey @Biomiha, shall I close this issue then, or do you have more follow-up questions?

Biomiha commented 1 year ago

Hi @amissarova,

Sure, I think we can close the issue. I'll hopefully find some time to test the things you suggested and may add a comment if needed.

Thanks