MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)

P-value correction error: wrong args for environment subassignment with parallel job failure #30

Closed RussBainer closed 1 year ago

RussBainer commented 1 year ago

Hi, very excited about this method. Unfortunately I am hitting an error during the P-value correction step:

> de_stat <- de_test_neighbourhoods(milo ,
+                              sample_id = "ID",
+                              design = ~Risk_geno,
+                              covariates = c("Risk_geno"),
+                              subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],
+                              output_type = "SCE",
+                              plot_summary_stat = TRUE,
+                              layout = "UMAP", BPPARAM = bpParam , 
+                              verbose = T)

Starting DE testing within each neighbourhood:
DE testing is completed for neighboourhood 1
DE testing is completed for neighboourhood 2
DE testing is completed for neighboourhood 3
[...]
DE testing is completed for neighboourhood 368
DE testing is completed for neighboourhood 369
DE testing is completed for neighboourhood 370

Finished DE testing within each neighbourhood.
Performing p-value correction across neighbourhoods..
Warning in parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result
Error in reducer$value.cache[[as.character(idx)]] <- values : 
  wrong args for environment subassignment

In the above code, milo is a SingleCellExperiment object with the appropriate covariates in the colData.

In the above run I specified BPPARAM as a MulticoreParam() with about half as many workers as there are cores available; I am currently rerunning with a SerialParam() to see if that helps.
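For reference, one conservative way to pick a worker count with base R's `parallel` package (the halving heuristic is just my own rule of thumb here, not a miloDE recommendation):

```r
library(parallel)

# Leave headroom: under MulticoreParam (forking), memory use scales with
# the worker count, so fewer workers means less chance of an OOM kill.
# na.rm guards against platforms where detectCores() returns NA.
n_workers <- max(1L, detectCores() %/% 2L, na.rm = TRUE)
n_workers
# pass it on, e.g.: BiocParallel::MulticoreParam(workers = n_workers)
```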

amissarova commented 1 year ago

Hey, sorry to hear you ran into an error.

Unfortunately, it is quite hard to parse errors when parallelization is involved (and, based on the error message, it is also possible that there are memory issues), so I would suggest running the same script without parallelization. Given your number of neighbourhoods it will take a while, so it may be best to start with a smaller dataset: you could subset your SCE object to fewer cells (e.g. to one cell type), or increase k for the neighbourhood assignment if you don't want to subset. For the test run, I'd aim for no more than 50 neighbourhoods.

Let me know whether that finishes - and if not, what error message you get. If it does finish, try your actual dataset (but unless you submit it on a cluster where you can request a sufficient amount of memory, be prepared for it to take a while - you can also parallelize, but with fewer cores).

Also, if you could share a toy dataset, I could try it myself. And if that doesn't work, please share the output of sessionInfo().

Additionally: if it breaks for the toy dataset (or for the actual one), it might be worth debugging it piece by piece. You can try something like this:

library(miloDE)
library(miloR)
library(SingleCellExperiment)
library(S4Vectors)

# define my variables
x = sce_milo
nhoods_x = nhoods(sce_milo)
subset_nhoods = NULL

# update subset_nhoods based on the input
if (is.null(subset_nhoods)){
  subset_nhoods = c(1:ncol(nhoods_x))
} else {
   if (is.logical(subset_nhoods)){
    subset_nhoods = which(subset_nhoods)
  }
}
subset_nhoods = sort(subset_nhoods)

# run DE testing for each neighbourhood (please insert the correct arguments here)
de_stat = lapply(seq(length(subset_nhoods)) , function(i){
  hood_id = subset_nhoods[i]
  out = de_test_single_neighbourhood(x , nhoods_x = nhoods_x, hood_id = hood_id,
                                           sample_id = "sample", design = ~tomato, covariates = c("tomato"), contrasts = NULL,
                                           min_n_cells_per_sample = 3 ,
                                           min_count = 5 , run_separately = T)
  return(out)
})

# put it together in SCE format
de_stat_sce = SingleCellExperiment(list(logFC = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods)) ,
                                            pval = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods)) ,
                                            pval_corrected_across_genes = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods))))
rownames(de_stat_sce) = rownames(x)
for (i in seq(length(de_stat))){
  current.de_stat = de_stat[[i]]
  assay(de_stat_sce , "logFC")[current.de_stat$gene , i] = current.de_stat$logFC
  assay(de_stat_sce , "pval")[current.de_stat$gene , i] = current.de_stat$pval
  assay(de_stat_sce , "pval_corrected_across_genes")[current.de_stat$gene , i] = current.de_stat$pval_corrected_across_genes
}

# add coldata on nhoods
meta_nhoods = lapply(seq(length(de_stat)) , function(i){
  current.de_stat = de_stat[[i]]
  current.meta = unique(current.de_stat[, c("Nhood" , "Nhood_center" , "test_performed" )])
  return(current.meta)
})
meta_nhoods = do.call(rbind , meta_nhoods)
colData(de_stat_sce) = DataFrame(meta_nhoods)
colnames(de_stat_sce) = meta_nhoods$Nhood

# filter to discard genes that are not tested in any neighbourhood:
idx = which(rowMeans(is.na(assay(de_stat_sce , "pval"))) < 1)
de_stat_sce = de_stat_sce[idx , ]

# Lets perform p-val correction across neighbourhoods
pval_corrected = lapply(rownames(de_stat_sce) , function(gene){
  res = tryCatch(
   {
    pvals = assay(de_stat_sce , "pval")[gene , ]
    out = spatial_pval_adjustment(nhoods_x = as.matrix(nhoods_x[,subset_nhoods]) , pvalues = pvals)
    out
  }, 
  error = function(dummy){
    message(paste0("code breaks for gene " , gene))
    return(NULL)
   }
 )
})

# if that breaks for some gene (or for all genes) - please share these
# (with `gene` set to a gene for which it breaks):
nhoods(sce_milo)
subset_nhoods
assay(de_stat_sce , "pval")[gene , ]
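For intuition about what that last step does: as I understand it, the correction across neighbourhoods is a weighted variant of Benjamini-Hochberg, with weights reflecting neighbourhood overlap. This toy base-R function illustrates only the weighted-BH idea - it is NOT miloDE's spatial_pval_adjustment (the real weights come from the nhoods matrix); with equal weights it reduces to ordinary BH:

```r
# Toy weighted Benjamini-Hochberg (illustration only, not miloDE's
# spatial_pval_adjustment - there the weights reflect neighbourhood overlap).
weighted_bh <- function(pvals, weights) {
  o   <- order(pvals)
  p   <- pvals[o]
  w   <- weights[o]
  adj <- sum(w) * p / cumsum(w)   # weighted analogue of n * p_(i) / i
  adj <- rev(cummin(rev(adj)))    # enforce monotone adjusted p-values
  adj <- pmin(adj, 1)
  out <- numeric(length(pvals))
  out[o] <- adj                   # return in the original order
  out
}

# With equal weights this matches p.adjust(..., method = "BH"):
p <- c(0.001, 0.01, 0.04, 0.2, 0.5)
all.equal(weighted_bh(p, rep(1, 5)), p.adjust(p, method = "BH"))  # TRUE
```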
RussBainer commented 1 year ago

Thanks for the prompt response, and again congratulations on the exciting method.

Looking around at other resources, it seems like this is likely to be an out-of-memory error on one of the jobs, and I'll pass along any errors that I get when it finishes.

As a general rule, do you have recommendations about how to select:

1. the optimal k for neighbourhood assignment?
2. memory and parallelization settings (number of workers)?

I'll report back after I've tried the tests you suggested.

amissarova commented 1 year ago
1. For the optimal k - to be honest, it is a bit hard to say, since changing it in either direction comes with a cost (either sensitivity of DE detection or neighbourhood homogeneity). When using order = 2, I like to ensure that I have 400-500 cells per neighbourhood on average (slightly more may be better for detection, but then you run the risk of non-homogeneity within neighbourhoods). If you feel that the number of neighbourhoods is getting out of hand, consider increasing k slightly. Importantly though, another option is to focus on fewer cells (e.g. a single cell type). I see a few benefits to this: it is much quicker, obviously, but IMO you also end up with more interpretable results and a much more straightforward downstream analysis.

2. I'm sorry, but I can't really help you with memory questions :( I mostly run miloDE on a cluster, and therefore rarely run into memory issues.

Also, maybe check this thread - there may be some useful suggestions there: https://github.com/MarioniLab/miloDE/issues/27
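If you do want to push k up, the reassignment step might look like this (a sketch only - I'm assuming assign_neighbourhoods() is the relevant miloDE call in your install, and k = 30 is a made-up value; double-check the arguments with args(miloDE::assign_neighbourhoods)):

```r
# Sketch only: guarded so the snippet is safe to source where
# miloDE is not installed.
if (requireNamespace("miloDE", quietly = TRUE)) {
  sce_milo <- miloDE::assign_neighbourhoods(
    sce,        # your SingleCellExperiment (hypothetical object name)
    k = 30,     # larger k -> fewer, larger neighbourhoods (made-up value)
    order = 2
  )
  ncol(miloR::nhoods(sce_milo))  # resulting number of neighbourhoods
}
```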

RussBainer commented 1 year ago

As an update, I was able to run to completion with a smaller dataset and fewer neighborhoods regardless of my parallelization settings. I'll check out the other thread as you suggest.

For now I think we should assume that the issue was related to the number of jobs and extent of parallelization (since I did get a warning to that effect) and close the issue. Thanks again for your help!