MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)

P-value correction error: wrong args for environment subassignment with parallel job failure #30

Closed RussBainer closed 1 year ago

RussBainer commented 1 year ago

Hi, very excited about this method. Unfortunately I am hitting an error during the P-value correction step:

> de_stat <- de_test_neighbourhoods(milo ,
+                              sample_id = "ID",
+                              design = ~Risk_geno,
+                              covariates = c("Risk_geno"),
+                              subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],
+                              output_type = "SCE",
+                              plot_summary_stat = TRUE,
+                              layout = "UMAP", BPPARAM = bpParam , 
+                              verbose = T)

Starting DE testing within each neighbourhood:
DE testing is completed for neighboourhood 1
DE testing is completed for neighboourhood 2
DE testing is completed for neighboourhood 3
[...]
DE testing is completed for neighboourhood 368
DE testing is completed for neighboourhood 369
DE testing is completed for neighboourhood 370

Finished DE testing within each neighbourhood.
Performing p-value correction across neighbourhoods..
Warning in parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result
Error in reducer$value.cache[[as.character(idx)]] <- values : 
  wrong args for environment subassignment

In the above code, milo is a SingleCellExperiment object with the appropriate covariates in the colData.

In the above run I specified BPPARAM as a MulticoreParam() with about half as many workers as there are cores available; I am currently rerunning with a SerialParam() to see if that helps.
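For reference, one conservative way to pick a worker count with base R's `parallel` package (the halving heuristic is just my own rule of thumb here, not a miloDE recommendation):

```r
library(parallel)

# Leave headroom: under MulticoreParam (forking), memory use scales with
# the worker count, so fewer workers means less chance of an OOM kill.
# na.rm guards against platforms where detectCores() returns NA.
n_workers <- max(1L, detectCores() %/% 2L, na.rm = TRUE)
n_workers
# pass it on, e.g.: BiocParallel::MulticoreParam(workers = n_workers)
```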

amissarova commented 1 year ago

Hey, sorry to hear you ran into an error.

Unfortunately, it is quite hard to parse errors when parallelization is involved (and, based on the error message, it is also possible that there are memory issues), so I would suggest running the same script without parallelization. Given your number of neighbourhoods it will take a while, so it may be best to start with a smaller dataset: you could subset your SCE object to fewer cells (e.g. to one cell type), or increase k for the neighbourhood assignment if you don't want to subset. For the test run, I'd aim for no more than 50 neighbourhoods.

Let me know whether that finishes - and if not, what error message you get. If it does finish, try your actual dataset (but unless you submit it on a cluster where you can request a sufficient amount of memory, be prepared for it to take a while - you can also parallelize, but with fewer cores).

Also, if you could share a toy dataset, I could try it myself. And if that doesn't work, please share the output of sessionInfo().

Additionally: if it breaks for the toy dataset (or for the actual one), it might be worth debugging it piece by piece. You can try something like this:

library(miloDE)
library(miloR)
library(SingleCellExperiment)
library(S4Vectors)

# define my variables
x = sce_milo
nhoods_x = nhoods(sce_milo)
subset_nhoods = NULL

# update subset_nhoods based on the input
if (is.null(subset_nhoods)){
  subset_nhoods = c(1:ncol(nhoods_x))
} else {
   if (is.logical(subset_nhoods)){
    subset_nhoods = which(subset_nhoods)
  }
}
subset_nhoods = sort(subset_nhoods)

# run DE testing for each neighbourhood (please insert the correct arguments here)
de_stat = lapply(seq(length(subset_nhoods)) , function(i){
  hood_id = subset_nhoods[i]
  out = de_test_single_neighbourhood(x , nhoods_x = nhoods_x, hood_id = hood_id,
                                           sample_id = "sample", design = ~tomato, covariates = c("tomato"), contrasts = NULL,
                                           min_n_cells_per_sample = 3 ,
                                           min_count = 5 , run_separately = T)
  return(out)
})

# put it together in SCE format
de_stat_sce = SingleCellExperiment(list(logFC = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods)) ,
                                            pval = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods)) ,
                                            pval_corrected_across_genes = matrix(NA, nrow = nrow(x), ncol = length(subset_nhoods))))
rownames(de_stat_sce) = rownames(x)
for (i in seq(length(de_stat))){
  current.de_stat = de_stat[[i]]
  assay(de_stat_sce , "logFC")[current.de_stat$gene , i] = current.de_stat$logFC
  assay(de_stat_sce , "pval")[current.de_stat$gene , i] = current.de_stat$pval
  assay(de_stat_sce , "pval_corrected_across_genes")[current.de_stat$gene , i] = current.de_stat$pval_corrected_across_genes
}

# add coldata on nhoods
meta_nhoods = lapply(seq(length(de_stat)) , function(i){
  current.de_stat = de_stat[[i]]
  current.meta = unique(current.de_stat[, c("Nhood" , "Nhood_center" , "test_performed" )])
  return(current.meta)
})
meta_nhoods = do.call(rbind , meta_nhoods)
colData(de_stat_sce) = DataFrame(meta_nhoods)
colnames(de_stat_sce) = meta_nhoods$Nhood

# filter to discard genes that are not tested in any neighbourhood:
idx = which(rowMeans(is.na(assay(de_stat_sce , "pval"))) < 1)
de_stat_sce = de_stat_sce[idx , ]

# Lets perform p-val correction across neighbourhoods
pval_corrected = lapply(rownames(de_stat_sce) , function(gene){
  res = tryCatch(
   {
    pvals = assay(de_stat_sce , "pval")[gene , ]
    out = spatial_pval_adjustment(nhoods_x = as.matrix(nhoods_x[,subset_nhoods]) , pvalues = pvals)
    out
  }, 
  error = function(dummy){
    message(paste0("code breaks for gene " , gene))
    return(NULL)
   }
 )
})

# if that breaks for some gene (or for all genes) - please share these
# (with `gene` set to a gene for which it breaks):
nhoods(sce_milo)
subset_nhoods
assay(de_stat_sce , "pval")[gene , ]
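For intuition about what that last step does: as I understand it, the correction across neighbourhoods is a weighted variant of Benjamini-Hochberg, with weights reflecting neighbourhood overlap. This toy base-R function illustrates only the weighted-BH idea - it is NOT miloDE's spatial_pval_adjustment (the real weights come from the nhoods matrix); with equal weights it reduces to ordinary BH:

```r
# Toy weighted Benjamini-Hochberg (illustration only, not miloDE's
# spatial_pval_adjustment - there the weights reflect neighbourhood overlap).
weighted_bh <- function(pvals, weights) {
  o   <- order(pvals)
  p   <- pvals[o]
  w   <- weights[o]
  adj <- sum(w) * p / cumsum(w)   # weighted analogue of n * p_(i) / i
  adj <- rev(cummin(rev(adj)))    # enforce monotone adjusted p-values
  adj <- pmin(adj, 1)
  out <- numeric(length(pvals))
  out[o] <- adj                   # return in the original order
  out
}

# With equal weights this matches p.adjust(..., method = "BH"):
p <- c(0.001, 0.01, 0.04, 0.2, 0.5)
all.equal(weighted_bh(p, rep(1, 5)), p.adjust(p, method = "BH"))  # TRUE
```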
RussBainer commented 1 year ago

Thanks for the prompt response, and again congratulations on the exciting method.

Looking around at other resources, it seems like this is likely to be an out-of-memory error on one of the jobs, and I'll pass along any errors that I get when it finishes.

As a general rule, do you have recommendations about how to select:

1. the optimal k for neighbourhood assignment?
2. memory and parallelization settings (number of workers)?

I'll report back after I've tried the tests you suggested.

amissarova commented 1 year ago
1. For the optimal k - to be honest, it is a bit hard to say, since changing it in either direction comes with a cost (either sensitivity of DE detection or neighbourhood homogeneity). When using order = 2, I like to ensure that I have 400-500 cells per neighbourhood on average (slightly more may be better for detection, but then you run the risk of non-homogeneity within neighbourhoods). If you feel that the number of neighbourhoods is getting out of hand, consider increasing k slightly. Importantly though, another option is to focus on fewer cells (e.g. a single cell type). I see a few benefits to this: it is much quicker, obviously, but IMO you also end up with more interpretable results and a much more straightforward downstream analysis.

2. I'm sorry, but I can't really help you with memory questions :( I mostly run miloDE on a cluster, and therefore rarely run into memory issues.

Also, maybe check this thread - there may be some useful suggestions there: https://github.com/MarioniLab/miloDE/issues/27
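If you do want to push k up, the reassignment step might look like this (a sketch only - I'm assuming assign_neighbourhoods() is the relevant miloDE call in your install, and k = 30 is a made-up value; double-check the arguments with args(miloDE::assign_neighbourhoods)):

```r
# Sketch only: guarded so the snippet is safe to source where
# miloDE is not installed.
if (requireNamespace("miloDE", quietly = TRUE)) {
  sce_milo <- miloDE::assign_neighbourhoods(
    sce,        # your SingleCellExperiment (hypothetical object name)
    k = 30,     # larger k -> fewer, larger neighbourhoods (made-up value)
    order = 2
  )
  ncol(miloR::nhoods(sce_milo))  # resulting number of neighbourhoods
}
```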

RussBainer commented 1 year ago

As an update, I was able to run to completion with a smaller dataset and fewer neighborhoods regardless of my parallelization settings. I'll check out the other thread as you suggest.

For now I think we should assume that the issue was related to the number of jobs and extent of parallelization (since I did get a warning to that effect) and close the issue. Thanks again for your help!