MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)
Other
57 stars 2 forks source link

Slow plotting because of .check_sce_milo #37

Closed mihem closed 5 months ago

mihem commented 7 months ago

I found plot_milo_by_single_metric slow (~10min). The culprit was .check_sce_milo https://github.com/MarioniLab/miloDE/blob/172a13a7ec1a2e0a96966e00948ee2b831e296b6/R/plotting_functions.R#L70

or more specifically miloR::graph(x) https://github.com/MarioniLab/miloDE/blob/172a13a7ec1a2e0a96966e00948ee2b831e296b6/R/utils.R#L14 Commenting out this line as I did in this fork improved the performance to a few seconds https://github.com/mihem/miloDE/commit/846c1ed014b4e597a1b97e2aac57663f7345f70a#

my miloDE object is pretty large with 11.7Gb (~400k cells), but I found it pretty annoying to wait for 10min, if you want to tweak the graph a little. Removing this check helps, but maybe there's a way to get similar performance without removing this check @amissarova ?

de_test_neighbourhoods is of course much more time consuming (~72h in my case) with most of the time sepnt on p-value correciton, but at least it's usually not necessary to run this function multiple times. This has been already discussed previously https://github.com/MarioniLab/miloDE/issues/27, but maybe there are some new ideas? Using Rcpp in the slow parts would probably increase the performance significantly?

amissarova commented 5 months ago

Hi @mihem, thanks for your suggestion and I appreciate your interest in making this package better. The check here is perhaps not the most important one and I would be ok with removing it. But Im very surprised by the fact that it takes 10 min for this check.

I just created a milo-object on real data, 380k cells and then plotted a dummy nhood-stat - plotting itself took me 4 seconds, and check length(miloR::graph(x)) is instant.

Im wondering though how this happened for you - would it be possible to share details about this milo object and specifically the graph structure?

mihem commented 5 months ago

Hi @amissarova sure.

I'll send you an e-mail to alsu.missarova.32@gmail.com with a link to my dataset.

milo_DE <- qs::qread(file.path("objects", "milo_DE.qs"), nthread = 4)
de_stat <- qs::qread(file.path("objects", "milo_de_stat_pnp_ctrl.qs"), nthread = 6)
stat_de_magnitude <- rank_neighbourhoods_by_DE_magnitude(de_stat)

system.time(
p1 <- plot_milo_by_single_metric(
  milo_DE,
  stat_de_magnitude,
  colour_by = "n_DE_genes",
  layout = "UMAP.SCVI.FULL",
  size_range = c(0.2, 3),
  edge_width = c(0.1, 1.0), 
  edge_weight.thres = 10
) +
  viridis::scale_fill_viridis(name = "# DE genes", option = "inferno")
)

Or more specifically

system.time(
  length(miloR:::graph(milo_DE))
)

user system elapsed 156.065 5.677 161.747

Thanks!

amissarova commented 5 months ago

Hi @mihem, looking into this - it seems to me that the problem is not in the long time calculation of the length of the graph itself, but rather loading the graph itself. When i load your data from scracth, and run the code the first time - it takes me indeed a long time (not 10 minutes tho - 3 minutes), but when i rerun the same function - much shorter (7 seconds). See the code below:

> library(qs)
qs 0.26.1
> de_stat = qread("~/temp/milo_de_stat_pnp_ctrl.qs")
> stat_de_magnitude <- miloDE::rank_neighbourhoods_by_DE_magnitude(de_stat)
> sce = qread("~/temp/milo_DE.qs")
> t0 = proc.time()
> p1 <- miloDE::plot_milo_by_single_metric(
+     sce,
+     stat_de_magnitude,
+     colour_by = "n_DE_genes",
+     layout = "UMAP.SCVI.FULL",
+     size_range = c(0.2, 3),
+     edge_width = c(0.1, 1.0), 
+     edge_weight.thres = 10) +
+     viridis::scale_fill_viridis(name = "# DE genes", option = "inferno")
Loading required package: BPCells
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
> t1 = proc.time()
> print(t1["elapsed"] - t0["elapsed"])
elapsed 
**181.636** 
> t0 = proc.time()
> p1 <- miloDE::plot_milo_by_single_metric(
+     sce,
+     stat_de_magnitude,
+     colour_by = "n_DE_genes",
+     layout = "UMAP.SCVI.FULL",
+     size_range = c(0.2, 3),
+     edge_width = c(0.1, 1.0), 
+     edge_weight.thres = 10) +
+     viridis::scale_fill_viridis(name = "# DE genes", option = "inferno")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
> t1 = proc.time()
> print(t1["elapsed"] - t0["elapsed"])
elapsed 
  **7.329** 
> 

Therefore I assume that it takes much shorter cause of the caching usage perhaps that is happening the second time (when the graph has been loaded once before?). Im not very familiar with such artifacts unfortunately, and will have to read up about this and do a bit of research. But it seems that if the graph loading is happening at any point (and apparently it is very costly in your case), it will take a long time (the first time tho). I will think a bit and do some testing later on - perhaps I introduce a relaxed check for plotting functions or alter check all together. I will update this thread once make a decision whether I change anything and how exactly I change it.

mihem commented 5 months ago

Hi @amissarova ,

Glad you could reproduce the issue. I absolutely agree. The first time it takes really long (okay I exaggerated with 10min but 3min is still very long for a plotting function) and the second time it s fine. However in my use case it happened quite often that I wanted to make the dots larger a few days after the initial analysis and waiting for 3min is a little annoying then. Yes maybe some less costly check?

Thanks.

amissarova commented 5 months ago

@mihem , I added now a simpler check only for the plotting function, omitting loading of the graph. Thanks for your comment and im gonna close this issue now - pls reopen if you have more questions

mihem commented 5 months ago

Agree, that dramatically increases the speed in my case as well.