drisso / SingleCellExperiment

Clone of the Bioconductor repository for the SingleCellExperiment package, see https://bioconductor.org/packages/devel/bioc/html/SingleCellExperiment.html for the official development version.
65 stars 18 forks source link

Printing is suddenly slow? #31

Closed LTLA closed 5 years ago

LTLA commented 5 years ago

Do other people get this behaviour, or is it just my computer?

library(scRNAseq)
sce <- ZeiselBrainData()

# Inflating the data to make it a bit more obvious.
Y <- cbind(sce, sce, sce, sce)
Y <- cbind(Y, Y, Y, Y)

# Comparing timings
system.time(show(Y))
##   user  system elapsed 
##  0.012   0.000   0.014 
system.time(print(Y))
##    user  system elapsed 
##  1.857   1.513   3.369 

This weird print slow-down is a problem is because, somehow, typing Y takes 3 seconds to show the contents of the SCE - so it's calling print() (or whatever lies underneath it) rather than show. Why would this be? AFAICT pure SummarizedExperiment instances do not exhibit this:

Z <- as(Y, "SummarizedExperiment")
system.time(show(Z))
##    user  system elapsed 
##   0.007   0.000   0.007 
system.time(print(Z))
##    user  system elapsed 
##   0.007   0.000   0.007 

... yet there's nothing special in the SCE that could cause it. Deleting the specialized SCE show method does not help (not surprising, as show wasn't the problem anyway).

Incidentally, this also somehow triggers a copy of the contents of the SCE, manifesting as a spike in memory usage beyond the build system limits for Bioconductor/Contributions#1150.

I got as far as the .Internal call inside print.default before I gave up trying to figure this out. Need some R wizards, tagging @mtmorgan or @hpages.

Session info ``` R version 3.6.0 Patched (2019-05-02 r76458) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.2 LTS Matrix products: default BLAS: /home/luna/Software/R/R-3-6-branch-dev/lib/libRblas.so LAPACK: /home/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scRNAseq_1.11.8 SingleCellExperiment_1.7.0 [3] SummarizedExperiment_1.15.5 DelayedArray_0.11.4 [5] BiocParallel_1.19.0 matrixStats_0.54.0 [7] Biobase_2.45.0 GenomicRanges_1.37.14 [9] GenomeInfoDb_1.21.1 IRanges_2.19.10 [11] S4Vectors_0.23.17 BiocGenerics_0.31.5 loaded via a namespace (and not attached): [1] tidyselect_0.2.5 purrr_0.3.2 [3] lattice_0.20-38 vctrs_0.2.0 [5] htmltools_0.3.6 BiocFileCache_1.9.1 [7] yaml_2.2.0 interactiveDisplayBase_1.23.0 [9] blob_1.2.0 rlang_0.4.0 [11] pillar_1.4.2 later_0.8.0 [13] glue_1.3.1 DBI_1.0.0 [15] rappdirs_0.3.1 bit64_0.9-7 [17] dbplyr_1.4.2 GenomeInfoDbData_1.2.1 [19] zlibbioc_1.31.0 ExperimentHub_1.11.1 [21] memoise_1.1.0 httpuv_1.5.1 [23] curl_3.3 AnnotationDbi_1.47.0 [25] Rcpp_1.0.1 xtable_1.8-4 [27] backports_1.1.4 promises_1.0.1 [29] BiocManager_1.30.4 XVector_0.25.0 [31] mime_0.7 bit_1.1-14 [33] AnnotationHub_2.17.3 digest_0.6.20 [35] dplyr_0.8.3 shiny_1.3.2 [37] grid_3.6.0 tools_3.6.0 [39] bitops_1.0-6 magrittr_1.5 [41] RCurl_1.95-4.12 tibble_2.1.3 [43] RSQLite_2.1.1 crayon_1.3.4 [45] pkgconfig_2.0.2 zeallot_0.1.0 [47] Matrix_1.2-17 assertthat_0.2.1 [49] httr_1.4.0 R6_2.4.0 [51] compiler_3.6.0 ```
mtmorgan commented 5 years ago

I think the C code at https://github.com/wch/r-source/blob/e1f8bd60dac460211bfdb522137cbea13d2ac0fb/src/main/print.c#L337 does

> do.call(getNamespace("methods")[["show"]], list(Y))

which is correspondingly slow; I don't know why... @lawremi

mtmorgan commented 5 years ago

Well, a little more. using print(Y) passes through unique.c:1410, which triggers the object to be duplicated, with corresponding allocation and garbage collection

 -> 1410        PROTECT(actuals = duplicate(CDR(funcall)));

this comes from match.call() at https://github.com/wch/r-source/blob/e1f8bd60dac460211bfdb522137cbea13d2ac0fb/src/library/methods/R/NextMethod.R#L138, which is in the branch of callNextMethod() that is used when there are no arguments.

So it turns out that a workaround is to use callNextMethod() with the appropriate arguments

SingleCellExperiment master$ git diff
diff --git a/R/SCE-misc.R b/R/SCE-misc.R
index 7f44d02..32b02fa 100644
--- a/R/SCE-misc.R
+++ b/R/SCE-misc.R
@@ -13,7 +13,7 @@ scat <- function(fmt, vals=character(), exdent=2, ...) {
 }

 .sce_show <- function(object) {
-    callNextMethod()
+    callNextMethod(object)
     scat("reducedDimNames(%d): %s\n", reducedDimNames(object))
     scat("spikeNames(%d): %s\n", spikeNames(object))
 }

I think this would also apply to other uses of callNextMethod(). @lawremi probably has more profound input...

LTLA commented 5 years ago

Hm. I was of the understanding that callNextMethod() (w.o. arguments) was the way to do S4 dispatch. If we were meant to do callNextMethod(object), there's a lot of code that would need to change.

mtmorgan commented 5 years ago

Suggested as a workaround. As a reproducible example there is

setClass("A", slots = c(x = "numeric"))
.Aa = setClass("Aa", contains = "A")
setMethod("show", "A", function(object) cat("len", length(object@x), "\n"))
setMethod("show", "Aa", function(object) callNextMethod())

exhibits the issue

a = .Aa(x = rnorm(1e8))
a
system.time(show(a))
system.time(print(a))

with the last two lines

> system.time(show(a))
len 100000000
   user  system elapsed
  0.031   0.001   0.031
> system.time(print(a))
len 100000000
   user  system elapsed
  0.497   0.402   0.899
lawremi commented 5 years ago

Probably just need to shallow_duplicate().

mtmorgan commented 5 years ago

I found that these two changes restore speed

~/src/R-devel$ svn diff
Index: src/library/methods/src/methods_list_dispatch.c
===================================================================
--- src/library/methods/src/methods_list_dispatch.c (revision 76848)
+++ src/library/methods/src/methods_list_dispatch.c (working copy)
@@ -676,7 +676,7 @@
     PROTECT(op = findVarInFrame3(ev, R_dot_nextMethod, TRUE));
     if(op == R_UnboundValue)
    error("internal error in 'callNextMethod': '.nextMethod' was not assigned in the frame of the method call");
-    PROTECT(e = duplicate(matched_call));
+    PROTECT(e = shallow_duplicate(matched_call));
     prim_case = isPrimitive(op);
     if (!prim_case) {
         if (inherits(op, "internalDispatchMethod")) {
Index: src/main/unique.c
===================================================================
--- src/main/unique.c   (revision 76848)
+++ src/main/unique.c   (working copy)
@@ -1407,7 +1407,7 @@
     /* Get the formals and match the actual args */

     formals = FORMALS(b);
-    PROTECT(actuals = duplicate(CDR(funcall)));
+    PROTECT(actuals = shallow_duplicate(CDR(funcall)));

     /* If there is a ... symbol then expand it out in the sysp env
        We need to take some care since the ... might be in the middle

but have no idea what the consequences of using shallow_duplicate() are...

lawremi commented 5 years ago

Should be safe. They're just manipulating the argument lists, not the arguments themselves. When I originally converted things to shallow duplication, I avoided the argument list duplications, because those rarely contain data. The plan was to come back and convert the rest later if necessary.

LTLA commented 5 years ago

Did this get fixed?

mtmorgan commented 5 years ago

pushed to R-3-6-branch and R-devel, just now. This issue can be closed.

LTLA commented 5 years ago

Thanks @mtmorgan.