Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

Speed difference in way you add mcols to GRanges #11

Closed Roleren closed 5 years ago

Roleren commented 5 years ago

grl is some large GRangesList gr <- unlist(gr, use.names = F)

system.time(gr$names <- "1") user system elapsed 19.139 0.014 19.152

temp <- gr gr <- unlist(gr, use.names = F) system.time(mcols(x = gr) <- DataFrame(row.names = names(gr), names = rep("1", length(gr)))) user system elapsed 0.024 0.000 0.025

identical(temp, gr) [1] TRUE

This is a 800x speed increase in case 2, I think this slows down a lot of packages here on Bioc, since they often use the first syntax.

Or is there a specific reason for this ?

sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] GenomicFeatures_1.33.2 GenomicRanges_1.33.13 IRanges_2.15.17

...

hpages commented 5 years ago

Note that they are not identical for me:

library(GenomicRanges)
example(GRangesList)
biggrl <- rep.int(grl, 150000)
gr1 <- gr2 <- unlist(biggrl, use.names=FALSE)
gr1$foo <- 1
mcols(gr2) <- DataFrame(foo=rep(1, length(gr2)))
identical(gr1, gr2)
# [1] FALSE

This is because gr1$foo <- 1 adds a new metadata column (or replace an existing one) whereas mcols(gr2) <- DataFrame(...) sets new metadata columns "from scratch" i.e. drops all the existing ones. So the former has to pay the extra cost of binding the new metadata column to the existing ones.

Anyway the price of this binding seems pretty high. Down deep in the code this binding is done with:

DF1 <- mcols(gr1)
DF2 <- DataFrame(foo=rep(1, length(gr1)))
system.time(c(DF1, DF2))
#    user  system elapsed 
#   2.045   0.000   2.045 

The c method for DataFrame objects uses cbind() to combine its arguments. More precisely, it uses the do.call(cbind, ...) form which seems to be much slower than calling cbind() directly:

system.time(do.call(cbind, list(DF1, DF2)))
#    user  system elapsed 
#   2.035   0.000   2.036 
system.time(cbind(DF1, DF2))
#    user  system elapsed 
#   0.002   0.000   0.001 

Same problem if I replace cbind with the DataFrame constructor:

system.time(do.call(DataFrame, list(DF1, DF2)))
#    user  system elapsed 
#   2.835   0.000   2.835 
system.time(DataFrame(DF1, DF2))
#    user  system elapsed 
#   0.002   0.000   0.001 

@lawremi Any idea why would that be?

Thanks, H.

Roleren commented 5 years ago

Yes, forgot that, here is time if you want to append and delete if equal:

Given last output, lets append 1 mcol more:

Slow version: system.time(temp$names2 <- "2") user system elapsed 20.231 0.016 20.247

Faster version: system.time({gr$names2 <- NULL; mcols(x = gr) <- DataFrame(row.names = names(gr), mcols(gr), names2 = "2")}) user system elapsed 0.052 0.000 0.052

identical(temp, gr) [1] TRUE

From what you show, it looks like they call different cbind functions ? I can try to profile code and see what do.call(cbind) does.

Roleren commented 5 years ago

Okey, from what I see:

Rprof() temp$names2 <- "2" Output is: Slow version, repeats this line over and over: "deparse" "FUN" "lapply" "sapply" "sapply" "DataFrame" "cbind" "eval" "eval" "eval" "standardGeneric" "" "do.call" "do.call" ".local" "c" "c" "c" "coerce2" "coerce2" ".append_list_element" "setListElement" "setListElement" ".nextMethod" "eval" "eval" "callNextMethod" "[[<-" "[[<-" "$<-"

Fast version gives this line: "DataFrame"

Another strange thing is that if you do this again: temp$names2 <- "2" it ends in 0.2 seconds, so it reuses it's old DataFrame column.

But doing temp$SomethingNew <- "2" Takes forever.

Would it be safe to overwrite the $<- operator to do DataFrame directly ?

hpages commented 5 years ago

Hi @Roleren ,

Looks like @lawremi took care of this in S4Vectors. See https://github.com/Bioconductor/S4Vectors/issues/24

H.