Closed Roleren closed 5 years ago
Note that they are not identical for me:
library(GenomicRanges)
example(GRangesList)
biggrl <- rep.int(grl, 150000)
gr1 <- gr2 <- unlist(biggrl, use.names=FALSE)
gr1$foo <- 1
mcols(gr2) <- DataFrame(foo=rep(1, length(gr2)))
identical(gr1, gr2)
# [1] FALSE
This is because gr1$foo <- 1
adds a new metadata column (or replace an existing one) whereas mcols(gr2) <- DataFrame(...)
sets new metadata columns "from scratch" i.e. drops all the existing ones. So the former has to pay the extra cost of binding the new metadata column to the existing ones.
Anyway the price of this binding seems pretty high. Down deep in the code this binding is done with:
DF1 <- mcols(gr1)
DF2 <- DataFrame(foo=rep(1, length(gr1)))
system.time(c(DF1, DF2))
# user system elapsed
# 2.045 0.000 2.045
The c
method for DataFrame objects uses cbind()
to combine its arguments. More precisely, it uses the do.call(cbind, ...)
form which seems to be much slower than calling cbind()
directly:
system.time(do.call(cbind, list(DF1, DF2)))
# user system elapsed
# 2.035 0.000 2.036
system.time(cbind(DF1, DF2))
# user system elapsed
# 0.002 0.000 0.001
Same problem if I replace cbind
with the DataFrame constructor:
system.time(do.call(DataFrame, list(DF1, DF2)))
# user system elapsed
# 2.835 0.000 2.835
system.time(DataFrame(DF1, DF2))
# user system elapsed
# 0.002 0.000 0.001
@lawremi Any idea why would that be?
Thanks, H.
Yes, forgot that, here is time if you want to append and delete if equal:
Given last output, lets append 1 mcol more:
Slow version: system.time(temp$names2 <- "2") user system elapsed 20.231 0.016 20.247
Faster version: system.time({gr$names2 <- NULL; mcols(x = gr) <- DataFrame(row.names = names(gr), mcols(gr), names2 = "2")}) user system elapsed 0.052 0.000 0.052
identical(temp, gr) [1] TRUE
From what you show, it looks like they call different cbind functions ? I can try to profile code and see what do.call(cbind) does.
Okey, from what I see:
Rprof()
temp$names2 <- "2"
Output is:
Slow version, repeats this line over and over:
"deparse" "FUN" "lapply" "sapply" "sapply" "DataFrame" "cbind" "eval" "eval" "eval" "standardGeneric" "
Fast version gives this line: "DataFrame"
Another strange thing is that if you do this again: temp$names2 <- "2" it ends in 0.2 seconds, so it reuses it's old DataFrame column.
But doing temp$SomethingNew <- "2" Takes forever.
Would it be safe to overwrite the $<- operator to do DataFrame directly ?
Hi @Roleren ,
Looks like @lawremi took care of this in S4Vectors. See https://github.com/Bioconductor/S4Vectors/issues/24
H.
grl is some large GRangesList gr <- unlist(gr, use.names = F)
system.time(gr$names <- "1") user system elapsed 19.139 0.014 19.152
temp <- gr gr <- unlist(gr, use.names = F) system.time(mcols(x = gr) <- DataFrame(row.names = names(gr), names = rep("1", length(gr)))) user system elapsed 0.024 0.000 0.025
identical(temp, gr) [1] TRUE
This is a 800x speed increase in case 2, I think this slows down a lot of packages here on Bioc, since they often use the first syntax.
Or is there a specific reason for this ?
sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)
Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages: [1] GenomicFeatures_1.33.2 GenomicRanges_1.33.13 IRanges_2.15.17
...