Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

pc Performance Slow for Two GRangesList #44

Open DarioS opened 3 years ago

DarioS commented 3 years ago

A previous question on the support site has a comment on an answer from the person asking the question:

Thank you Mike and Michael for your responses. Michael's solution works in sub-second time. mapply didn't return after a few minutes on real input so I stopped it.

The solution involves the use of pc. When I do it with the current version of GenomicRanges and R 4.0.1, it takes a few minutes. Could it be a performance regression since 2017? The length of each list is also about 20000 components, like the original question.

> length(upstreamCisRegions)
[1] 23215
> length(downstreamCisRegions)
[1] 23215
system.time(allCisRegions <- pc(upstreamCisRegions, downstreamCisRegions))
   user  system elapsed 
567.914   9.354 578.292
lawremi commented 3 years ago

Thanks for discovering and reporting this. Would you please provide a reproducible example?

DarioS commented 3 years ago

I made a minimal example, and it worked fast. It must be something particular about my data.

GRL1 <- split(GRanges("chr1", IRanges(rpois(1000000, 100000), width = rpois(1000000, 1000))),
              sample(20000, 1000000, replace = TRUE))
GRL2 <- split(GRanges("chr1", IRanges(rpois(1000000, 100000), width = rpois(1000000, 1000))),
              sample(20000, 1000000, replace = TRUE))
GRL1 <- as(GRL1, "GRangesList")
GRL2 <- as(GRL2, "GRangesList")
system.time(pc(GRL1, GRL2))
user  system elapsed 
  0.314   0.035   0.363

I wanted to share my real data for you to check, but saving it failed after a few minutes of processing.

> sum(elementNROWS(upstreamCisRegions))
[1] 437067
> sum(elementNROWS(downstreamCisRegions))
[1] 596053
> class(upstreamCisRegions)
[1] "SimpleGRangesList"
attr(,"package")
[1] "GenomicRanges"
> class(downstreamCisRegions)
[1] "SimpleGRangesList"
attr(,"package")
[1] "GenomicRanges"
> save.image(upstreamCisRegions, downstreamCisRegions, file = "CisRegions.RData")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE,  : 
  subscript is out of bounds
Warning message:
In file.remove(outfile) :
  cannot remove file 'CisRegions.RDataTmp', reason 'No such file or directory'

Saving the test lists also fails.

> class(GRL1)
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"
> class(GRL2)
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"
> save.image(GRL1, GRL2, file = "test.RData")
Error in relist(flesh, PartitioningByEnd(skeleton)) : 
  shape of 'skeleton' is not compatible with 'NROW(flesh)'
Warning message:
In file.remove(outfile) :
  cannot remove file 'test.RDataTmp', reason 'No such file or directory'

I have a fairly standard R configuration.

> sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.48.0     BiocParallel_1.22.0    GenomicFeatures_1.40.0 AnnotationDbi_1.50.0  
 [5] Biobase_2.48.0         GenomicRanges_1.40.0   GenomeInfoDb_1.24.2    IRanges_2.22.2        
 [9] S4Vectors_0.26.1       BiocGenerics_0.34.0
Roleren commented 3 years ago

dont use save.image, use save save(GRL1, GRL2, file = "test.RData")

This should work

DarioS commented 3 years ago

Oh, of course. That is a silly mistake which can be made when tired. @lawremi please see the two R objects now attached. Change ZIP extension to RData after download and load into R.

test.zip