Bioconductor / GenomicFeatures

Conveniently import and query gene models
https://bioconductor.org/packages/GenomicFeatures

pmapFromTranscripts strange behaviour #10

Open Roleren opened 6 years ago

Roleren commented 6 years ago

There is some strange behaviour in how it handles names in the transcripts. Sorry for the rough test data; I made this quickly:

tx is a GRangesList of 100,000 transcripts; ranges is 600,000 ORFs on those transcripts, as an IRanges; orfs$index gives, for each ORF, the index of the transcript it came from.

See how the time is different:

  1. Without names:
grl <- tx 
names(grl) <- NULL
system.time(pmapFromTranscripts(x = ranges, transcripts = grl[orfs$index]))
   user  system elapsed 
 19.661   1.701  21.355 
  2. With names:
grl <- tx
system.time(pmapFromTranscripts(x = ranges, transcripts = grl[orfs$index]))
   user  system elapsed 
 74.474   3.616  78.071 
  3. Without names, then set them afterwards, so the result is the same as in 2.
    names(grl) <- NULL
    system.time({genomic <- pmapFromTranscripts(x = ranges, transcripts = grl[orfs$index]);
                     names(genomic) <- names(tx)[orfs$index] })
    user  system elapsed 
    19.963   1.634  21.591 

So case 2 is almost 4 times slower, while we could have done case 3, which is as fast as case 1.

Is this intentional?
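
The names-stripping workaround from case 3 can be wrapped in a small helper. This is only a sketch, not part of GenomicFeatures; the function name `mapFromTxFast` and its argument names are made up for illustration:

```r
library(GenomicFeatures)

# Hypothetical wrapper: strip names before mapping, restore them after.
# `ranges` is an IRanges, `tx` a named GRangesList, and `idx` an integer
# vector giving, for each range, the transcript it belongs to.
mapFromTxFast <- function(ranges, tx, idx) {
  grl <- tx[idx]
  names(grl) <- NULL                   # avoid the slow named code path
  genomic <- pmapFromTranscripts(x = ranges, transcripts = grl)
  names(genomic) <- names(tx)[idx]     # restore names afterwards
  genomic
}
```

On the benchmark above this should behave like case 3: same result as case 2, at roughly the speed of case 1.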

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicFeatures_1.33.2 GenomicRanges_1.33.13  IRanges_2.15.17

...

Roleren commented 5 years ago

I have included a C++ version of pmapFromTranscripts in my BioC package ORFik; it fixes a lot of the slowdowns and strange behaviours in the GenomicFeatures version. I called it pmapFromTranscriptF. https://github.com/JokingHero/ORFik/blob/master/src/pmapFromTranscripts.cpp

vobencha commented 5 years ago

Thanks @Roleren . I'm traveling for a few days and will take a look when I'm back. Valerie

Roleren commented 5 years ago

I have been thinking a lot about GenomicFeatures lately. We have a few master's students from algorithms who could do C/C++ implementations as projects, if that is of interest. Of course it needs to be professional enough; let me know what you think.

vobencha commented 5 years ago

@Roleren I am transferring this issue to Martin @mtmorgan . Johannes @jotsetung has also worked on mapping functions in ensembldb and it would be great to coordinate efforts so we end up with a minimal number of mappers.

I think the offer of C/C++ implementations for known bottlenecks would be welcome especially if presented as pull requests with unit tests etc. Martin may expand / have more guidance on that.

Roleren commented 5 years ago

Comparison of the three:

Map 1.9 million IRanges ORFs (result) to the Ensembl human transcriptome (grl).

ORFIK (1.3.7):

system.time(mapped <- ORFik:::pmapFromTranscriptF(result, grl, FALSE))
   user  system elapsed
 31.041  12.012  43.252
22 GB memory usage peak

GenomicFeatures (1.35.4):

system.time(mapped2 <- pmapFromTranscripts(result, grl[result$index]))
   user  system elapsed
352.025  40.455 395.471
61 GB memory peak. # My computer has 64 GB, so this was very close to crashing

identical(ranges(mapped), ranges(mapped2))
[1] TRUE

Ensembldb (2.7.5):

ensemblMap <- function() ensembldb:::transcriptToGenome ... NOTE: I had to make my own version to accept grl as a GRangesList rather than a db object.

system.time(mapped3 <- ensemblMap(result, grl))
Timing stopped at: 781.4 0.156 781.5 # I just stopped it here
2 GB memory peak # <- this version does not vectorize (not a pmap)

Did some timings on smaller objects here:

system.time(ensemblMap(re[1], grl))
   user  system elapsed
  0.101   0.004   0.105
system.time(ensemblMap(re[1:10], grl))
   user  system elapsed
  1.001   0.000   1.002
system.time(ensemblMap(re[1:100], grl))
   user  system elapsed
  9.633   0.000   9.633
system.time(mapped3 <- ensemblMap(re[1:1000], grl))
   user  system elapsed
 99.072   0.008  99.069

identical(ranges(mapped[1:1000]), ranges(mapped3))
[1] TRUE

So 1.9 million ORFs should take around 190,000 seconds, i.e. about 2 days.
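That extrapolation follows by scaling the 1000-ORF timing linearly:

```r
# Linear extrapolation from the 1000-ORF run above (99.072 s elapsed).
secs <- 99.072 / 1000 * 1.9e6  # projected seconds for 1.9 million ORFs
days <- secs / 86400           # seconds per day
secs                           # ~188,000 seconds
days                           # ~2.2 days
```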

The problem with ORFik and GenomicFeatures is the big memory usage, but I'd say people who process 1.9 million uORFs have at least 32 GB of memory. And waiting 31 seconds is far better than 2 days.

The main problem now, I think, is that people like me are starting to use large data-sets, and many Bioconductor packages do not support this well enough yet, creating terrible bottlenecks. I managed to crash a 2 TB memory server running one thread of countOverlaps on 2 million ORFs vs 200 million ribo-seq reads. So then I start to wonder: how can 200 million reads (2.5 GB as an R object) explode into 2 terabytes?
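One common way to keep peak memory bounded in a case like that is to run the overlap counting on slices of the query rather than all at once. A rough sketch (the function name and chunk size are arbitrary, and this trades some speed for memory):

```r
library(GenomicRanges)

# Sketch: count overlaps in chunks of the query to cap peak memory.
# `orfs` and `reads` are assumed to be GRanges objects as in the
# benchmark above; `chunk` controls the memory/speed trade-off.
countOverlapsChunked <- function(orfs, reads, chunk = 1e5L) {
  groups <- split(seq_along(orfs), ceiling(seq_along(orfs) / chunk))
  counts <- lapply(groups, function(i) countOverlaps(orfs[i], reads))
  unlist(counts, use.names = FALSE)
}
```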

jorainer commented 5 years ago

Thanks for the nice benchmark @Roleren . Regarding the ensembldb version: yes, it is not parallelized, and on top of that it also seems to perform pretty poorly. In fact, I implemented it with much smaller use-cases in mind (which, however, also require mapping to and from the proteome). I think that's a nice wake-up call to work a little on performance 😉 - suggestions and pull requests highly welcome.

Just out of personal interest, could you provide the code of your ensemblMap function?

Roleren commented 5 years ago

Use ORFik, or make your own ranges in transcript coordinates. IF you want to use ORFik, do this on BioC devel (3.9), which is much faster:

BiocManager::install("ORFik")
library(ORFik)
library(GenomicFeatures)
txdb <- loadDb(gtfdb) # a gtf file as db
tx <- exonsBy(txdb, by = "tx", use.names = TRUE) # the transcripts, human transcriptome
seqs <- extractTranscriptSeqs(FaFile(faiName), transcripts = tx) # fasta genome

minimumLength <- 0  # ORFik parameters; we use internals from ORFik
longestORF <- FALSE
startCodon <- "ATG" # this is around 5 million ORFs for me; for more, use startCodon = "ATG|CTG"
stopCodon <- ORFik::stopDefinition(1)

orfs <- ORFik:::orfs_as_List(fastaSeqs = as.character(seqs, use.names = FALSE),
                             startCodon = startCodon, stopCodon = stopCodon,
                             minimumLength = minimumLength)

x <- IRanges(orfs$orf[[1]], orfs$orf[[2]])
howManyORFsToTest <- 1e2 # a hundred ORFs, around 10 sec run time in ensembldb
x <- x[1:howManyORFsToTest]
fromTranscript <- orfs$index[1:howManyORFsToTest]
names(x) <- names(tx)[fromTranscript] # which transcript did x[i] come from?
ensemblMap <- function(x, exns){  # this is from function: ensembldb:::.tx_to_genome
  res <- lapply(split(x, f = 1:length(x)), function(z) {
    if (any(names(exns) == names(z))) {
      gen_map <- ensembldb:::.to_genome(exns[[names(z)]], z)
      if (length(gen_map)) {
        mcols(gen_map) <- DataFrame(mcols(gen_map), tx_start = start(z),
                                    tx_end = end(z))
        gen_map[order(gen_map$exon_rank)]
      } else GRanges()
    } else GRanges()
  })
  names(res) <- names(x)
  as(res, "GRangesList")
}

system.time(ensemblMap(x, tx))

jorainer commented 5 years ago

Side note: one of the performance issues in ensembldb is that an SQL query is executed for each input position - this leads to performance gains for few, but losses for many input positions (see also https://github.com/jotsetung/ensembldb/issues/92).

Roleren commented 5 years ago

Nice, yeah, I thought so; it is better in the small cases. In ORFik I sometimes do things like:

if (length(grl) > 5e3) fastForManyFunction
else fastForFewVersion

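
Spelled out as a complete function, that size-based dispatch might look like this (`fastForManyFunction` and `fastForFewVersion` are the placeholders from the snippet above, stubbed out here so the sketch is self-contained):

```r
# Placeholder implementations: a vectorized, memory-hungry path and a
# per-element, low-memory path.
fastForManyFunction <- function(grl) "vectorized path"
fastForFewVersion   <- function(grl) "per-element path"

# Dispatch on input size; the 5e3 cutoff is the heuristic mentioned above.
mapORFs <- function(grl, cutoff = 5e3) {
  if (length(grl) > cutoff) fastForManyFunction(grl)
  else fastForFewVersion(grl)
}
```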
mtmorgan commented 5 years ago

I appreciate the contribution. GenomicFeatures has RUnit-based unit tests at inst/unitTests/test_coordinate-mapping-methods.R; is there any indication of how comprehensive the other implementations' tests are? Something like BiocGenerics:::testPackage("GenomicFeatures", pattern = "test_coordinate") runs the tests on the installed package.

In general I think it would be helpful to formalize the expectations for these functions, including backward compatibility except in case of error. Ideally as extensions of the unit tests in GenomicFeatures.

Roleren commented 5 years ago

Hm, just one thing: I write most of my C code with Rcpp, but most standard packages use pure C. Will that continue, or will there be a change to Rcpp? I just think coding time and error checking are much easier with Rcpp.

mtmorgan commented 5 years ago

There's value in consistency within a package, but if C++ is a better choice for a particular task then that's fine.