It seems that alignedSubject and alignedPattern take an unexpectedly long time to run:
library(Biostrings)
system.time(aln <- pairwiseAlignment(subject=DNAString(c("AAACGATCAGCTACGAACACT")),
DNAStringSet(rep("AACGAGGGCCACCTAGGAAGAAT", 1000))))
## user system elapsed
## 0.208 0.008 0.219
system.time(X <- alignedPattern(aln))
## user system elapsed
## 16.622 0.008 16.783
system.time(Y <- alignedSubject(aln))
## user system elapsed
## 15.862 0.008 16.011
That is roughly 75 times slower than the alignment itself (about 16 s versus 0.2 s), which I would have expected to be the most computationally intensive part of the process! This is a shame, as we've been using the full alignment strings for large-scale processing of Nanopore data. I assume the slowness arises because the trailing gap characters (-) are appended to each aligned sequence in an lapply loop inside get_aligned_pattern, rather than in C.
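To make the suspected pattern concrete, here is a minimal base-R sketch of per-element versus vectorised gap padding. The helper names pad_loop and pad_vectorised are hypothetical and are not part of Biostrings; this is not the package's actual code, and it is not expected to reproduce the timings above. It only shows that the two approaches give the same padded strings.

```r
# Hypothetical helpers for illustration only; not Biostrings internals.

# Per-element padding: one R-level function call per sequence.
pad_loop <- function(seqs, width) {
  unlist(lapply(seqs, function(s) {
    paste0(s, strrep("-", width - nchar(s)))
  }))
}

# Vectorised padding: a single call handles all sequences at once.
# Assumes every sequence is no longer than 'width'.
pad_vectorised <- function(seqs, width) {
  paste0(seqs, strrep("-", width - nchar(seqs)))
}

seqs <- c("AACGA", "AACGAGG", "AAC")
padded <- pad_vectorised(seqs, 8L)

# Both approaches should produce identical results.
stopifnot(identical(padded, pad_loop(seqs, 8L)))
```

If the per-element work in get_aligned_pattern is the bottleneck, a similar vectorised (or C-level) formulation might remove most of the overhead, though I have not profiled this.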
R version 3.5.0 Patched (2018-04-30 r74679)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] Biostrings_2.49.0 XVector_0.21.1 IRanges_2.15.13
[4] S4Vectors_0.19.11 BiocGenerics_0.27.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.27.0 compiler_3.5.0