Closed romanzenka closed 3 years ago
@romanzenka We know about that. The current version tries to fetch everything. We are going to fix it.
A possible workaround is applying some filtering using rawrr::readIndex('someRawFileName')
and fetching only the scans of interest. Why do you want to read all spectra at once?
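A minimal sketch of that workaround, assuming the index returned by `rawrr::readIndex()` is a data frame with `scan` and `MSOrder` columns and that MS2 scans are labelled `"Ms2"` (both are assumptions here; check the actual index before relying on them). The helper `select_scans()` is hypothetical, and a small stand-in data frame is used so the snippet runs without a raw file:

```r
# Hypothetical helper: pick the scan numbers of one MS level from an index.
# Assumes the index has columns `scan` and `MSOrder`.
select_scans <- function(index, ms_order = "Ms2") {
  index$scan[as.character(index$MSOrder) == ms_order]
}

# Stand-in for an index, so the example runs without a raw file.
index <- data.frame(
  scan = 1:6,
  MSOrder = c("Ms", "Ms2", "Ms2", "Ms", "Ms2", "Ms2")
)
ms2_scans <- select_scans(index)

# With a real file one would do something like (not run here):
# index <- rawrr::readIndex(rawfile)
# S <- rawrr::readSpectrum(rawfile, scan = select_scans(index))
```

Only the selected scan numbers are then passed to a single `rawrr::readSpectrum()` call.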
We are essentially making a specialized "search engine" that processes all spectra.
We understand that going to C / .NET would be best for such a job, but R is very convenient otherwise and has a lot of functionality we like. Being able to do these odd jobs in R would be great.
I'd be willing to try to provide a pull request, but I am afraid I'd collide with your design plans as you are already aware of this issue.
Hi @romanzenka,
Some comments: it is true that fetching a small number of spectra is relatively slow. This is due to a large processing overhead when calling our managed code (rawrr.exe) through a system call, plus writing temporary files to disk and then reading and parsing them. I recommend looking at this presentation, especially slide 5. @cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs). The first results look very promising, though, and would boost reading speeds, especially for very small and selective data requests on many files!
Hope this helps, Tobi
Regarding your plans of implementing a search engine directly in R: I have big doubts that this makes sense! R is an interpreted language and not suited for heavy data lifting. This is why most R functions that crucially depend on performance are implemented in C.
see http://adv-r.had.co.nz/Performance.html
If you still think you are missing a crucial functionality that could be provided by rawrr, please feel free to suggest something and we can think about making it happen, BUT it should make sense from a code design perspective.
...and because you phrased this statement in a very factual way:
"It would take 3 hours just to read a single file."
No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That would only be the case if you called the rawrr::readSpectrum() function n times, each time targeting a single spectrum. I guess I don't have to go into the details why this is not smart. ;-) The proof is again on slide 5.
> No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That would only be the case if you called the rawrr::readSpectrum() function n times, each time targeting a single spectrum. I guess I don't have to go into the details why this is not smart. ;-) The proof is again on slide 5.
I understand that very well, which is why I only call the function once. The speed is still so slow that it is not usable. I suspect that is because the function gathers metadata one spectrum at a time, which likely involves many seeks within the .raw file to gather all that info, plus complex parsing and similar.
> @cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs).
I agree that having the engine in memory, "heated up and raring to go", would be of great benefit if you can pull it off.
The low speed I am experiencing is most likely not a result of writing/parsing text files - that operation takes a tiny fraction of the time considering the size of one spectrum. A second is basically an eon in computer time... my hard drive can pump ~100MB into memory in a single second. The inefficiency is likely elsewhere, but I shall not speculate before I have numbers.
A developer from the ProteoWizard/MSconvert project once told me: "When using vendor libraries you need to know how to pet the cat!" So, if you think you know better than @cpanse, please go ahead and suggest changes to our managed code. The C# source is available here. We are always open for pull requests as long as they comply with the Bioc guidelines and fit into the package scope. An example can be found here
I think what I have to do is to create a version of the scan reading function that reads only what I need and nothing more. That should cut down on the time spent gathering the additional metadata that my code downstream simply ignores. If that is not going to be good enough, it might be necessary for the vendor to provide some "accelerator" functions, using their deep knowledge of the file format.
Also, I realized that the way data is passed into R at the moment is by generation and subsequent parsing of R source code. So the second trick would be to pass the data maybe as raw bytes, and then disentangle them on the R end using a simpler method than full-blown "eval" which has to be ready for anything an R programmer can throw at it - thus more complex - thus slower.
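A rough sketch of the raw-bytes idea, using only base R's `writeBin()` and `readBin()`. The peak values and the layout (all m/z values followed by all intensities, as 8-byte doubles) are made up for illustration; this is not the format rawrr uses:

```r
# Hypothetical peak list for one spectrum.
mz <- c(445.12, 519.14, 593.16)
intensity <- c(1.2e5, 3.4e4, 8.9e3)

# Serialize both vectors as 8-byte doubles into one raw vector.
bytes <- writeBin(c(mz, intensity), raw())

# Decode: read all doubles back, then split them into the two halves.
values <- readBin(bytes, what = "double", n = length(bytes) / 8)
half <- length(values) / 2
decoded <- list(
  mZ = values[seq_len(half)],
  intensity = values[half + seq_len(half)]
)
```

Decoding this way involves no parsing or evaluation of R source code; the values are copied back bit for bit.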
@romanzenka Can you provide more details of your request?
What data do you want? E.g., centroided peaks or segments (profile)?
How do you want the data to be read by R? E.g., base64 encoded, one peak list per line, using the scan method.
Can you provide me access to a raw file you are going to use? (you can also send me an email cp@fgcz.ethz.ch with the download link)
I think https://github.com/fgcz/rawrr/issues/44 is the ultimate way to go. Meanwhile, I can try to provide a code snippet to solve your issue.
@cpanse
At the moment it is incredibly bare-bones. I basically need the precursor m/z and charge, then two arrays (or one interleaved, or whatever) of m/z + intensity pairs, centroided.
Since I spoke to you, I did some minor benchmarking.
a <- 1:10000 / 7 # Some numbers
v <- paste0("list(a=c(", paste(a, collapse=", "), ")")
microbenchmark::microbenchmark(eval(v))
... and I am getting about 1.5 microseconds for this. That could mean that maybe the R parse is fast enough and this is not the culprit, so we could spare ourselves the pain of doing a binary transfer or base64.
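One caveat about the snippet above: `eval()` applied to a character string simply returns the string, so that timing does not include any parsing (the string as built is also missing one closing parenthesis). A minimal sketch that exercises the actual parse step, using base R's `system.time()` rather than microbenchmark, could look like this; no timing result is claimed here:

```r
a <- 1:10000 / 7  # some numbers

# Build syntactically complete R source: list(a=c(...))
v <- paste0("list(a=c(", paste(a, collapse = ", "), "))")

unparsed <- eval(v)               # a character string: nothing was parsed
parsed <- eval(parse(text = v))   # a list with the numeric vector `a`

# Time 100 rounds of parse + eval.
elapsed <- system.time(for (i in 1:100) eval(parse(text = v)))
```

Comparing `elapsed` with the read times reported in this thread would show whether the R-side parse is a relevant share of the total.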
@romanzenka I hope that helps.
commit 1637d6f0 on git@git.bioconductor.org:packages/rawrr (check out and R CMD build or wait for two days)
# fetch via ExperimentHub
library(ExperimentHub)
eh <- ExperimentHub::ExperimentHub()
EH4547 <- normalizePath(eh[["EH4547"]])
(rawfile <- paste0(EH4547, ".raw"))
if (!file.exists(rawfile)) {
  file.copy(EH4547, rawfile)
}
R> bm <- lapply(2^(0:14), function(n, ...){
+ m0 <- microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='default')}, ...)
+ m1 <- microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='barebone')}, ...)
+
+ data.frame(time = c(m0$time, m1$time), mode=c('default', 'barebone'), n=n)
+ }, times=1, unit="nanosecond") |> Reduce(f='rbind')
R> bm
time mode n
1 983118992 default 1
2 906433494 barebone 1
3 902113611 default 2
4 871311213 barebone 2
5 890822867 default 4
6 879356766 barebone 4
7 895267636 default 8
8 909109441 barebone 8
9 930387498 default 16
10 881011362 barebone 16
11 929100467 default 32
12 857490072 barebone 32
13 914358999 default 64
14 872367250 barebone 64
15 962366760 default 128
16 876129902 barebone 128
17 996060642 default 256
18 908822154 barebone 256
19 1170730769 default 512
20 925475452 barebone 512
21 1963340186 default 1024
22 1120511427 barebone 1024
23 3557690212 default 2048
24 1409178241 barebone 2048
25 6165030108 default 4096
26 1976297334 barebone 4096
27 10846751392 default 8192
28 3010938648 barebone 8192
29 29449842481 default 16384
30 6763253400 barebone 16384
R> lattice::xyplot(time ~ n, groups=bm$mode, data=bm, type='b', scale=list(log=TRUE), ylab='time [in nanosecond]', xlab='number of spectra')
R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.0.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] tartare_1.7.2 ExperimentHub_2.1.4 AnnotationHub_3.1.5
[4] BiocFileCache_2.0.0 dbplyr_2.1.1 BiocGenerics_0.39.2
loaded via a namespace (and not attached):
[1] KEGGREST_1.33.0 tidyselect_1.1.1
[3] BiocVersion_3.14.0 purrr_0.3.4
[5] lattice_0.20-44 vctrs_0.3.8
[7] generics_0.1.0 htmltools_0.5.2
[9] stats4_4.1.1 yaml_2.2.1
[11] utf8_1.2.2 interactiveDisplayBase_1.31.2
[13] blob_1.2.2 rlang_0.4.11
[15] pillar_1.6.3 later_1.3.0
[17] withr_2.4.2 glue_1.4.2
[19] DBI_1.1.1 rappdirs_0.3.3
[21] bit64_4.0.5 GenomeInfoDbData_1.2.7
[23] lifecycle_1.0.1 zlibbioc_1.39.0
[25] Biostrings_2.61.2 memoise_2.0.0
[27] Biobase_2.53.0 IRanges_2.27.2
[29] fastmap_1.1.0 httpuv_1.6.3
[31] GenomeInfoDb_1.29.8 curl_4.3.2
[33] fansi_0.5.0 AnnotationDbi_1.55.1
[35] Rcpp_1.0.7 xtable_1.8-4
[37] promises_1.2.0.1 filelock_1.0.2
[39] BiocManager_1.30.16 cachem_1.0.6
[41] S4Vectors_0.31.4 XVector_0.33.0
[43] mime_0.11 bit_4.0.4
[45] microbenchmark_1.4.9 png_0.1-7
[47] digest_0.6.27 dplyr_1.0.7
[49] shiny_1.7.0 grid_4.1.1
[51] tools_4.1.1 bitops_1.0-7
[53] magrittr_2.0.1 RCurl_1.98-1.4
[55] tibble_3.1.4 RSQLite_2.2.8
[57] rawrr_1.3.2 crayon_1.4.1
[59] pkgconfig_2.0.3 ellipsis_0.3.2
[61] rstudioapi_0.13 assertthat_0.2.1
[63] httr_1.4.2 R6_2.5.1
[65] compiler_4.1.1
Cheers
Thank you! I have achieved very comparable results (modulo the start, some caches were not warm enough):
Testing on our files now.
I have noticed that if I try to read a non-centroided spectrum with "barebone", I get an error - which is 100% ok with me.
I'm updating the test to a) read only MS2 spectra b) cycle through different files so we do not get overly optimistic results thanks to caching of previously loaded data.
Hopefully I will have plots shortly - what I am curious about seeing is "spectra per second", so I'll modify the plot a bit.
Below is a chart (it tops out at 8192 spectra because the code crashed, investigating now) showing the times.
The difference is that each microbenchmark is run on a completely different .raw file to reduce the effect of caching. I used a 24-fraction set of .raw files to make sure I have a fresh one for each query.
Here is the same thing with spectra per second plotted on Y axis. The update you provided did have a dramatic effect on read times. Thank you!
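For reference, spectra per second can be derived from a benchmark data frame shaped like `bm` from the earlier comment (`time` in nanoseconds, `n` spectra). The sketch below uses only the two n = 16384 rows copied from that single-file benchmark as stand-in data, not the multi-file measurements behind the chart:

```r
# Two rows copied from the earlier single-file benchmark (time in nanoseconds).
bm <- data.frame(
  time = c(29449842481, 6763253400),
  mode = c("default", "barebone"),
  n = c(16384, 16384)
)

# Convert nanoseconds to seconds, then divide the spectrum count by it.
bm$spectra_per_second <- bm$n / (bm$time / 1e9)

# Plotting would follow the same pattern as before (not run here):
# lattice::xyplot(spectra_per_second ~ n, groups = bm$mode, data = bm,
#                 type = 'b', scales = list(log = TRUE))
```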
Well, I tracked down the bug. If I load 16,384 spectra from a particular file, my R crashes when it tries to source the resulting 1.1GB of R source code. The extraction itself takes about 1 minute, at some impressive 270 spectra per second... but then R cannot handle the parse on my 32GB RAM laptop. I get:
negative length vectors are not allowed
I think we ran over max vector lengths in R. That might be a future improvement, for now I will simply run the input in chunks big enough to get me speed, but small enough not to kill R.
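A minimal sketch of such chunked reading. The helper `read_in_chunks()` and the stand-in `fake_reader()` are hypothetical; with a real file the reader would wrap `rawrr::readSpectrum()`, and the chunk size would have to be found by experiment:

```r
# Hypothetical helper: split the scan numbers into chunks, call `reader`
# once per chunk, and concatenate the per-chunk lists into one list.
read_in_chunks <- function(scans, chunk_size, reader) {
  chunk_id <- ceiling(seq_along(scans) / chunk_size)
  chunks <- split(scans, chunk_id)
  results <- lapply(chunks, reader)
  do.call(c, unname(results))
}

# Stand-in reader so the example runs without a raw file:
# returns one small list per requested scan.
fake_reader <- function(idx) lapply(idx, function(i) list(scan = i))

spectra <- read_in_chunks(1:10, chunk_size = 4, reader = fake_reader)

# With a real file (not run here), the reader could be:
# reader <- function(idx) rawrr::readSpectrum(rawfile, idx, mode = 'barebone')
```

Each call then produces a bounded amount of R source to parse, at the cost of paying the per-call start-up overhead once per chunk.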
> I have noticed that if I try to read a non-centroided spectrum with "barebone", I get an error - which is 100% ok with me.
Thanks; I fixed that in commit 36f43e15.
rawrr::readSpectrum is very slow, making it unusable for reading files with 10,000s of spectra
By slow I mean it takes ~1 second on my 1-year-old MacBook Pro to read a spectrum. (I do call the function once, with a list of spectrum ids.)
It would take 3 hours just to read a single file. That renders the package unusable by some two orders of magnitude.
I will be investigating to figure out what the culprit is. It might be necessary to add switches that remove some "advanced" functionality from spectrum reads to get the performance back (?).