fgcz / rawrr

Access Orbitrap data in R lang using C# mono assembly - bioconductor package
https://bioconductor.org/packages/rawrr/

Low performance when reading a lot of spectra #43

Closed romanzenka closed 2 years ago

romanzenka commented 2 years ago

rawrr::readSpectrum is very slow, making it unusable for reading files with 10,000s of spectra.

By slow I mean it takes ~1 second on my 1-year-old MacBook Pro to read a spectrum. (I do call the function once, with a list of spectrum IDs.)

It would take 3 hours just to read a single file. That makes the package too slow to be usable by some two orders of magnitude.

I will be investigating to figure out what is the culprit. It might be necessary to add switches that remove some "advanced" functionality from spectrum reads to get the performance back (?).

cpanse commented 2 years ago

@romanzenka We know about that. The current version tries to fetch everything, and we are going to fix it. A possible workaround is to filter with rawrr::readIndex('someRawFileName') and fetch only the scans of interest. Why do you want to read all spectra at once? C
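
The filtering workaround might look like the sketch below. It is only an illustration: the `scan` and `MSOrder` column names and the `"Ms2"` label are assumptions about what `rawrr::readIndex()` returns, and the small data frame stands in for a real index so the example runs without a raw file.

```r
# Stand-in for idx <- rawrr::readIndex(rawfile); column names and values are assumed.
idx <- data.frame(
  scan    = 1:8,
  MSOrder = c("Ms", "Ms2", "Ms2", "Ms", "Ms2", "Ms2", "Ms2", "Ms"),
  stringsAsFactors = FALSE
)

# Keep only the scan numbers of interest, here the MS2 scans.
scans_of_interest <- idx$scan[idx$MSOrder == "Ms2"]

# With a real file, a single call would then fetch just those scans:
# S <- rawrr::readSpectrum(rawfile, scans_of_interest)
```

The point of the design is that the index is cheap to read, so the expensive spectrum read is restricted to the scans that are actually needed.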

romanzenka commented 2 years ago

We are essentially making a specialized "search engine" that processes all spectra.

We understand that going to C / .NET would be best for such a job, but R is very convenient otherwise and has a lot of functionality we like. Being able to do these odd jobs in R would be great.

I'd be willing to try to provide a pull request, but I am afraid I'd collide with your design plans as you are already aware of this issue.

tobiasko commented 2 years ago

Hi @romanzenka,

Some comments: it is true that fetching a small number of spectra is relatively slow. This is due to a big processing overhead when calling our managed code (the rawrr.exe) using a system call, plus writing tmp files to disk and then reading and parsing the tmp data. I recommend looking at this presentation, especially slide 5. @cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs). But the first results look very promising and would boost reading speeds, especially for very small and selective data requests on many files!

Hope this helps, Tobi

tobiasko commented 2 years ago

Regarding your plans of implementing a search engine directly in R: I have big doubts that this makes sense! R is an interpreted language and not suited for heavy data lifting. This is why most R functions that crucially depend on performance are implemented in C.

see http://adv-r.had.co.nz/Performance.html

If you still think you are missing crucial functionality that could be provided by rawrr, please feel free to suggest something and we can think about making it happen, BUT it should make sense from a code design perspective.

tobiasko commented 2 years ago

...and because you phrased this statement in a very factual way:

"It would take 3 hours just to read a single file."

No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That is only the case if you call the rawrr::readSpectrum() function n times, each time targeting a single spectrum. I guess I don't have to go into the details of why this is not smart. ;-) The proof is again on slide 5.

romanzenka commented 2 years ago

"No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That is only the case if you call the rawrr::readSpectrum() function n times, each time targeting a single spectrum. I guess I don't have to go into the details of why this is not smart. ;-) The proof is again on slide 5."

I understand that very well, which is why I only call the function once. The speed is still so slow that it is not usable. I suspect that is because the function gathers metadata one spectrum at a time, which likely involves many seeks within the .raw file to gather all that info, plus complex parsing and similar.

romanzenka commented 2 years ago

"@cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs)."

I agree that having the engine in memory, "heated up and raring to go", would be of great benefit if you can pull it off.

The low speed I am experiencing is most likely not a result of writing/parsing text files - that operation takes a tiny fraction of the time considering the size of one spectrum. A second is basically an eon in computer time... my hard drive can pump ~100MB into memory in a single second. The inefficiency is likely elsewhere, but I shall not speculate before I have numbers.

tobiasko commented 2 years ago

A developer from the ProteoWizard/MSconvert project once told me: "When using vendor libraries you need to know how to pet the cat!" So, if you think you know better than @cpanse, please go ahead and suggest changes to our managed code. The C# source is available here. We are always open to pull requests as long as they comply with the Bioc guidelines and fit into the package scope. An example can be found here.

romanzenka commented 2 years ago

I think what I have to do is to create a version of the scan reading function that reads only what I need and nothing more. That should cut down on the time spent gathering the additional metadata that my code downstream simply ignores. If that is not going to be good enough, it might be necessary for the vendor to provide some "accelerator" functions, using their deep knowledge of the file format.

Also, I realized that data is currently passed into R by generating R source code and then parsing it. So the second trick would be to pass the data as raw bytes, perhaps, and then disentangle them on the R end using a simpler method than full-blown "eval", which has to be ready for anything an R programmer can throw at it and is therefore more complex and slower.
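
As a rough illustration of the raw-bytes idea, the base R sketch below packs two numeric vectors into a raw vector and decodes them again. It is not how rawrr transfers data: the payload layout (all m/z values followed by all intensities), the example numbers, and the `mZ`/`intensity` names are hypothetical, and R plays both the sending and the receiving side here, whereas in the package the sender would be the managed C# code.

```r
# Hypothetical payload: m/z and intensity values of one spectrum.
mz        <- c(445.1200, 519.1387, 593.1575)
intensity <- c(1.2e5, 3.4e4, 8.9e3)

# Sender side: pack both vectors as little-endian 8-byte doubles.
payload <- writeBin(c(mz, intensity), raw(), size = 8, endian = "little")

# Receiver side: read the doubles back and split them into the two vectors.
values  <- readBin(payload, what = "double", n = length(payload) / 8,
                   size = 8, endian = "little")
n_peaks <- length(values) / 2
decoded <- list(mZ        = values[seq_len(n_peaks)],
                intensity = values[n_peaks + seq_len(n_peaks)])
```

Decoding with `readBin()` needs no parser at all, and the values come back bit-for-bit identical instead of being rounded through a decimal text representation.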

cpanse commented 2 years ago

@romanzenka Can you provide more details of your request?

I think https://github.com/fgcz/rawrr/issues/44 is the ultimate way to go. Meanwhile, I can try to provide a code snippet to solve your issue.

romanzenka commented 2 years ago

@cpanse

a <- 1:10000 / 7 # Some numbers
v <- paste0("list(a=c(", paste(a, collapse=", "), ")")
microbenchmark::microbenchmark(eval(v))

... and I am getting about 1.5 microseconds for this. That could mean that maybe the R parse is fast enough and this is not the culprit, so we could spare ourselves the pain of doing a binary transfer or base64.
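
One caveat about this measurement: `eval()` applied to a character vector simply returns the string, so the call above never parses the generated code, and the string is also missing one closing parenthesis. A sketch that does time the parse step, using only base R (`system.time()` rather than `microbenchmark`, to keep it dependency-free), could look like this:

```r
a <- 1:10000 / 7                                            # some numbers
v <- paste0("list(a=c(", paste(a, collapse = ", "), "))")   # balanced parentheses

# parse() turns the text into an expression; eval() then builds the list.
elapsed <- system.time(res <- eval(parse(text = v)))[["elapsed"]]

# paste() prints 15 significant digits, so compare with a tolerance.
stopifnot(isTRUE(all.equal(res$a, a)))
```

The value of `elapsed` depends on the machine, so no timing is quoted here.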

cpanse commented 2 years ago

@romanzenka I hope that helps.

commit 1637d6f0 on git@git.bioconductor.org:packages/rawrr (check out and R CMD build or wait for two days)

# fetch via ExperimentHub
library(ExperimentHub)
eh <- ExperimentHub::ExperimentHub()
EH4547 <- normalizePath(eh[["EH4547"]])

(rawfile <- paste0(EH4547, ".raw"))
if (!file.exists(rawfile)){
  file.copy(EH4547, rawfile)
}
R> bm <- lapply(2^(0:14), function(n, ...){
+         m0 <-  microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='default')}, ...)
+         m1 <-  microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='barebone')}, ...)
+         
+         data.frame(time = c(m0$time, m1$time), mode=c('default', 'barebone'), n=n)
+  }, times=1, unit="nanosecond") |> Reduce(f='rbind')
R> bm
          time     mode     n
1    983118992  default     1
2    906433494 barebone     1
3    902113611  default     2
4    871311213 barebone     2
5    890822867  default     4
6    879356766 barebone     4
7    895267636  default     8
8    909109441 barebone     8
9    930387498  default    16
10   881011362 barebone    16
11   929100467  default    32
12   857490072 barebone    32
13   914358999  default    64
14   872367250 barebone    64
15   962366760  default   128
16   876129902 barebone   128
17   996060642  default   256
18   908822154 barebone   256
19  1170730769  default   512
20   925475452 barebone   512
21  1963340186  default  1024
22  1120511427 barebone  1024
23  3557690212  default  2048
24  1409178241 barebone  2048
25  6165030108  default  4096
26  1976297334 barebone  4096
27 10846751392  default  8192
28  3010938648 barebone  8192
29 29449842481  default 16384
30  6763253400 barebone 16384
R> lattice::xyplot(time ~ n, groups=bm$mode, data=bm, type='b', scale=list(log=TRUE), ylab='time [in nanosecond]', xlab='number of spectra')
[Figure: log-log plot of time in nanoseconds versus number of spectra for the 'default' and 'barebone' modes]
R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] tartare_1.7.2       ExperimentHub_2.1.4 AnnotationHub_3.1.5
[4] BiocFileCache_2.0.0 dbplyr_2.1.1        BiocGenerics_0.39.2

loaded via a namespace (and not attached):
 [1] KEGGREST_1.33.0               tidyselect_1.1.1             
 [3] BiocVersion_3.14.0            purrr_0.3.4                  
 [5] lattice_0.20-44               vctrs_0.3.8                  
 [7] generics_0.1.0                htmltools_0.5.2              
 [9] stats4_4.1.1                  yaml_2.2.1                   
[11] utf8_1.2.2                    interactiveDisplayBase_1.31.2
[13] blob_1.2.2                    rlang_0.4.11                 
[15] pillar_1.6.3                  later_1.3.0                  
[17] withr_2.4.2                   glue_1.4.2                   
[19] DBI_1.1.1                     rappdirs_0.3.3               
[21] bit64_4.0.5                   GenomeInfoDbData_1.2.7       
[23] lifecycle_1.0.1               zlibbioc_1.39.0              
[25] Biostrings_2.61.2             memoise_2.0.0                
[27] Biobase_2.53.0                IRanges_2.27.2               
[29] fastmap_1.1.0                 httpuv_1.6.3                 
[31] GenomeInfoDb_1.29.8           curl_4.3.2                   
[33] fansi_0.5.0                   AnnotationDbi_1.55.1         
[35] Rcpp_1.0.7                    xtable_1.8-4                 
[37] promises_1.2.0.1              filelock_1.0.2               
[39] BiocManager_1.30.16           cachem_1.0.6                 
[41] S4Vectors_0.31.4              XVector_0.33.0               
[43] mime_0.11                     bit_4.0.4                    
[45] microbenchmark_1.4.9          png_0.1-7                    
[47] digest_0.6.27                 dplyr_1.0.7                  
[49] shiny_1.7.0                   grid_4.1.1                   
[51] tools_4.1.1                   bitops_1.0-7                 
[53] magrittr_2.0.1                RCurl_1.98-1.4               
[55] tibble_3.1.4                  RSQLite_2.2.8                
[57] rawrr_1.3.2                   crayon_1.4.1                 
[59] pkgconfig_2.0.3               ellipsis_0.3.2               
[61] rstudioapi_0.13               assertthat_0.2.1             
[63] httr_1.4.2                    R6_2.5.1                     
[65] compiler_4.1.1               

Cheers

romanzenka commented 2 years ago

Thank you! I have achieved very comparable results (modulo the start, some caches were not warm enough):

[Figure: plot of the reproduced benchmark timings]

Testing on our files now.

romanzenka commented 2 years ago

I have noticed that if I try to read a non-centroided spectrum with mode='barebone', I get an error - which is 100% ok with me.

I'm updating the test to a) read only MS2 spectra b) cycle through different files so we do not get overly optimistic results thanks to caching of previously loaded data.

Hopefully I will have plots shortly - what I am curious about seeing is "spectra per second", so I'll modify the plot a bit.
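
For reference, the spectra-per-second figure can be derived directly from the timings already posted in this thread. The sketch below copies four rows from that benchmark table; the `spectra_per_second` column name is just an illustrative choice.

```r
# Four rows copied from the benchmark table posted above (time in nanoseconds).
bm <- data.frame(
  time = c(1963340186, 1120511427, 29449842481, 6763253400),
  mode = c("default", "barebone", "default", "barebone"),
  n    = c(1024, 1024, 16384, 16384)
)

# Throughput: number of spectra divided by elapsed seconds.
bm$spectra_per_second <- bm$n / (bm$time / 1e9)
print(bm)
```

Because every call carries a fixed start-up cost of roughly one second, throughput computed this way rises with n until the per-spectrum cost dominates.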

romanzenka commented 2 years ago

Below is a chart showing the times (it tops out at 8192 spectra because the code crashed; I am investigating now).

The difference is that each microbenchmark is run on a completely different .raw file to reduce the effect of caching. I used a 24-fraction set of .raw files to make sure I have a fresh one for each query.

[Figure: read times, with each benchmark run on a different .raw file]

Here is the same thing with spectra per second plotted on Y axis. The update you provided did have a dramatic effect on read times. Thank you!

[Figure: the same benchmark with spectra per second on the Y axis]

romanzenka commented 2 years ago

Well, I tracked down the bug. If I load 16,384 spectra from a particular file, my R crashes when it tries to source the resulting 1.1GB of R source code. The extraction itself takes about 1 minute, at an impressive 270 spectra per second... but then R cannot handle the parse on my 32GB RAM laptop. I get:

negative length vectors are not allowed

I think we ran over the maximum vector length in R. Fixing that might be a future improvement; for now I will simply run the input in chunks big enough to get me speed, but small enough not to kill R.
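
A chunked read along these lines could be sketched as follows. The helper name `read_in_chunks`, the chunk size, and the stand-in reader are all hypothetical; with rawrr the reader would wrap `rawrr::readSpectrum()`, and note that combining the per-chunk results with `c()` may drop any class attribute that the real return value carries.

```r
# Read scans in fixed-size chunks; `reader` is any function that takes a
# vector of scan numbers and returns a list with one element per scan.
read_in_chunks <- function(scans, chunk_size, reader) {
  chunks <- split(scans, ceiling(seq_along(scans) / chunk_size))
  do.call(c, lapply(unname(chunks), reader))
}

# Stand-in reader so the sketch runs without a raw file. With rawrr it could be
# function(s) rawrr::readSpectrum(rawfile, s, mode = 'barebone')
fake_reader <- function(s) lapply(s, function(i) list(scan = i))

spectra <- read_in_chunks(1:10, chunk_size = 4, reader = fake_reader)
```

Each chunk still pays the fixed per-call overhead, so the chunk size trades throughput against the size of the intermediate output that R has to parse in one go.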

cpanse commented 2 years ago

"I have noticed that if I try to read a non-centroided spectrum with mode='barebone', I get an error - which is 100% ok with me."

Thanks; I fixed that in commit 36f43e15. C