bittremieux / spectrum_utils

Python package for efficient mass spectrometry data processing and visualization
https://spectrum-utils.readthedocs.io/
Apache License 2.0

Speedup of peptide parsing and annotation #61

Closed jspaezp closed 2 months ago

jspaezp commented 3 months ago

This PR implements four main changes, all aimed at speeding up spectrum annotation workflows.

  1. A fast path ("fastpass") for unmodified peptides during parsing.
  2. An option to use a simpler parsing grammar.
  3. LRU caching of the parser (the grammar is read once per session, not once per parse of a ProForma sequence).
  4. An option to annotate spectra by passing a list of proteoforms directly (instead of a sequence).
    • This feature is critical for me, since I have a workflow that uses both the proteoforms directly and the annotated spectra. By itself, it makes my workflow 2x faster.
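To illustrate how items 1, 3 and 4 fit together, here is a minimal sketch. It is not the code in this PR: `Proteoform`, `SlowParser`, `get_parser`, `parse` and `resolve_proteoforms` are hypothetical names, and the toy parser only stands in for a real grammar-based ProForma parser that is expensive to construct.

```python
import functools
import re
from dataclasses import dataclass
from typing import List, Union

# Hypothetical stand-ins for illustration; not spectrum_utils names.
_UNMODIFIED = re.compile(r"[A-Z]+")


@dataclass(frozen=True)
class Proteoform:
    sequence: str
    modifications: tuple = ()


class SlowParser:
    """Placeholder for a grammar-based parser that is costly to build."""

    instances = 0

    def __init__(self) -> None:
        SlowParser.instances += 1  # count how often the parser is built

    def parse(self, proforma: str) -> List[Proteoform]:
        # Toy handling: collect bracketed modifications, then strip them.
        mods = tuple(re.findall(r"\[([^\]]*)\]", proforma))
        sequence = re.sub(r"\[[^\]]*\]", "", proforma)
        return [Proteoform(sequence, mods)]


@functools.lru_cache(maxsize=1)
def get_parser() -> SlowParser:
    # Built on first use, then reused for every later parse.
    return SlowParser()


def parse(proforma: str) -> List[Proteoform]:
    # Fast path: plain amino acid strings never reach the grammar.
    if _UNMODIFIED.fullmatch(proforma):
        return [Proteoform(proforma)]
    return get_parser().parse(proforma)


def resolve_proteoforms(
    peptide: Union[str, List[Proteoform]]
) -> List[Proteoform]:
    # Accept already-parsed proteoforms so callers do not parse twice.
    return parse(peptide) if isinstance(peptide, str) else peptide
```

With this layout, a call such as `parse("PEPTIDE")` returns without constructing the parser at all, and a workflow that already holds parsed proteoforms can hand them to `resolve_proteoforms` unchanged.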

Benchmarks

Using some dummy peptide examples, the speedup I see in parsing is:

With mods

  • 29.51it/s -> (baseline), greedy loading, no fastpass
  • 137.54it/s -> + unmod fastpass, cached full parser (4x improve)
  • 168.48it/s -> + simple parser (1.22x improve, ~6x from baseline)

Without mods

  • 34.18it/s -> (baseline), greedy loading, no fastpass
  • 995089.92it/s -> + unmod fastpass, cached full parser (~30000x improve)
  • 1081006.19it/s -> + simple parser (equivalent for practical purposes)
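For anyone who wants to reproduce numbers of this kind, the following is a rough sketch of how an iterations-per-second figure and the resulting speedup ratios could be computed. It is an assumption about the method, not the benchmark script used here; `iterations_per_second` and `speedup` are hypothetical helper names.

```python
import time
from typing import Callable, Iterable


def iterations_per_second(
    func: Callable[[str], object], inputs: Iterable[str]
) -> float:
    """Call func once per input and return the observed throughput."""
    items = list(inputs)
    start = time.perf_counter()
    for item in items:
        func(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed


def speedup(new_rate: float, old_rate: float) -> float:
    """Ratio between two throughput figures measured in it/s."""
    return new_rate / old_rate
```

Applied to the rates reported for modified peptides, `speedup(137.54, 29.51)` is roughly 4.7 and `speedup(168.48, 29.51)` is roughly 5.7.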

On one of my heavy annotation workflows, these changes dropped the run time from 45 mins to 2.20 :P

LMK what you think! Best

jspaezp commented 3 months ago

btw the tests that involve reading from USI are also breaking on master on my local system.

jspaezp commented 2 months ago

@bittremieux added the suggestions, LMK what you think!