bittremieux / ANN-SoLo

Spectral library searching using approximate nearest neighbor techniques.
Apache License 2.0

Add ELIB/DLIB spectral library support #12

Open wfondrie opened 3 years ago

wfondrie commented 3 years ago

This pull request adds a module for parsing the ELIB and DLIB spectral libraries, src/ann_solo/sqlite_parsers.py. These are SQLite3 formats from EncyclopeDIA and are defined here. The PR also changes the logging level to INFO.

This module should be easy to expand in the future to also parse BLIB libraries from Bibliospec (as requested in #2).
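As a rough illustration of what parsing such an SQLite library involves, here is a minimal sketch. It is not the code in src/ann_solo/sqlite_parsers.py. The table name `entries`, the column names, and the peak encoding (zlib-compressed, big-endian doubles for m/z and floats for intensity) are assumptions that should be checked against the EncyclopeDIA format definition. `make_demo_library` is a hypothetical helper that only builds a tiny in-memory database so the example can run.

```python
import sqlite3
import struct
import zlib


def decode_array(blob, fmt_char):
    """Decompress a zlib blob and unpack it as big-endian numbers.

    Assumption: peaks are stored zlib-compressed and big-endian.
    """
    raw = zlib.decompress(blob)
    count = len(raw) // struct.calcsize(">" + fmt_char)
    return list(struct.unpack(f">{count}{fmt_char}", raw))


def read_spectra(conn):
    """Yield one dict per library entry.

    Assumption: the table and column names below match the library schema.
    """
    query = (
        "SELECT PeptideModSeq, PrecursorCharge, PrecursorMz, "
        "MassArray, IntensityArray FROM entries"
    )
    for seq, charge, precursor_mz, mass_blob, int_blob in conn.execute(query):
        yield {
            "peptide": seq,
            "charge": charge,
            "precursor_mz": precursor_mz,
            "mz": decode_array(mass_blob, "d"),         # assumed doubles
            "intensity": decode_array(int_blob, "f"),   # assumed floats
        }


def make_demo_library():
    """Hypothetical helper: build a one-entry in-memory library for testing."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (PeptideModSeq TEXT, PrecursorCharge INTEGER, "
        "PrecursorMz REAL, MassArray BLOB, IntensityArray BLOB)"
    )
    mz = zlib.compress(struct.pack(">2d", 175.119, 304.1615))
    intensity = zlib.compress(struct.pack(">2f", 1.0, 0.5))
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?, ?, ?)",
        ("PEPTIDEK", 2, 464.7422, mz, intensity),
    )
    return conn


if __name__ == "__main__":
    for spectrum in read_spectra(make_demo_library()):
        print(spectrum["peptide"], spectrum["charge"], spectrum["mz"])
```

For a real file, `sqlite3.connect("library.dlib")` would replace the demo connection, where `library.dlib` is a placeholder path.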

I'm still working on benchmarking, but it seems good so far.

wfondrie commented 3 years ago

One thing I envisioned this PR enabling is the use of Prosit libraries with ANN-SoLo. However, there are a couple of hiccups in doing so:

  1. The web interface currently requires a CSV file specifying the peptides for which it should generate spectra.
  2. There is currently no way to annotate peptides as decoys in Prosit. Thus, the dlib file that it returns must be annotated after generation.

Would it be out-of-scope for ANN-SoLo to also contain a few utility functions to prepare a FASTA file for Prosit? For (1), I would propose adding a function to generate this CSV file from a FASTA file, similar to the functionality already provided by EncyclopeDIA. To solve (2), I think there are a couple options:

  1. Add a function that modifies the dlib file to properly indicate decoy peptide spectra.
  2. Add an optional decoy_spectral_library_filename that specifies decoy peptide spectra, implying that spectral_library_filename only defines targets.

What are your thoughts? The CSV and annotating a dlib could alternatively be provided by another package.
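To make hiccup (1) concrete, a utility that generates the CSV from a FASTA file could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing ANN-SoLo or EncyclopeDIA function: the column names `modified_sequence`, `collision_energy`, and `precursor_charge` are assumed to be what Prosit expects, the digestion is a simple tryptic rule (cleave after K or R unless followed by P) with no missed cleavages or modifications, and the default charges, collision energy, and length limits are placeholder values. All function names are hypothetical.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def tryptic_peptides(sequence, min_len=7, max_len=30):
    """Fully tryptic digest: cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        at_end = i == len(sequence) - 1
        if at_end or (aa in "KR" and sequence[i + 1] != "P"):
            peptide = sequence[start:i + 1]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


def write_prosit_csv(fasta_handle, out_handle, charges=(2, 3),
                     collision_energy=30):
    """Write one row per unique peptide and charge state.

    Assumption: the header below matches the Prosit input format.
    """
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(
        ["modified_sequence", "collision_energy", "precursor_charge"]
    )
    seen = set()
    for _, protein in read_fasta(fasta_handle):
        for peptide in tryptic_peptides(protein):
            if peptide in seen:
                continue  # skip peptides shared between proteins
            seen.add(peptide)
            for charge in charges:
                writer.writerow([peptide, collision_energy, charge])


if __name__ == "__main__":
    fasta = io.StringIO(">prot1\nMKPEPTIDEAAK\nGGSSLLVVR\n")
    out = io.StringIO()
    write_prosit_csv(fasta, out)
    print(out.getvalue())
```

Decoy handling (hiccup 2) is deliberately left out of this sketch, since it depends on which of the two options above is chosen.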

bittremieux commented 3 years ago

Yes, I totally agree. Prosit compatibility has been on my wish list / TODO list for quite some time.

My preference would be an end-to-end solution. Rather than having manual steps in between (generating a CSV to submit to the Prosit web interface, then converting the output again), it would be nicer if ANN-SoLo had the option to generate a spectral library (and its index) directly from a FASTA file using a built-in Prosit.

Prosit is available as open source, so it should be possible, although it might further complicate the installation instructions, which are already a bit advanced.

wfondrie commented 3 years ago

That is a good goal, but yikes, that does complicate installation! Do you know if they have a programmatic API for their web server? If they do, that might be an alternative way to go.

Either way, I'll probably make a small separate package to handle these things for now.