bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0

implement spectrumAI in pypgatk #65

Closed ypriverol closed 1 year ago

ypriverol commented 2 years ago

SpectrumAI (https://github.com/yafeng/SpectrumAI) is a tool that detects the b and y ions corresponding to a specific mutation. The original algorithm was implemented in R, but for better integration with the quantms pipeline and pypgatk it would be great to have an implementation in Python.

I suggest the following structure:

The command-line tool consumes a TSV file with the following columns:

canonical peptide | variant peptide | canonical aa | variant aa | position | spectra file | scan
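As a rough illustration of how such a file could be read, here is a minimal sketch using only the standard library. The snake_case column names, the `VariantPSM` class and `read_variant_psms` are assumptions made for this example, not part of pypgatk; the scan is kept as a string because its exact format is not specified in this issue.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class VariantPSM:
    """One row of the proposed input file (field names are assumed)."""
    canonical_peptide: str
    variant_peptide: str
    canonical_aa: str
    variant_aa: str
    position: int      # 1-based position of the substituted residue
    spectra_file: str
    scan: str          # kept as text; the scan format is not fixed here


def read_variant_psms(handle):
    """Parse a tab-separated file with a header row into VariantPSM records."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        VariantPSM(
            canonical_peptide=row["canonical_peptide"],
            variant_peptide=row["variant_peptide"],
            canonical_aa=row["canonical_aa"],
            variant_aa=row["variant_aa"],
            position=int(row["position"]),
            spectra_file=row["spectra_file"],
            scan=row["scan"],
        )
        for row in reader
    ]


# Illustrative input only: the peptides, file name and scan are made up.
demo = io.StringIO(
    "canonical_peptide\tvariant_peptide\tcanonical_aa\tvariant_aa\t"
    "position\tspectra_file\tscan\n"
    "PEPTIDE\tPEPTLDE\tI\tL\t5\texample.mzML\t1234\n"
)
records = read_variant_psms(demo)
print(records[0])
```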

Instead of using the code to generate the theoretical spectra, I suggest using the OpenMS function for that:

example:

from pyopenms import AASequence, MSSpectrum, Param, TheoreticalSpectrumGenerator

tsg = TheoreticalSpectrumGenerator()
spec1 = MSSpectrum()
peptide = AASequence.fromString("DFPIANGER")
# By default the generator adds b- and y-ions of charge 1.
# Here b-ions are switched off and ion annotations are switched on.
p = Param()
p.setValue("add_b_ions", "false")
p.setValue("add_metainfo", "true")
tsg.setParameters(p)
tsg.getSpectrum(spec1, peptide, 1, 1)  # charge range 1:1

# Iterate over the annotated ions and their m/z values
print("Spectrum 1 of", peptide, "has", spec1.size(), "peaks.")
for ion, peak in zip(spec1.getStringDataArrays()[0], spec1):
    print(ion.decode(), "is generated at m/z", peak.getMZ())

Reference: https://pyopenms.readthedocs.io/en/latest/theoreticalspectrumgenerator.html
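Once the theoretical ions are available, the next step would be to check whether the ions on either side of the substituted residue are present in the observed spectrum. The sketch below is one possible way to do that in plain Python and is not the SpectrumAI scoring logic itself. The function names, the `b7`-style ion labels and the 0.02 m/z tolerance are assumptions, and the m/z values in the usage part are invented toy numbers.

```python
from bisect import bisect_left


def flanking_ion_names(peptide_length, position):
    """Names of the b/y ions on either side of a 1-based residue position."""
    names = []
    # b(position-1) and b(position) differ by the mass of the residue at `position`.
    for i in (position - 1, position):
        if 1 <= i < peptide_length:
            names.append(f"b{i}")
    # y ions count from the C-terminus, so y(n-position) and y(n-position+1) flank it.
    for i in (peptide_length - position, peptide_length - position + 1):
        if 1 <= i < peptide_length:
            names.append(f"y{i}")
    return names


def matched_ions(theoretical, observed_mz, names, tol=0.02):
    """Return the ions in `names` that have an observed peak within `tol` m/z."""
    observed = sorted(observed_mz)
    hits = []
    for name in names:
        mz = theoretical.get(name)
        if mz is None:
            continue
        # First observed peak at or above the lower bound of the window.
        k = bisect_left(observed, mz - tol)
        if k < len(observed) and observed[k] <= mz + tol:
            hits.append(name)
    return hits


# Toy numbers for illustration only, not real fragment masses.
names = flanking_ion_names(peptide_length=15, position=8)
toy_theoretical = {"b7": 700.0, "b8": 815.0}
toy_observed = [100.0, 700.01, 900.0]
print(names, matched_ions(toy_theoretical, toy_observed, names))
```

In practice the `theoretical` mapping could be filled from the annotated peaks produced by `TheoreticalSpectrumGenerator`, after normalising the ion labels it emits to the naming used here.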

@husensofteng can you provide an example in this format of a valid variant and a wrong variant, including the mzML file?

husensofteng commented 2 years ago

That would be nice and would make it easier to run.

Currently, I run it as follows:

  1. Compare the identified peptides with canonical peptides (using BLAST)
  2. Custom script to parse the BLAST output and generate a TSV containing only peptides with single amino acid changes
  3. Run spectrumAI using the generated file to check validity of the changed position.

Rscript SpectrumAI.R spemzml_dir psms_with_single_missmatch.tsv outputdir
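For peptide pairs of equal length that are already aligned, the filtering in step 2 could be approximated without BLAST. This is only a sketch under that assumption: `single_substitution` is a hypothetical helper, it does not handle insertions, deletions or offset alignments, and the peptides in the usage lines are made up.

```python
def single_substitution(canonical, variant):
    """Return (position, canonical_aa, variant_aa) for exactly one mismatch.

    The position is 1-based. Returns None when the peptides differ in length
    or when the number of mismatching residues is not exactly one.
    """
    if len(canonical) != len(variant):
        return None
    diffs = [
        (i, c, v)
        for i, (c, v) in enumerate(zip(canonical, variant), start=1)
        if c != v
    ]
    return diffs[0] if len(diffs) == 1 else None


# Illustrative peptides only.
print(single_substitution("PEPTIDE", "PEPTLDE"))  # one mismatch
print(single_substitution("PEPTIDE", "PEPTIDE"))  # identical, no variant
```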

ypriverol commented 2 years ago

@husensofteng :

husensofteng commented 2 years ago

From the PXD008841 dataset, the following two peptides were identified with a single amino acid difference from canonical proteins, and SpectrumAI validated them as PASS and FAIL, respectively.

Pass: TIAECLADELINAAK (Canonical), TIAECLAEELINAAK (Variant). Spectra file: HJOSLO2U_20140703_TMTpool1_300ugIPG3-10_7of15ul_fr10.mzML

Fail: KAAAPAPEEEMDECEQALAAEPK (Canonical), KAAAPTPEEEMDECEQALAAEPK (Variant). Spectra file: HJOSLO2U_20140703_TMTpool1_300ugIPG3-10_7of15ul_fr08.mzML
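To show how these two cases might look in the TSV layout proposed at the top of the issue, here is a small sketch that derives the amino acid change and position from the peptide pairs. The column names and helper functions are assumptions; the scan column is left empty because no scan numbers are given in this thread.

```python
import csv
import io

# Assumed column names for the proposed format.
COLUMNS = [
    "canonical_peptide", "variant_peptide", "canonical_aa",
    "variant_aa", "position", "spectra_file", "scan",
]

# (canonical, variant, spectra file) as reported in this thread.
EXAMPLES = [
    ("TIAECLADELINAAK", "TIAECLAEELINAAK",
     "HJOSLO2U_20140703_TMTpool1_300ugIPG3-10_7of15ul_fr10.mzML"),
    ("KAAAPAPEEEMDECEQALAAEPK", "KAAAPTPEEEMDECEQALAAEPK",
     "HJOSLO2U_20140703_TMTpool1_300ugIPG3-10_7of15ul_fr08.mzML"),
]


def to_rows(examples):
    """Build one row per pair, locating the first differing residue (1-based)."""
    rows = []
    for canonical, variant, spectra_file in examples:
        pos = next(
            i for i, (c, v) in enumerate(zip(canonical, variant), start=1)
            if c != v
        )
        # Scan is left blank: it is not provided in the thread.
        rows.append([canonical, variant, canonical[pos - 1],
                     variant[pos - 1], pos, spectra_file, ""])
    return rows


def to_tsv(rows):
    """Serialise the header and rows as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


rows = to_rows(EXAMPLES)
print(to_tsv(rows))
```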