MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Peptide is not linked back to ALL its source proteins in the MZID <PeptideEvidence> field #112

Closed realizor closed 3 years ago

realizor commented 3 years ago

Hi,

I noticed that in the <PeptideEvidence> section of the MS-GF+ output, a peptide is not linked back to all of the proteins that can generate it.

For example, in my (human) database, a total of 7 proteins can produce the peptide "FGGPGTASRPSSSR", but only 2 of them show up in the <PeptideEvidence> section for this peptide.

    <PeptideSequence>FGGPGTASRPSSSR</PeptideSequence>
    <PeptideEvidence dBSequence_ref="DBSeq38773434" peptide_ref="Pep_FGGPGTASRPSSSR" start="2" end="15" pre="M" post="S" isDecoy="false" id="PepEv_38773434_FGGPGTASRPSSSR_2"/>
    <PeptideEvidence dBSequence_ref="DBSeq38775695" peptide_ref="Pep_FGGPGTASRPSSSR" start="2" end="15" pre="M" post="S" isDecoy="false" id="PepEv_38775695_FGGPGTASRPSSSR_2"/>

The 2 proteins reported above are TALONT000242860.p1 and TALONT000242936.p1:

<DBSequence length="453" searchDatabase_ref="SearchDB_1" accession="TALONT000242860.p1" id="DBSeq38773434">
<DBSequence length="453" searchDatabase_ref="SearchDB_1" accession="TALONT000242936.p1" id="DBSeq38775695">
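To list which proteins MS-GF+ actually linked to each peptide without reading the XML by hand, the PeptideEvidence and DBSequence elements can be joined on their ids. Below is a minimal Python sketch that assumes only the element and attribute names visible in the excerpts above. The function names and the tiny inline document are made up for illustration; a real .mzid file is much larger and namespaced, which is why the sketch strips namespaces from tag names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}PeptideEvidence'."""
    return tag.rsplit("}", 1)[-1]


def proteins_per_peptide(root):
    """Map each peptide_ref to the sorted accessions its PeptideEvidence entries point to."""
    accession_by_id = {}
    evidence = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "DBSequence":
            accession_by_id[elem.get("id")] = elem.get("accession")
        elif name == "PeptideEvidence":
            evidence.append((elem.get("peptide_ref"), elem.get("dBSequence_ref")))

    mapping = defaultdict(set)
    for peptide_ref, db_ref in evidence:
        # Fall back to the raw id if the DBSequence entry is missing.
        mapping[peptide_ref].add(accession_by_id.get(db_ref, db_ref))
    return {pep: sorted(accs) for pep, accs in mapping.items()}


# Hypothetical, heavily trimmed stand-in for an .mzid file, built from the excerpts above.
EXAMPLE_MZID = """<SequenceCollection>
  <DBSequence length="453" searchDatabase_ref="SearchDB_1" accession="TALONT000242860.p1" id="DBSeq38773434"/>
  <DBSequence length="453" searchDatabase_ref="SearchDB_1" accession="TALONT000242936.p1" id="DBSeq38775695"/>
  <PeptideEvidence dBSequence_ref="DBSeq38773434" peptide_ref="Pep_FGGPGTASRPSSSR" start="2" end="15" pre="M" post="S" isDecoy="false" id="PepEv_38773434_FGGPGTASRPSSSR_2"/>
  <PeptideEvidence dBSequence_ref="DBSeq38775695" peptide_ref="Pep_FGGPGTASRPSSSR" start="2" end="15" pre="M" post="S" isDecoy="false" id="PepEv_38775695_FGGPGTASRPSSSR_2"/>
</SequenceCollection>"""

for peptide_ref, accessions in proteins_per_peptide(ET.fromstring(EXAMPLE_MZID)).items():
    print(peptide_ref, accessions)
```

For a file on disk, the root element would come from `ET.parse(path).getroot()` instead of `ET.fromstring`.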

However, in my database FASTA file, many more protein isoforms can also generate this peptide, for example:

    >ENSP00000435613.1
    MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
    YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
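A quick way to check where the peptide sits in a given isoform is a plain substring search. The Python sketch below reports start, end, pre and post in the same style as the PeptideEvidence attributes (1-based, inclusive). The function name `locate_peptide` is made up for illustration, and using "-" for a missing flanking residue at a protein terminus is an assumption about the convention, not something shown in this thread.

```python
def locate_peptide(protein_seq, peptide):
    """Return (start, end, pre, post) for every occurrence, 1-based and inclusive."""
    hits = []
    pos = protein_seq.find(peptide)
    while pos != -1:
        end = pos + len(peptide)  # 0-based exclusive end == 1-based inclusive end
        pre = protein_seq[pos - 1] if pos > 0 else "-"
        post = protein_seq[end] if end < len(protein_seq) else "-"
        hits.append((pos + 1, end, pre, post))
        pos = protein_seq.find(peptide, pos + 1)  # also catches overlapping repeats
    return hits


# Sequence of ENSP00000435613.1 as quoted in the post (only the part shown there).
PROTEIN = (
    "MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV"
    "YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK"
)
PEPTIDE = "FGGPGTASRPSSSR"

print(locate_peptide(PROTEIN, PEPTIDE))  # [(15, 28, 'M', 'S')]
```

In this isoform the peptide spans residues 15 to 28, whereas the two reported PeptideEvidence entries place it at residues 2 to 15.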

I am new to MSGF+. Could you please help me with this?

Thank you so much, Pan

alchemistmatt commented 3 years ago

Thanks for reporting this. I thought that MS-GF+ would report all of the proteins that have a peptide; I don't recall seeing it skip reporting proteins in the past. Until we can look into this for MS-GF+, I can suggest a workaround: use the ProteinCoverageSummarizer to find all of the proteins that contain a peptide. Steps:

  1. Convert your .mzid file to a .tsv file using the MzidToTsvConverter
  2. Download and install the Protein Coverage Summarizer
  3. Open the .tsv file with Excel (or another spreadsheet)
  4. Copy the column with all of the peptide sequences
  5. Paste into a text file and save
  6. Start the Protein Coverage Summarizer
  7. Select your FASTA file (either your original FASTA file or the revCat.fasta file created by MS-GF+, which includes the reversed proteins)
  8. In the options, enable "Search all proteins for peptide sequence" and "Save protein to peptide mapping details". Enable other options that look useful (test with a small peptide input file to see different outputs)
  9. Click Start

The program will create a file listing the input peptides and every protein that has them.
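For readers who want to script the same idea, the mapping can be approximated with a plain substring search over the FASTA file. This is only a sketch of the concept in Python, not how the Protein Coverage Summarizer is implemented; the function names are made up, the two `toy_protein` entries are invented test data, and the accession is assumed to be the first word of each FASTA header.

```python
def read_fasta(lines):
    """Yield (accession, sequence) pairs; the accession is the first word after '>'."""
    accession, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                yield accession, "".join(chunks)
            header = line[1:].split()
            accession = header[0] if header else ""
            chunks = []
        elif accession is not None:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


def map_peptides_to_proteins(fasta_lines, peptides):
    """Return {peptide: [accessions of every protein whose sequence contains it]}."""
    mapping = {pep: [] for pep in peptides}
    for accession, seq in read_fasta(fasta_lines):
        for pep in mapping:
            if pep in seq:
                mapping[pep].append(accession)
    return mapping


# First entry is the fragment quoted in the post; the toy proteins are invented.
EXAMPLE_FASTA = """>ENSP00000435613.1
MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
>toy_protein_1 invented entry that contains the peptide
MFGGPGTASRPSSSRSYV
>toy_protein_2 invented entry that does not
MKTAYIAKQR
"""

result = map_peptides_to_proteins(EXAMPLE_FASTA.splitlines(), ["FGGPGTASRPSSSR"])
for peptide, accessions in result.items():
    print(peptide, accessions)
```

The nested loop is fine for a handful of peptides; for a full search result against a large database, an indexed approach would likely be needed.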

I will work on updating the ProteinCoverageSummarizer to support reading the .tsv file from MzidToTsvConverter, which will remove the need to open the .tsv file with Excel, and will have the advantage that it can make a new .tsv with all of the columns.
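In the meantime, the spreadsheet steps (copy the peptide column, paste into a text file) can be scripted. The Python sketch below rests on several assumptions that should be checked against an actual MzidToTsvConverter output file: that the peptide column is named `Peptide`, that values may carry flanking residues such as `M.FGGPGTASRPSSSR.S`, and that modifications appear as inline masses such as `+57.021`. The function names and the inline rows are hypothetical.

```python
import csv
import re


def clean_peptide(raw):
    """Strip optional flanking residues ('X.' / '.X') and inline modification masses."""
    text = raw.strip()
    text = re.sub(r"^[A-Za-z_\-]\.", "", text)   # leading "R." or "-."
    text = re.sub(r"\.[A-Za-z_\-]$", "", text)   # trailing ".K" or ".-"
    return re.sub(r"[^A-Za-z]", "", text).upper()


def unique_peptides(tsv_lines, column="Peptide"):
    """Return cleaned peptide sequences in order of first appearance, without duplicates."""
    seen, ordered = set(), []
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        peptide = clean_peptide(row[column])
        if peptide and peptide not in seen:
            seen.add(peptide)
            ordered.append(peptide)
    return ordered


# Hypothetical rows; a real file has more columns and possibly different names.
EXAMPLE_TSV = "\n".join([
    "ScanNum\tPeptide\tProtein",
    "101\tM.FGGPGTASRPSSSR.S\tTALONT000242860.p1",
    "102\tM.FGGPGTASRPSSSR.S\tTALONT000242936.p1",
    "103\tK.AC+57.021DEFGHK.L\ttoy_protein",
])

peptides = unique_peptides(EXAMPLE_TSV.splitlines())
print("\n".join(peptides))  # one peptide per line, ready to save as a text file
```

For a file on disk, pass an open file handle (opened with `newline=""`) instead of the list of lines.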

realizor commented 3 years ago

Thank you so much @alchemistmatt ! Super helpful!

alchemistmatt commented 3 years ago

I have released a version of the Protein Coverage Summarizer that supports reading the .tsv file from MzidToTsvConverter and creating a new file that lists all of the proteins for each peptide. Give this a try: Release v1.3.7608

Relevant processing options to enable:

The new file name will end with _AllProteins.txt