MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

mzid2tsv not working when previously was #136

Closed ryanpe13002 closed 2 years ago

ryanpe13002 commented 2 years ago

Hi all,

I have been using MS-GF+ to identify peptides from an immunopeptidomics experiment. All files are in the attached ZIP file (for_GitHub.zip).

I have been using the script 1_MS-GF_whole_LUCA_proteome_NCI-H1395_IsoSeq.sh. When I do, I get the error file 1395who_14546895_4294967294.err.

I'm not sure what this error means, and have never gotten it before when working with this file (as I have done several times; the spectrum file is MSB54398A_NCI-H1395_whole_LUCA_proteome_IsoSeq.mzid). I have used the protein search database in the runscript several times with other immunopeptidomics experiments with no issues (name: LUCA_PacBio_GenCode35_collapsed_v3.IsoSeq.psdb.fa), and the same version of the config file indicated with other immunopeptidomics experiments with no issues (config_whole_LUCA_proteome_NCI-H1395.txt).

If anyone can point me to why the mzID to TSV conversion is failing, I would most appreciate it. I am completely stumped, as these scripts have worked a half dozen times at this point and I have never seen this error before. Please let me know if any additional info is needed from me!

Kindest regards, Ryan Englander

ryanpe13002 commented 2 years ago

By way of clarification -

The .mzid file I get is the output of searching my spectrum file with the indicated search space. This .mzid file looks correct and similar to other output files I have produced using these data.

The only change here I have made from previous runs is to change carbamidomethylation from a fixed to a dynamic modification. Everything else is the same as successful runs using these data and scripts that I have employed in the past.

Running "java -version" produces the following output:

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

alchemistmatt commented 2 years ago

The error is occurring in an external library that MS-GF+ references, specifically:

Since this is an external library, it's not something that I can debug. I suggest that you instead use the MzidToTsvConverter program, available at

On Linux (or Mac), run this program using mono

mono MzidToTsvConverter.exe MSB54398A_NCI-H1395_whole_LUCA_proteome_IsoSeq.mzid -unroll

As described in the Readme you can optionally filter by score using any combination of:

ryanpe13002 commented 2 years ago

Thank you so much for this information! It is exceptionally helpful. I have gotten the Mono-based MZID converter to work, but I have some questions about the output.

In the output .tsv file I used to get from the converter in MS-GF+, the peptide column contained results that looked like this:

AMKPPGAQGSQSTY
LTQ+0.984TWAGSHSMRY
LTETWAGSHSMRY
LTQ+0.984TWAGSHSMRY
LTETWAGSHSMRY
RQVDFDVGSASIY
GSDYGNGFGGFGSY
RLRSTIGVDGSVY
RQFAAQTVGNTY

Using the Mono-based converter, I get output that looks like this:

-.AMKPPGAQGSQSTY.-
-.LTQ+0.984TWAGSHSMRY.-
T.LTETWAGSHSMRY.-
-.LTETWAGSHSMRY.-
-.LTQ+0.984TWAGSHSMRY.-
T.LTETWAGSHSMRY.-
-.LTETWAGSHSMRY.-
-.RQVDFDVGSASIY.-
-.GSDYGNGFGGFGSY.-
-.RLRSTIGVDGSVY.-
-.RQFAAQTVGNTY.-

What do the symbols outside the periods mean? Are they in any way related to the sequence of the peptide? I am only interested in retaining the peptide sequence for downstream analysis, and want to ensure I am not adding/subtracting amino acid sequences inappropriately.

FarmGeek4Life commented 2 years ago

The Readme for the MzidToTsvConverter has detailed information on the columns. For the Peptide column it says: "The identified peptide, with prefix and suffix residues. Also includes a numeric representation of both static and dynamic post translational modifications."

If the start/end symbol/character is a dash, that means that it is the start/end of a protein entry in the .fasta file. If you want to remove them in later processing, you should be able to check that the second and second-to-last characters are both periods, and then remove the first and last two characters on those sequences.
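The check described above could be sketched in Python roughly as follows. This is only an illustration, not part of MzidToTsvConverter: the function name strip_flanking_residues is made up here, and it assumes each value is a single string taken from the Peptide column of the TSV file.

```python
def strip_flanking_residues(peptide: str) -> str:
    """Return the core peptide from a 'X.PEPTIDE.Y' string.

    Strings that do not have a period in the second and
    second-to-last positions are returned unchanged.
    """
    # Shortest valid flanked form is 5 characters, e.g. "A.B.C"
    if len(peptide) >= 5 and peptide[1] == "." and peptide[-2] == ".":
        return peptide[2:-2]
    return peptide


# Values copied from the peptide columns quoted earlier in this thread
examples = [
    "-.AMKPPGAQGSQSTY.-",
    "T.LTETWAGSHSMRY.-",
    "-.LTQ+0.984TWAGSHSMRY.-",
    "RQFAAQTVGNTY",  # already unflanked, left as-is
]
for value in examples:
    print(strip_flanking_residues(value))
```

Because only the outer two characters on each side are removed, modification masses such as +0.984 and their decimal points stay inside the core sequence.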

alchemistmatt commented 2 years ago

The letter (or dash) before the first period and the letter (or dash) after the second period provide the context of the peptide in the protein. We call these "prefix" and "suffix" letters, and most PSM identification tools include them in the results for reference. By including them, you can easily determine if a peptide is fully tryptic or not (by examining the prefix residue and the residue just before the second period). As stated earlier, a dash means the peptide is at the start or end of the protein sequence. If you only care about the core peptide sequence, remove the first two and last two characters from each PSM in the TSV file.
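To illustrate how the prefix and suffix letters can be used, here is a rough Python sketch of the "fully tryptic" check mentioned above. The function names split_psm and is_fully_tryptic are hypothetical, and the rule is a simplifying assumption: cleavage after K or R only, with a protein terminus (dash) also accepted and the proline exception ignored. The example "K.LTETWAGSHSMRK.A" is invented for demonstration and does not come from the data in this thread.

```python
TRYPTIC_RESIDUES = {"K", "R"}


def split_psm(peptide: str):
    """Split 'X.PEPTIDE.Y' into (prefix, core, suffix)."""
    if len(peptide) < 5 or peptide[1] != "." or peptide[-2] != ".":
        raise ValueError(f"not in prefix.peptide.suffix form: {peptide!r}")
    return peptide[0], peptide[2:-2], peptide[-1]


def is_fully_tryptic(peptide: str) -> bool:
    """Simplified check: both ends sit at a K/R cleavage site or a protein terminus."""
    prefix, core, suffix = split_psm(peptide)
    # N-terminal side: the residue before the peptide is K/R, or the protein starts here
    n_term_ok = prefix == "-" or prefix in TRYPTIC_RESIDUES
    # C-terminal side: last amino acid letter of the core, skipping any trailing
    # modification mass such as "+0.984"
    last_residue = next((ch for ch in reversed(core) if ch.isalpha()), None)
    c_term_ok = suffix == "-" or last_residue in TRYPTIC_RESIDUES
    return n_term_ok and c_term_ok


print(split_psm("T.LTETWAGSHSMRY.-"))
print(is_fully_tryptic("K.LTETWAGSHSMRK.A"))
print(is_fully_tryptic("T.LTETWAGSHSMRY.-"))
```

For immunopeptidomics data, where peptides are not produced by tryptic digestion, this check is unlikely to be meaningful. The split_psm helper alone is enough to recover the core sequence.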