Searching for the exact peptides in database provided - Githubissues

MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Searching for the exact peptides in database provided #120

Closed lydiayliu closed 3 years ago

lydiayliu commented 3 years ago

Describe the question or problem

Hi there, I wish to conduct a search using MSGF+ where the algorithm only considers EXACT matches to the peptides provided in the fasta database (with a static and a dynamic modification).

For example, there are two peptide entries in the fasta file:

>peptide_1
MDFYAMIHAFWLIAVLYRR
>peptide_2
MDFYAMIHAFWLIAVLYR

My samples were digested with trypsin, so in my database there are only tryptic peptides (with some miscleavages that I have already included).

I am using the following settings; these are the only ones I can think of that are relevant:

#Enzyme ID
#  0 means No enzyme used
#  1 means Trypsin (Default); use this along with NTT=0 for a no-enzyme search of a tryptically digested sample
#  2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Enzyme (for peptidomics)
EnzymeID=9
#Number of tolerable termini
#  The number of peptide termini that must have been cleaved by the enzyme (default 1)
#  For trypsin, 2 means fully tryptic only, 1 means partially tryptic, and 0 means no-enzyme search
NTT=2

MSGF+ would return this result:

sample.mzML controllerType=0 controllerNumber=1 scan=65059 65059 HCD 537.5329 2 1.9908882 4 DFYAM+15.995IHAFWLIAVLYR peptide_1(pre=M,post=R);peptide_2(pre=M,post=-) 101 42 2.5559657E-9 0.006229213

My problem with this result is two-fold:

1. DFYAMIHAFWLIAVLYR is not a peptide in the database, and I do not see an option in the config to TURN OFF N-terminal M cleavage (while I appreciate that MSGF+ probably just tried both possibilities, I still wish to turn it off so that it does not interfere with my FDR calculations).
2. The "protein" column of the PSM has the names of all entries in the database that contain the peptide. In my understanding of "no enzyme" digestion, only exact matches to the peptides given in the database should be made, and even if other entries also contain the peptide (and fit the trypsin digestion pattern), they should still not be listed under "protein" because that is not the fasta entry where the match was made. This is not really a problem, but it is a nuisance when parsing which exact entry the peptide match came from.

Do you have any suggestions on how I could modify the params file to get cleaner results? Thanks.

alchemistmatt commented 3 years ago

Thank you for clearly describing the issue. Yes, you are using the correct settings for peptidomics (EnzymeID=9 and NTT=2). I just ran some tests, and I got the same results for both of these:

EnzymeID=0 and NTT=2
EnzymeID=9 and NTT=2

You could try EnzymeID=0 and see if it makes a difference, but I doubt it will. In my tests, all of the results for EnzymeID=9 and NTT=2 were a full "peptide" match (starting and ending with -), though, yes, some had M+15.995.

MS-GF+ is in maintenance-mode status, so I cannot update it at present to exclude the M+15.995 results. Thus, you're just going to have to post-process the results to exclude the peptides you don't want. Be sure to use the Mzid to TSV Converter to convert the results, e.g. MzidToTsvConverter.exe Results.mzid -sd -unroll
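One way to do this post-processing is sketched below in Python. It keeps only the rows whose unmodified peptide is identical to a whole FASTA entry. The column name "Peptide", the handling of flanking residues (e.g. M.PEPTIDE.R), and the helper names are assumptions for illustration; check them against the header and peptide format of your own TSV file.

```python
import csv
import io
import re

# Numeric mass modifications such as +15.995 or -17.027.
MOD_PATTERN = re.compile(r"[+-]\d+(?:\.\d+)?")
# Optional flanking residues such as "M.PEPTIDE.R" (assumed format).
FLANK_PATTERN = re.compile(r"^[A-Z_-]\.(.+)\.[A-Z_-]$")


def bare_sequence(peptide):
    """Strip flanking residues and mass modifications from a peptide string."""
    match = FLANK_PATTERN.match(peptide)
    if match:
        peptide = match.group(1)
    return MOD_PATTERN.sub("", peptide)


def read_fasta_sequences(handle):
    """Return the set of sequences found in a FASTA file-like object."""
    sequences, current = set(), []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.add("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        sequences.add("".join(current))
    return sequences


def keep_exact_matches(tsv_handle, fasta_handle, peptide_column="Peptide"):
    """Yield TSV rows whose unmodified peptide equals a whole FASTA entry."""
    allowed = read_fasta_sequences(fasta_handle)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        if bare_sequence(row[peptide_column]) in allowed:
            yield row


# Small in-memory demonstration using the two entries from this issue.
demo_fasta = io.StringIO(
    ">peptide_1\nMDFYAMIHAFWLIAVLYRR\n>peptide_2\nMDFYAMIHAFWLIAVLYR\n"
)
demo_tsv = io.StringIO(
    "ScanNum\tPeptide\n"
    "1\tDFYAM+15.995IHAFWLIAVLYR\n"   # Met-removed form, not a database entry
    "2\tMDFYAM+15.995IHAFWLIAVLYR\n"  # exact match to peptide_2
)
demo_rows = list(keep_exact_matches(demo_tsv, demo_fasta))
```

For real files, pass open file handles (for example `open("Results.tsv", newline="")`) instead of the in-memory strings; the file name here is only a placeholder.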

As for re-computing QValue, these Excel files demonstrate how to do that manually:
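Independently of those spreadsheets, the usual target-decoy calculation can be sketched in Python. This is a minimal sketch that assumes FDR is estimated as decoys divided by targets at each score threshold, that a lower score is better (as with an E-value), and that you have already flagged each PSM as target or decoy; the function name and input layout are hypothetical, and ties in score are not treated specially.

```python
def compute_q_values(psms):
    """Return (score, is_decoy, q_value) tuples sorted from best to worst.

    psms: iterable of (score, is_decoy) pairs, where a LOWER score is better.
    """
    ranked = sorted(psms, key=lambda p: p[0])

    # FDR estimate at each rank: decoys / targets among PSMs at least this good.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / targets if targets else 1.0)

    # q-value: the smallest FDR at this threshold or any more permissive one,
    # capped at 1.0.
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, q_values)]


example = compute_q_values(
    [(1e-10, False), (1e-9, False), (1e-8, True), (1e-7, False), (1e-6, False)]
)
```

If PSMs are removed before this step, the q-values should be recomputed on the filtered list, since the target and decoy counts change.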

alchemistmatt commented 3 years ago

Correction to my previous post; I was confusing M+15.995 and auto-removal of the N-Terminal M residue. M+15.995 is a dynamic Met Ox, and that's allowed; it's the auto-removal of M that causes concern. Given how MS-GF+ uses dynamic programming to search for matches, I'm not certain that it would be straightforward to prevent the auto M-removal. Additionally, given how FASTA files are indexed, the multi-protein reporting is probably something you'll just have to work around.

One thought would be to convert your FASTA file to tab-delimited text, then sort by peptide, then remove duplicates so that only the first occurrence of each peptide is kept. Next, convert from tab-delimited text back to FASTA. For this, use the Protein Digestion Simulator.
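The same de-duplication can also be sketched directly in Python, without the tab-delimited round trip. This is an illustrative alternative, not the Protein Digestion Simulator workflow: the function name is hypothetical, and it keeps the first entry in file order rather than sorting first. Note that it only removes entries with identical sequences; an entry that is merely contained in a longer one (such as peptide_2 within peptide_1) is kept.

```python
def dedupe_fasta(text):
    """Keep only the first FASTA entry for each distinct sequence."""
    entries = []  # (header, sequence) pairs in file order
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))

    seen = set()
    kept = []
    for header, sequence in entries:
        if sequence not in seen:
            seen.add(sequence)
            kept.append(f"{header}\n{sequence}")
    return "\n".join(kept) + ("\n" if kept else "")
```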

alchemistmatt commented 3 years ago

I did some more digging. The existing code already has the option to disable considering M cleavage at the N-terminus. Update your parameter file to have this:

# Control N-terminal methionine cleavage
#  0 means to consider protein N-term Met cleavage (Default)
#  1 means to ignore protein N-term Met cleavage
IgnoreMetCleavage=1

I will update the program to show the option's value when it displays parameters at run-time. I'll also update the documentation and the example parameter files.
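As a conceptual illustration only (this is a toy model in Python, not MS-GF+ code, and the function name is hypothetical), the effect of the setting on the forms of a database entry that can be matched looks roughly like this:

```python
def entry_forms(sequence, ignore_met_cleavage):
    """List the N-terminal forms of an entry that are considered as candidates.

    Toy model of IgnoreMetCleavage: with 0 (False) the Met-removed form is
    also considered; with 1 (True) only the entry as written is considered.
    """
    forms = [sequence]
    if not ignore_met_cleavage and sequence.startswith("M") and len(sequence) > 1:
        forms.append(sequence[1:])
    return forms


default_forms = entry_forms("MDFYAMIHAFWLIAVLYR", ignore_met_cleavage=False)
strict_forms = entry_forms("MDFYAMIHAFWLIAVLYR", ignore_met_cleavage=True)
```

With the default, peptide_2 contributes both MDFYAMIHAFWLIAVLYR and DFYAMIHAFWLIAVLYR, which is how the Met-removed peptide in the original report could be matched.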

alchemistmatt commented 3 years ago

Also, these are not the same:

EnzymeID=0 and NTT=2
EnzymeID=9 and NTT=2

Here's a better description of EnzymeID:

# Enzyme ID
#  0 means unspecific cleavage (cleave after any residue)
#  1 means Trypsin (Default); use this along with NTT=0 for a no-enzyme-specificity search of a tryptically digested sample
#  2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Cleavage (for peptidomics)
EnzymeID=1
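To make the difference between 0 and 9 concrete, here is a toy Python enumeration based only on the descriptions above. It is a sketch of the idea, not how MS-GF+ generates candidates internally (it ignores peptide length limits, NTT, and N-terminal Met handling), and the function name is hypothetical.

```python
def candidates(sequence, enzyme_id):
    """Toy enumeration of candidate peptides for EnzymeID 0 and 9 only."""
    if enzyme_id == 9:
        # No cleavage: the database entry is searched as a whole.
        return {sequence}
    if enzyme_id == 0:
        # Unspecific cleavage: any contiguous stretch of the entry.
        n = len(sequence)
        return {sequence[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    raise ValueError("this sketch only models EnzymeID 0 and 9")


whole_only = candidates("ACDK", 9)
all_stretches = candidates("ACDK", 0)
```

For a peptide database that should only be matched entry by entry, the first behaviour (EnzymeID=9) is the one that corresponds to the request in this issue.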

alchemistmatt commented 3 years ago

Release 2021.03.22 includes an updated .jar file that shows the value of IgnoreMetCleavage at runtime. It also includes updated example parameter files.

lydiayliu commented 3 years ago

Hi Matthew,

Thank you for all the detailed investigations and comments! The IgnoreMetCleavage option is certainly going to be very useful, and it's great to know that I'm using the correct settings (EnzymeID=9 and NTT=2) for my purpose. Thanks also for the clarification on what EnzymeID=0 does! It is confusing that 0 and 9 both say "no enzyme".

Thanks again!!