Closed tivdnbos closed 6 years ago
When MSGF+ indexes a FASTA file, it parses the protein sequences to create candidate peptides to compare against the spectra. It treats those letters as "stop points", meaning it will never consider a peptide that contains any of them. Similarly, any asterisks (*) in your protein sequences are treated as stop points.
Here's an example. Given a protein that contains a J and an *:
FPTDDDDKIVJGGYTCAANSIPY*QVSLNSGS
If we parse this protein into candidate peptides, with a minimum length of 6 residues per peptide, we get these possible peptides (placed in columns to save space):
FPTDDD GGYTCA YTCAAN QVSLNS
FPTDDDD GGYTCAA YTCAANS QVSLNSG
FPTDDDDK GGYTCAAN YTCAANSI QVSLNSGS
FPTDDDDKI GGYTCAANS YTCAANSIP VSLNSG
FPTDDDDKIV GGYTCAANSI YTCAANSIPY VSLNSGS
PTDDDD GGYTCAANSIP TCAANS SLNSGS
PTDDDDK GGYTCAANSIPY TCAANSI
PTDDDDKI GYTCAA TCAANSIP
PTDDDDKIV GYTCAAN TCAANSIPY
TDDDDK GYTCAANS CAANSI
TDDDDKI GYTCAANSI CAANSIP
TDDDDKIV GYTCAANSIP CAANSIPY
DDDDKI GYTCAANSIPY AANSIP
DDDDKIV AANSIPY
DDDKIV ANSIPY
Oops, I missed a few peptides, like FPTDDD and GGYTCA, but you should get the idea.
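The parsing described above can be sketched in a few lines of Python. This is only an illustration of the stop-point idea, not MSGF+'s actual indexing code: it ignores enzyme specificity and any maximum peptide length, and the function name candidate_peptides is made up for this example.

```python
import re

STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def candidate_peptides(protein, min_len=6):
    """Enumerate every substring of length >= min_len that avoids stop points.

    Hypothetical helper for illustration; not part of MSGF+.
    """
    peptides = []
    # Split on any run of characters that is not a standard residue (J, *, spaces, ...)
    for segment in re.split(f"[^{STANDARD}]+", protein):
        for start in range(len(segment)):
            for end in range(start + min_len, len(segment) + 1):
                peptides.append(segment[start:end])
    return peptides

peptides = candidate_peptides("FPTDDDDKIVJGGYTCAANSIPY*QVSLNSGS")
print(len(peptides))
print(peptides[:5])
```

For the example protein, this produces 49 peptides, none of which spans the J or the *.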
Also, to clarify, MSGF+ recognizes the 20 standard amino acids:
A C D E F G H I K L M N P Q R S T V W Y
It treats the non-standard amino acids (B J O U X Z) plus any symbols (and any spaces) as "stop points" (the term I mentioned in my first response).
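If you want to see how many stop points a FASTA file contains before searching, a quick check along these lines may help. It is a rough sketch that assumes uppercase sequence lines and headers starting with ">"; the function name and the example proteins are made up.

```python
from collections import Counter

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

def count_stop_points(fasta_lines):
    """Count characters in sequence lines that are not one of the 20 standard residues."""
    counts = Counter()
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):  # header line, not sequence
            continue
        counts.update(ch for ch in line if ch not in STANDARD)
    return counts

# Hypothetical two-protein FASTA, given as a list of lines
example = [
    ">hypothetical_protein_1",
    "FPTDDDDKIVJGGYTCAANSIPY*QVSLNSGS",
    ">hypothetical_protein_2",
    "MKBZXU AAC",
]
print(count_stop_points(example))
```

To run it on a real file, pass an open file handle instead of the list, for example count_stop_points(open("proteins.fasta")), where the filename is a placeholder.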
Note also that you can define a custom amino acid, if you wish, by adding a line with custom to the ModificationFileName specified with the -mod switch. Two examples:
1) Assume any J residue in the FASTA file is actually a Histidine:
C6H7N3O1,J,custom,H,Histidine # Histidine masquerading as J
2) Use U for Selenocysteine
C3H5NO, U, custom, U, Selenocysteine # Custom amino acids can only have C, H, N, O, and S
79.9166, U, fix, any, Se80 # Use a static mod to add Se
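To make the layout of these lines explicit, here is a small sketch that splits one line into its five comma-separated fields and drops the trailing # comment. The field names are my own labels, not official MSGF+ terminology, and the parser only handles modification lines of the shape shown above.

```python
def parse_mod_line(line):
    """Split one mods-file line into five fields; return None for blank or comment-only lines."""
    body = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not body:
        return None
    fields = [f.strip() for f in body.split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    # Hypothetical field labels, chosen for readability
    keys = ("composition_or_mass", "residues", "mod_type", "position", "name")
    return dict(zip(keys, fields))

for line in [
    "C6H7N3O1,J,custom,H,Histidine # Histidine masquerading as J",
    "C3H5NO, U, custom, U, Selenocysteine # Custom amino acids can only have C, H, N, O, and S",
    "79.9166, U, fix, any, Se80 # Use a static mod to add Se",
]:
    print(parse_mod_line(line))
```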
I need to update the help pages to include these examples.
Thanks for the explanation! Is there also a way to add ambiguity (i.e. the non-standard amino acids) to ModificationFileName? These are: B = D or N, J = I or L, X = unknown, Z = E or Q
I need to test this, but, in theory, placing the following in the mods.txt file should do what you want (I included oxidized methionine; you can remove that if you want).
C4H6N2O2, B, custom, D, AspOrAsn # Asparagine; see dynamic mod below for 0.984 to get Aspartic acid
C6H11NO, J, custom, IorL, IleOrLeu # Isoleucine or Leucine
C6H11NO, X, custom, Unk, UnknownAA # Unknown residue; use the empirical formula for Leucine for any X
C5H8N2O2, Z, custom, Q, Glutamine # Glutamine; see dynamic mod below for 0.984 to get Glutamic Acid
H-1N-1O, B, opt, any, Deamidated # Add 0.984 to B (aka N) to get D
H-1N-1O, Z, opt, any, Deamidated # Add 0.984 to Z (aka Q) to get E
O1, M, opt, any, Oxidation # Oxidized methionine
Explaining things: the first custom amino acid entry associates the empirical formula of Asparagine (N) with B, then the dynamic mod H-1N-1O will add 0.984 to B to get D. Similarly, there is a custom amino acid entry assigning the empirical formula of Glutamine (Q) to Z, then we optionally add 0.984 to that to get E. Since I and L have the same empirical formula, there is just one entry for J. Finally, for X you have to choose an empirical formula to associate with it. As shown, it uses Leucine, but you would need to decide for yourself what to use.
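The 0.984 arithmetic can be checked with a short calculation. This sketch uses standard monoisotopic element masses (rounded to 8 decimals) and a simple formula parser that I wrote for this example; it is not how MSGF+ parses compositions internally.

```python
import re

# Monoisotopic masses for the elements allowed in custom amino acids
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462, "S": 31.97207100}

def formula_mass(formula):
    """Monoisotopic mass of a composition such as 'C4H6N2O2' or 'H-1N-1O'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(-?\d*)", formula):
        total += MONO[element] * (int(count) if count else 1)
    return total

asn = formula_mass("C4H6N2O2")     # residue assigned to B
deamidation = formula_mass("H-1N-1O")
asp = formula_mass("C4H5NO3")      # Aspartic acid residue

print(round(deamidation, 3))
print(round(asn + deamidation, 5), round(asp, 5))
```

The same check works for Z: C5H8N2O2 (Glutamine) plus H-1N-1O gives the composition of Glutamic acid, C5H7NO3.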
Keep in mind that adding dynamic modifications slows down the MS-GF+ search, so you'll need to be careful about adding several additional dynamic mods.
In order to test this, I updated a handful of proteins in a FASTA file to change certain residues in regions of the proteins that had high scoring peptides in a previous MS-GF+ search. Changes:
I didn't change every occurrence of D, N, I, L, E, or Q; just selected residues. I next searched the dataset against the updated FASTA file and compared to previous search results.
As expected, the search time was 3x longer when looking for B, J, X, or Z residues. That's the big downside to working with FASTA files with ambiguous residues. A bit to my surprise, the scores were nearly identical. Reassuringly, the same peptides were found (with expected changes in residues).
The following table includes some example peptides, comparing the sequence between the unmodified FASTA file (mode 1), and the FASTA file with altered protein sequences (mode 2).
Mode | Peptides to compare | Comment |
---|---|---|
1 | K.DTGKDYDAVNDPGVVSVTEIYNYYKQHGYNTVVMGASFR.N | |
2 | K.DTGKDYDAVNDPGVVSVTEIYNYYKQHGYNTVVMGASFR.N | Identical |
1 | K.AFQMPTPSAAPVVGTVGLANGYAVVALDKVNAADSVSDELVNALKQR.L | |
2 | K.AFQMPTPSAAPVVGTVGLANGYAVVALDKVNAADSVSB+0.984ELVBALKQR.L | D now B+0.984; N now B |
1 | K.AAFDIAVEHNAVDNWAEMLTFAALVSENETMKPLLTGSLASTK.L | |
2 | K.AAFB+0.984IAVEHNAVDBWAEMLTFAALVSENETMKPLLTGSLASTK.L | D now B+0.984; N now B |
1 | R.LANYMNKNPEFTVEIAGHASNVGKPEYNMVLSDKRADAVAK.I | |
2 | R.LANYMNKNPEFTVEXAGHASNVGKPEYNMVXSDKRADAVAK.I | I now X; L now X |
1 | R.SQIEADIAAVYAEGPALAMVDSDKGITNLHVPSDIIIDASMPAAIR.S | |
2 | R.SZIZ+0.984AB+0.984IAAVYAEGPAJAMVDSDKGITBJHVPSDIXIDASMPAAIR.S | Q now Z; E now Z+0.984; D now B+0.984; N now B; L now J; I now X |
Note that I used the MzidToTSVConverter to convert the .mzid file to TSV format: https://github.com/PNNL-Comp-Mass-Spec/Mzid-To-Tsv-Converter/releases
That program inserts mod mass values after residues with dynamic modifications, giving, for example, B+0.984
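If you want to map the reported sequences back to standard residues, something like the following sketch could be applied to the peptide column of the TSV file. It assumes the deamidation mass is written exactly as +0.984, as in the table above, and that B and Z were defined as Asparagine and Glutamine in the mods file. J and X are left untouched because the search cannot tell which residue they stand for. The function name is made up.

```python
import re

# (letter, carries +0.984?) -> standard residue
RESOLVE = {
    ("B", False): "N", ("B", True): "D",
    ("Z", False): "Q", ("Z", True): "E",
}

def resolve_ambiguous(peptide):
    """Replace B, B+0.984, Z, and Z+0.984 with N, D, Q, and E respectively."""
    def repl(match):
        return RESOLVE[(match.group(1), match.group(2) is not None)]
    return re.sub(r"([BZ])(\+0\.984)?", repl, peptide)

print(resolve_ambiguous("K.AAFB+0.984IAVEHNAVDBWAEMLTFAALVSENETMKPLLTGSLASTK.L"))
```

Other dynamic mods, such as oxidized methionine, pass through unchanged because the pattern only matches B and Z.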
Thanks for the extensive information, I'll try it by the end of this week and I'll let you know the result.
Best, Tim
Dear all,
I have a FASTA file with ambiguity in the sequences (meaning that, in addition to the 20 standard amino acids, it also contains B, J, U, X and Z, each of which allows one or more amino acids at a given position). I was wondering how MSGF+ deals with this? If you run it, it throws a warning that the file contains these letters, but it does not say what happens to these sequences.
Kind regards, Tim