Ambiguity in fasta files

tivdnbos commented 6 years ago

Dear all,

I have a fasta file with ambiguity in the sequences (meaning that next to the 20 standard amino acids, it also contains B, J, U, X and Z, each of them allowing one or more amino acids at a certain place). I was wondering how MSGF+ deals with this? When you run it, it warns that the file contains these letters, but it does not say what happens to those sequences.

Kind regards, Tim

alchemistmatt commented 6 years ago

MSGF+ treats those letters as "stop points", meaning it will never consider a peptide that contains any of those letters. Similarly, if you have asterisks (*) in your protein sequences, those too are treated as stop points. When MSGF+ indexes a FASTA file it parses the protein sequences to create candidate peptides to be compared against the spectra.

Here's an example. Given a protein that contains a J and an *: FPTDDDDKIVJGGYTCAANSIPY*QVSLNSGS

If we parse this protein into candidate peptides, with a minimum length of 6 residues per peptide, we get these possible peptides (placed in columns to save space):

FPTDDD         GGYTCA           YTCAAN         QVSLNS
FPTDDDD        GGYTCAA          YTCAANS        QVSLNSG
FPTDDDDK       GGYTCAAN         YTCAANSI       QVSLNSGS
FPTDDDDKI      GGYTCAANS        YTCAANSIP      VSLNSG
FPTDDDDKIV     GGYTCAANSI       YTCAANSIPY     VSLNSGS
PTDDDD         GGYTCAANSIP      TCAANS         SLNSGS
PTDDDDK        GGYTCAANSIPY     TCAANSI
PTDDDDKI       GYTCAA           TCAANSIP
PTDDDDKIV      GYTCAAN          TCAANSIPY
TDDDDK         GYTCAANS         CAANSI
TDDDDKI        GYTCAANSI        CAANSIP
TDDDDKIV       GYTCAANSIP       CAANSIPY
DDDDKI         GYTCAANSIPY      AANSIP
DDDDKIV                         AANSIPY
DDDKIV                          ANSIPY
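The enumeration above can be reproduced with a short script. This is only a sketch of the idea in the example (split at stop points, then list every substring of at least 6 residues); it is not how MS-GF+ is implemented, and it ignores enzyme rules and any maximum peptide length. The function name `candidate_peptides` is made up for illustration.

```python
import re

STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def candidate_peptides(protein, min_len=6):
    """List every substring of at least min_len residues that contains
    only standard amino acids (hypothetical helper, illustration only)."""
    peptides = []
    # Anything that is not a standard residue (J, *, spaces, ...) acts as a stop point
    for segment in re.split(f"[^{STANDARD}]+", protein):
        for start in range(len(segment)):
            for end in range(start + min_len, len(segment) + 1):
                peptides.append(segment[start:end])
    return peptides

peptides = candidate_peptides("FPTDDDDKIVJGGYTCAANSIPY*QVSLNSGS")
print(len(peptides))          # 15 + 28 + 6 = 49 candidates
print("IVJGGY" in peptides)   # False: no peptide spans the J
```

No candidate crosses the J or the *, which is why peptides containing those characters can never be matched.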

alchemistmatt commented 6 years ago

Oops, I missed a few peptides, like FPTDDD and GGYTCA, but you should get the idea.

alchemistmatt commented 6 years ago

Also, to clarify, MSGF+ recognizes the 20 standard amino acids: A C D E F G H I K L M N P Q R S T V W Y. It treats the non-standard amino acids (B J O U X Z) plus any symbols (and any spaces) as "stop points" (the term I mentioned in my first response).
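If you want to see how many stop points a FASTA file contains before searching, a quick scan along these lines may help. It is a minimal sketch, assuming plain-text FASTA with `>` header lines; the function name and the two example proteins are made up, and it counts lowercase letters as non-standard too.

```python
from collections import Counter

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def nonstandard_counts(fasta_text):
    """Count every character in the sequence lines that is not one of
    the 20 standard residues (hypothetical helper, illustration only)."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip header lines
        counts.update(ch for ch in line if ch not in STANDARD)
    return counts

# Made-up example proteins
example = """>prot1 hypothetical example
FPTDDDDKIVJGGYTCAANSIPY*QVSLNSGS
>prot2 hypothetical example
MKUBZX AAK
"""
counts = nonstandard_counts(example)
print(dict(counts))
```

Each character reported here is a position that no candidate peptide can span unless a custom amino acid is defined for it.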

Note also that you can define a custom amino acid, if you wish, by adding a line with custom to the ModificationFileName specified with the -m switch. Two examples:

1) Assume any J residue in the FASTA file is actually a Histidine:

C6H7N3O1,J,custom,H,Histidine    # Histidine masquerading as J

2) Use U for Selenocysteine

C3H5NO,  U, custom, U, Selenocysteine  # Custom amino acids can only have C, H, N, O, and S
79.9166, U, fix, any, Se80             # Use a static mod to add Se

I need to update the help pages to include these examples.

tivdnbos commented 6 years ago

Thanks for the explanation! Is there also a way to add ambiguity (i.e. the non-standard amino acids) to ModificationFileName? These are: B = D or N, J = I or L, X = unknown, Z = E or Q

alchemistmatt commented 6 years ago

I need to test this, but, in theory, placing the following in the mods.txt file should do what you want (I included oxidized methionine; you can remove that if you want).

C4H6N2O2,   B,  custom, D,    AspOrAsn    # Asparagine; see dynamic mod below for 0.984 to get Aspartic acid
C6H11NO,    J,  custom, IorL, IleOrLeu    # Isoleucine or Leucine
C6H11NO,    X,  custom, Unk,  UnknownAA   # Unknown residue; use the empirical formula for Leucine for any X
C5H8N2O2,   Z,  custom, Q,    Glutamine   # Glutamine; see dynamic mod below for 0.984 to get Glutamic Acid

H-1N-1O, B, opt, any, Deamidated          # Add 0.984 to B (aka N) to get D
H-1N-1O, Z, opt, any, Deamidated          # Add 0.984 to Z (aka Q) to get E
O1,      M, opt, any, Oxidation           # Oxidized methionine

Explaining things: the first custom amino acid entry associates the empirical formula of Asparagine (N) with B, then the dynamic mod H-1N-1O will add 0.984 to B to get D. Similarly, there is a custom amino acid entry assigning the empirical formula of Glutamine (Q) to Z, then we optionally add 0.984 to that to get E. Since I and L have the same empirical formula, there is just one entry for J. Finally, for X you have to choose an empirical formula to associate with it. As shown, it uses Leucine, but you would need to decide for yourself what to use.
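The 0.984 shift can be checked from monoisotopic element masses. The snippet below is a small sketch: the `mono_mass` helper is made up for illustration, its parser accepts the compact formula style used in the entries above (an omitted count means 1, negative counts are allowed), and the Aspartic acid residue formula C4H5NO3 is supplied here for comparison; it does not appear in the mods.txt entries.

```python
import re

# Monoisotopic masses of the elements allowed in custom amino acids
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
        "O": 15.9949146, "S": 31.9720707}

def mono_mass(formula):
    """Monoisotopic mass of a formula such as 'C4H6N2O2' or 'H-1N-1O'
    (hypothetical helper, illustration only)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(-?\d+)?", formula):
        total += MONO[element] * (int(count) if count else 1)
    return total

asn = mono_mass("C4H6N2O2")   # formula assigned to B (Asparagine residue)
shift = mono_mass("H-1N-1O")  # the optional Deamidated mod
asp = mono_mass("C4H5NO3")    # Aspartic acid residue, for comparison

print(f"{shift:.3f}")                 # 0.984
print(abs(asn + shift - asp) < 1e-6)  # True: B plus the mod weighs the same as D
```

The same arithmetic applies to Z: Glutamine (C5H8N2O2) plus the shift gives the mass of Glutamic acid.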

Keep in mind that adding dynamic modifications slows down the MS-GF+ search, so you'll need to be careful about adding several additional dynamic mods.

alchemistmatt commented 6 years ago

In order to test this, I updated a handful of proteins in a FASTA file to change certain residues in regions of the proteins that had high scoring peptides in a previous MS-GF+ search. Changes:

Change D or N to B
Change I or L to J
Change I or L to X
Change E or Q to Z

I didn't change every occurrence of D, N, I, L, E, or Q; just selected residues. I next searched the dataset against the updated FASTA file and compared to previous search results.

As expected, the search time was 3x longer when looking for B, J, X, or Z residues. That's the big downside to working with FASTA files with ambiguous residues. A bit to my surprise, the scores were nearly identical. Reassuringly, the same peptides were found (with expected changes in residues).

The following table includes some example peptides, comparing the sequence between the unmodified FASTA file (mode 1), and the FASTA file with altered protein sequences (mode 2).

| Mode | Peptides to compare | Comment |
|------|---------------------|---------|
| 1 | `K.DTGKDYDAVNDPGVVSVTEIYNYYKQHGYNTVVMGASFR.N` | |
| 2 | `K.DTGKDYDAVNDPGVVSVTEIYNYYKQHGYNTVVMGASFR.N` | Identical |
| 1 | `K.AFQMPTPSAAPVVGTVGLANGYAVVALDKVNAADSVSDELVNALKQR.L` | |
| 2 | `K.AFQMPTPSAAPVVGTVGLANGYAVVALDKVNAADSVSB+0.984ELVBALKQR.L` | D now B+0.984; N now B |
| 1 | `K.AAFDIAVEHNAVDNWAEMLTFAALVSENETMKPLLTGSLASTK.L` | |
| 2 | `K.AAFB+0.984IAVEHNAVDBWAEMLTFAALVSENETMKPLLTGSLASTK.L` | D now B+0.984; N now B |
| 1 | `R.LANYMNKNPEFTVEIAGHASNVGKPEYNMVLSDKRADAVAK.I` | |
| 2 | `R.LANYMNKNPEFTVEXAGHASNVGKPEYNMVXSDKRADAVAK.I` | I now X; L now X |
| 1 | `R.SQIEADIAAVYAEGPALAMVDSDKGITNLHVPSDIIIDASMPAAIR.S` | |
| 2 | `R.SZIZ+0.984AB+0.984IAAVYAEGPAJAMVDSDKGITBJHVPSDIXIDASMPAAIR.S` | Q now Z; E now Z+0.984; D now B+0.984; N now B; L now J; I now X |

Note that I used the MzidToTSVConverter to convert the .mzid file to TSV format: https://github.com/PNNL-Comp-Mass-Spec/Mzid-To-Tsv-Converter/releases

That program inserts mod mass values after residues with dynamic modifications, giving, for example, B+0.984.
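To compare the two searches programmatically, something like the following could check that a mode 2 peptide is consistent with its mode 1 counterpart. It is a rough sketch under several assumptions: peptides use the `K.SEQUENCE.R` layout seen above, the only mass annotation present is +0.984 (an oxidized methionine would need extra handling), and X stands for I or L because X was given the Leucine formula. The names `ALLOWED`, `residues`, and `is_compatible` are made up.

```python
import re

# Standard residues each (letter, carries +0.984) pair can represent,
# following the mods.txt entries discussed in this thread
ALLOWED = {
    ("B", False): {"N"}, ("B", True): {"D"},
    ("Z", False): {"Q"}, ("Z", True): {"E"},
    ("J", False): {"I", "L"},
    ("X", False): {"I", "L"},
}
TOKEN = re.compile(r"([A-Z])(\+0\.984)?")

def residues(peptide):
    """Strip the flanking 'K.' / '.R' and return (letter, has_mod) pairs."""
    core = peptide[2:-2]
    return [(aa, bool(mod)) for aa, mod in TOKEN.findall(core)]

def is_compatible(original, ambiguous):
    """True if every residue of the ambiguous peptide can stand for the
    residue at the same position in the original peptide."""
    orig, amb = residues(original), residues(ambiguous)
    if len(orig) != len(amb):
        return False
    for (o, _), token in zip(orig, amb):
        if o not in ALLOWED.get(token, {token[0]}):
            return False
    return True

mode1 = "R.SQIEADIAAVYAEGPALAMVDSDKGITNLHVPSDIIIDASMPAAIR.S"
mode2 = "R.SZIZ+0.984AB+0.984IAAVYAEGPAJAMVDSDKGITBJHVPSDIXIDASMPAAIR.S"
print(is_compatible(mode1, mode2))  # True
```

A B without the +0.984 annotation only matches N, so the check also catches a D that was reported as an unmodified B.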

tivdnbos commented 6 years ago

Thanks for the extensive information, I'll try it by the end of this week and I'll let you know the result.

Best, Tim
