czbiohub-sf / orpheum

Orpheum (previously called, and published as, sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.
MIT License

What to do about B and Z amino acid letters? #78

Open olgabot opened 4 years ago


Motivation

While working on this PR: https://github.com/czbiohub/sencha/pull/74, I've added the debug=False option to sencha.index.make_protein_index and discovered a few issues with creating an index on real data, specifically amino acid characters beyond the usual 20-letter alphabet:

/home/olga/code/sencha/sencha/index.py - 2020-06-04 18:23:07,875 DEBUG: The k-mer "BFDKVSNEP" contained non-amino acid characters: B, skipping
The k-mer "KIYIGTPPZ" contained non-amino acid characters: Z, skipping
/home/olga/code/sencha/sencha/index.py - 2020-06-04 18:23:07,876 DEBUG: The k-mer "KIYIGTPPZ" contained non-amino acid characters: Z, skipping
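For anyone who wants to reproduce the skipping behaviour outside of sencha, here is a minimal standalone sketch. It is not the code in sencha/index.py; the names `STANDARD_AMINO_ACIDS`, `kmerize` and `split_valid_kmers` are made up for illustration, and it assumes "valid" simply means the 20 standard one-letter codes.

```python
# Hypothetical helper names; this is NOT sencha's actual implementation.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def kmerize(sequence, ksize):
    """Yield every overlapping k-mer of length ksize from sequence."""
    for i in range(len(sequence) - ksize + 1):
        yield sequence[i:i + ksize]


def split_valid_kmers(sequence, ksize):
    """Separate k-mers into kept ones and (kmer, offending letters) pairs."""
    kept, skipped = [], []
    for kmer in kmerize(sequence, ksize):
        unexpected = sorted(set(kmer) - STANDARD_AMINO_ACIDS)
        if unexpected:
            skipped.append((kmer, unexpected))
        else:
            kept.append(kmer)
    return kept, skipped


# The first 9-mer contains B and is skipped; the second one is kept.
kept, skipped = split_valid_kmers("BFDKVSNEPA", ksize=9)
```

Counting the `skipped` entries per letter on a real protein database would show how many k-mers are actually lost to B and Z.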

Here's an amino acid to 3-letter code to 1-letter code table stolen from here:

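| Amino acid                  | Three letter code | One letter code |
|-----------------------------|-------------------|-----------------|
| alanine                     | ala               | A               |
| arginine                    | arg               | R               |
| asparagine                  | asn               | N               |
| aspartic acid               | asp               | D               |
| asparagine or aspartic acid | asx               | B               |
| cysteine                    | cys               | C               |
| glutamic acid               | glu               | E               |
| glutamine                   | gln               | Q               |
| glutamine or glutamic acid  | glx               | Z               |
| glycine                     | gly               | G               |
| histidine                   | his               | H               |
| isoleucine                  | ile               | I               |
| leucine                     | leu               | L               |
| lysine                      | lys               | K               |
| methionine                  | met               | M               |
| phenylalanine               | phe               | F               |
| proline                     | pro               | P               |
| serine                      | ser               | S               |
| threonine                   | thr               | T               |
| tryptophan                  | trp               | W               |
| tyrosine                    | tyr               | Y               |
| valine                      | val               | V               |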

Here it mentions how asparagine (N) + aspartic acid (D), and glutamine (Q) + glutamic acid (E) may not be distinguishable. Furthermore, this page mentions specifically:

Sometimes it is not possible to differentiate two closely related amino acids, therefore we have the special cases:

  • asparagine/aspartic acid - asx - B
  • glutamine/glutamic acid - glx - Z

Which means:

Biologically, this can happen because the protein sequences are validated using mass spectrometry, and the difference in mass between asparagine (N) and aspartic acid (D), or between glutamine (Q) and glutamic acid (E), may be undetectable. So researchers use the ambiguous letter to indicate that the residue could be either of the two amino acids.
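To put a rough number on the mass argument, here is a small sketch. The residue masses are commonly tabulated monoisotopic values rounded to five decimals; they are not taken from the linked pages, so treat them as an assumption and check them against a reference table before relying on them.

```python
# Monoisotopic residue masses in daltons (assumed, commonly tabulated values).
MONOISOTOPIC_RESIDUE_MASS = {
    "N": 114.04293,  # asparagine
    "D": 115.02694,  # aspartic acid
    "Q": 128.05858,  # glutamine
    "E": 129.04259,  # glutamic acid
}


def mass_difference(first, second):
    """Absolute mass difference in daltons between two residues."""
    return abs(MONOISOTOPIC_RESIDUE_MASS[first] - MONOISOTOPIC_RESIDUE_MASS[second])


n_vs_d = mass_difference("N", "D")  # asparagine vs aspartic acid
q_vs_e = mass_difference("Q", "E")  # glutamine vs glutamic acid
```

Both pairs differ by just under 1 Da, which is small relative to the mass of a whole peptide, so an instrument with limited resolution could plausibly fail to tell them apart.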

What is the effect?

Well, it may not have much of an effect for some reduced alphabets. For example, in the current Dayhoff mapping, D, E, N and Q all map to the same category:

DAYHOFF_MAPPING = {
...
    # Acid and amide
    "D": "c",
    "E": "c",
    "N": "c",
    "Q": "c",
...

https://github.com/czbiohub/sencha/blob/7b63521a6da9216aeabea42a512115678261cd43/sencha/sequence_encodings.py#L32

However, this is not true for all alphabets, e.g. for SDM12, where they all map to different categories:

SDM12_MAPPING = {
...
    "D": "b",
...
    "E": "c",
...
    "N": "d",
...
    "Q": "e",
...

https://github.com/czbiohub/sencha/blob/7b63521a6da9216aeabea42a512115678261cd43/sencha/sequence_encodings.py#L165
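The Dayhoff vs. SDM12 difference can be checked mechanically. The sketch below is only an illustration: the two dictionaries contain just the four entries quoted above (not the full mappings from sequence_encodings.py), and `AMBIGUOUS` and `resolve_ambiguous` are hypothetical names.

```python
# B and Z each stand for one of two possible residues.
AMBIGUOUS = {"B": ("D", "N"), "Z": ("E", "Q")}

# Excerpts only: the four entries quoted in this issue, not the full mappings.
DAYHOFF_EXCERPT = {"D": "c", "E": "c", "N": "c", "Q": "c"}
SDM12_EXCERPT = {"D": "b", "E": "c", "N": "d", "Q": "e"}


def resolve_ambiguous(letter, mapping):
    """Return the reduced-alphabet letter if both candidates agree, else None."""
    encoded = {mapping[residue] for residue in AMBIGUOUS[letter]}
    return encoded.pop() if len(encoded) == 1 else None


dayhoff_b = resolve_ambiguous("B", DAYHOFF_EXCERPT)  # both D and N encode to "c"
sdm12_b = resolve_ambiguous("B", SDM12_EXCERPT)      # D and N disagree, so None
```

If this holds for a given alphabet, B and Z could be added to that mapping directly and the question below only matters for the protein alphabet and for alphabets like SDM12.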

What to do about this?

Some options are:

  1. Ignore all k-mers containing B or Z
  2. Randomly choose one of D or N for B, and E or Q for Z
    1. E.g. for the k-mer ABA, randomly choose one of ADA and ANA to add
    2. E.g. for the k-mer AZA, randomly choose one of AQA and AEA to add
  3. Add both versions of the replacement.
    1. E.g. for the k-mer ABA, add both ADA and ANA
    2. E.g. for the k-mer AZA, add both AQA and AEA
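To make the three options concrete, here is a rough sketch of each. None of this exists in sencha; the function names `skip_ambiguous`, `choose_random` and `expand_all` are hypothetical, and each one returns the list of k-mers that would be added to the index.

```python
import itertools
import random

# Candidate residues for each ambiguous letter.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}


def skip_ambiguous(kmer):
    """Option 1: drop any k-mer containing B or Z."""
    return [] if any(letter in AMBIGUOUS for letter in kmer) else [kmer]


def choose_random(kmer, rng=random):
    """Option 2: replace each B or Z with one randomly chosen candidate."""
    resolved = "".join(
        rng.choice(AMBIGUOUS[letter]) if letter in AMBIGUOUS else letter
        for letter in kmer
    )
    return [resolved]


def expand_all(kmer):
    """Option 3: add every combination of candidate replacements."""
    choices = [AMBIGUOUS.get(letter, letter) for letter in kmer]
    return ["".join(combo) for combo in itertools.product(*choices)]


aba_expanded = expand_all("ABA")  # ["ADA", "ANA"]
aza_expanded = expand_all("AZA")  # ["AEA", "AQA"]
```

Two trade-offs to keep in mind: option 2 makes the index non-deterministic unless a seeded `random.Random` instance is passed as `rng`, and option 3 adds 2^n k-mers for a k-mer with n ambiguous letters.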

Thoughts? cc @bluegenes