Motivation

In working on this PR: https://github.com/czbiohub/sencha/pull/74, I've added the `debug=False` option to `sencha.index.make_protein_index` and discovered a few issues with creating an index on real data, specifically amino acid characters beyond the usual 20-letter alphabet.
Here's an amino acid to 3-letter code to 1-letter code table stolen from here:
| Amino acid | Three letter code | One letter code |
| --- | --- | --- |
| alanine | ala | A |
| arginine | arg | R |
| asparagine | asn | N |
| aspartic acid | asp | D |
| asparagine or aspartic acid | asx | B |
| cysteine | cys | C |
| glutamic acid | glu | E |
| glutamine | gln | Q |
| glutamine or glutamic acid | glx | Z |
| glycine | gly | G |
| histidine | his | H |
| isoleucine | ile | I |
| leucine | leu | L |
| lysine | lys | K |
| methionine | met | M |
| phenylalanine | phe | F |
| proline | pro | P |
| serine | ser | S |
| threonine | thr | T |
| tryptophan | trp | W |
| tyrosine | tyr | Y |
| valine | val | V |
Here it mentions how asparagine (N) + aspartic acid (D), and glutamine (Q) + glutamic acid (E) may not be distinguishable. Furthermore, this page mentions specifically:
Sometimes it is not possible to differentiate two closely related amino acids, therefore we have the special cases:
asparagine/aspartic acid - asx - B
glutamine/glutamic acid - glx - Z
Which means:
B --> D or N
Z --> E or Q
Biologically, this can happen because the protein sequences are validated using mass spectrometry, and the difference in mass between asparagine (N) and aspartic acid (D), or between glutamine (Q) and glutamic acid (E), may be undetectable. So the researchers use the ambiguous letter to indicate that the residue could be either of the two amino acids.
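To put a rough number on "may be undetectable", here is a small sketch of the mass differences involved. The residue masses are standard monoisotopic values from general reference tables, not from sencha, and the names `RESIDUE_MASS` and `AMBIGUOUS_PAIRS` are made up for illustration. Whether a given instrument can resolve a gap this small depends on its setup.

```python
# Monoisotopic residue masses in daltons (standard reference values,
# not taken from sencha).
RESIDUE_MASS = {
    "N": 114.04293,  # asparagine
    "D": 115.02694,  # aspartic acid
    "Q": 128.05858,  # glutamine
    "E": 129.04259,  # glutamic acid
}

# Each ambiguous code covers an (amide, acid) pair.
AMBIGUOUS_PAIRS = {"B": ("N", "D"), "Z": ("Q", "E")}

# Mass gap between the two residues hidden behind each ambiguous code.
mass_differences = {
    code: RESIDUE_MASS[acid] - RESIDUE_MASS[amide]
    for code, (amide, acid) in AMBIGUOUS_PAIRS.items()
}

for code, diff in mass_differences.items():
    print(f"{code}: {diff:.5f} Da")
```

Both pairs differ by a bit under 1 Da, which is small next to residue masses of 114 to 129 Da.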
What is the effect?
Well, it may not have too much of an effect for some reduced alphabets. For example, in the current Dayhoff mapping, `D`, `E`, `N` and `Q` all map to the same category:
https://github.com/czbiohub/sencha/blob/7b63521a6da9216aeabea42a512115678261cd43/sencha/sequence_encodings.py#L32
However, this is not true for all alphabets. For SDM12, for example, they all map to different categories:
https://github.com/czbiohub/sencha/blob/7b63521a6da9216aeabea42a512115678261cd43/sencha/sequence_encodings.py#L165
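One way to check any reduced alphabet for this problem is to count how many categories the residues behind `B` and `Z` land in. The sketch below is self-contained and does not import sencha. The groupings are a hand transcription of the Dayhoff and SDM12 groups and the category indices are arbitrary, so treat them as assumptions and verify them against the linked `sequence_encodings.py`. `build_mapping` and `ambiguity_is_harmless` are hypothetical helper names.

```python
# Groupings transcribed by hand (an assumption; check against the linked
# sequence_encodings.py). Each string is one category of the reduced alphabet.
DAYHOFF_GROUPS = ["C", "AGPST", "DENQ", "HRK", "ILMV", "FWY"]
SDM12_GROUPS = ["A", "D", "KER", "N", "TSQ", "YF", "LIVM", "C", "W", "H", "G", "P"]

# Ambiguous one-letter codes and the residues they may stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}


def build_mapping(groups):
    """Map each amino acid letter to the index of its group."""
    return {aa: i for i, group in enumerate(groups) for aa in group}


def ambiguity_is_harmless(mapping, code_to_residues):
    """For each ambiguous code, report whether all of its possible
    residues fall in the same reduced-alphabet category."""
    return {
        code: len({mapping[aa] for aa in residues}) == 1
        for code, residues in code_to_residues.items()
    }


dayhoff_ok = ambiguity_is_harmless(build_mapping(DAYHOFF_GROUPS), AMBIGUOUS)
sdm12_ok = ambiguity_is_harmless(build_mapping(SDM12_GROUPS), AMBIGUOUS)
print("Dayhoff:", dayhoff_ok)
print("SDM12:", sdm12_ok)
```

When a code is harmless for an alphabet, either residue gives the same encoded letter, so the choice only matters for alphabets like SDM12.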
What to do about this?
Some options are:
- `D` or `N` for `B`, and `E` or `Q` for `Z`
  - `ABA`, randomly choose one of `ADA` and `ANA` to add
  - `AZA`, randomly choose one of `AQA` and `AEA` to add
  - `ABA`, add both `ADA` and `ANA`
  - `AZA`, add both `AQA` and `AEA`
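The two behaviours in the list above could be sketched roughly as follows. This is a standalone illustration, not sencha code: `expand_all` and `expand_random` are hypothetical names, and how either would plug into `make_protein_index` is left open.

```python
import itertools
import random

# Ambiguous one-letter codes and the residues they may stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}


def expand_all(kmer):
    """'Add both': return every unambiguous version of the k-mer."""
    choices = [AMBIGUOUS.get(letter, letter) for letter in kmer]
    return ["".join(combo) for combo in itertools.product(*choices)]


def expand_random(kmer, rng=random):
    """'Randomly choose one': resolve each ambiguous letter at random.

    Pass a seeded random.Random instance as rng for reproducible output.
    """
    return "".join(rng.choice(AMBIGUOUS.get(letter, letter)) for letter in kmer)


print(expand_all("ABA"))
print(expand_random("AZA", random.Random(0)))
```

One trade-off to keep in mind: "add both" yields 2^n k-mers for a k-mer with n ambiguous letters, while the random choice keeps the count fixed but makes the index depend on the random seed.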
Thoughts? cc @bluegenes