What to do about B and Z amino acid letters?

Motivation

In working on this PR: https://github.com/czbiohub/sencha/pull/74, I've added the debug=False option to sencha.index.make_protein_index and discovered a few issues with creating an index on real data, specifically amino acid characters beyond the usual 20-letter alphabet:

/home/olga/code/sencha/sencha/index.py - 2020-06-04 18:23:07,875 DEBUG: The k-mer "BFDKVSNEP" contained non-amino acid characters: B, skipping
The k-mer "KIYIGTPPZ" contained non-amino acid characters: Z, skipping
/home/olga/code/sencha/sencha/index.py - 2020-06-04 18:23:07,876 DEBUG: The k-mer "KIYIGTPPZ" contained non-amino acid characters: Z, skipping

Here's an amino acid to 3-letter code to 1-letter code table stolen from here:

Amino acid	Three letter code	One letter code
alanine	ala	A
arginine	arg	R
asparagine	asn	N
aspartic acid	asp	D
asparagine or aspartic acid	asx	B
cysteine	cys	C
glutamic acid	glu	E
glutamine	gln	Q
glutamine or glutamic acid	glx	Z
glycine	gly	G
histidine	his	H
isoleucine	ile	I
leucine	leu	L
lysine	lys	K
methionine	met	M
phenylalanine	phe	F
proline	pro	P
serine	ser	S
threonine	thr	T
tryptophan	trp	W
tyrosine	tyr	Y
valine	val	V

Here it mentions how asparagine (N) + aspartic acid (D), and glutamine (Q) + glutamic acid (E) may not be distinguishable. Furthermore, this page mentions specifically:

Sometimes it is not possible to differentiate two closely related amino acids, therefore we have the special cases:

asparagine/aspartic acid - asx - B

glutamine/glutamic acid - glx - Z

Which means:

B --> D or N
Z --> E or Q

Biologically, this can happen because the protein sequences are validated using mass spectrometry, and the difference in mass (roughly 1 Da) between asparagine (N) and aspartic acid (D), or between glutamine (Q) and glutamic acid (E), may be undetectable. So researchers use the ambiguous letter to indicate that the residue could be either of the two amino acids.

What is the effect?

Well, it may not have much of an effect for some reduced alphabets. For example, in the current Dayhoff mapping, D, E, N and Q all map to the same category, so B and Z each still resolve to a single reduced letter:

DAYHOFF_MAPPING = {
...
    # Acid and amide
    "D": "c",
    "E": "c",
    "N": "c",
    "Q": "c",
...

https://github.com/czbiohub/sencha/blob/7b63521a6da9216aeabea42a512115678261cd43/sencha/sequence_encodings.py#L32

However, this is not true for all alphabets, e.g. in SDM12, they all map to different categories:

SDM12_MAPPING = {
...
    "D": "b",
...
    "E": "c",
...
    "N": "d",
...
    "Q": "e",
...

https://github.com/czbiohub/sencha/blob/7b63521a6da9216aeabea42a512115678261cd43/sencha/sequence_encodings.py#L165
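To make the difference between the two alphabets concrete, here is a minimal sketch that checks whether an ambiguous letter still resolves to a single reduced-alphabet code. It uses only the four entries quoted above, not the full mappings from sequence_encodings.py, and the names DAYHOFF_SUBSET, SDM12_SUBSET, AMBIGUOUS and resolve_in_alphabet are hypothetical, not part of sencha.

```python
# Only the D/E/N/Q entries quoted in this issue; the full mappings
# live in sencha/sequence_encodings.py.
DAYHOFF_SUBSET = {"D": "c", "E": "c", "N": "c", "Q": "c"}
SDM12_SUBSET = {"D": "b", "E": "c", "N": "d", "Q": "e"}

# Ambiguous one-letter codes and the residues they may stand for.
AMBIGUOUS = {"B": ("D", "N"), "Z": ("E", "Q")}


def resolve_in_alphabet(letter, mapping):
    """Return the reduced code if every candidate residue agrees, else None."""
    codes = {mapping[residue] for residue in AMBIGUOUS[letter]}
    return codes.pop() if len(codes) == 1 else None


for name, mapping in [("dayhoff", DAYHOFF_SUBSET), ("sdm12", SDM12_SUBSET)]:
    for letter in AMBIGUOUS:
        print(name, letter, resolve_in_alphabet(letter, mapping))
```

With these subsets, B and Z both resolve under Dayhoff but neither resolves under SDM12, so a fallback is only needed for alphabets like the latter.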

What to do about this?

Some options are:

1. Ignore all k-mers containing B or Z
2. Randomly choose one of D or N for B, and E or Q for Z
   1. E.g. for the k-mer ABA, randomly choose one of ADA and ANA to add
   2. E.g. for the k-mer AZA, randomly choose one of AQA and AEA to add
3. Add both versions of the replacement
   1. E.g. for the k-mer ABA, add both ADA and ANA
   2. E.g. for the k-mer AZA, add both AQA and AEA
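For discussion, the three options could be sketched roughly like this. This is not the current sencha code: the function name kmers_to_add, the strategy names "skip", "random" and "both", and the rng parameter are all made up for illustration.

```python
import random
from itertools import product

# Ambiguous one-letter codes and the residues they may stand for.
AMBIGUOUS = {"B": ("D", "N"), "Z": ("E", "Q")}


def kmers_to_add(kmer, strategy="skip", rng=None):
    """Return the list of concrete k-mers to index for one input k-mer.

    strategy (hypothetical names):
      "skip"   - drop any k-mer containing B or Z
      "random" - pick one replacement per ambiguous residue
      "both"   - add every possible replacement
    """
    # Per position: the candidate residues (a 1-tuple for unambiguous letters).
    choices = [AMBIGUOUS.get(residue, (residue,)) for residue in kmer]
    if all(len(candidates) == 1 for candidates in choices):
        return [kmer]  # nothing ambiguous, keep as-is
    if strategy == "skip":
        return []
    if strategy == "random":
        rng = rng if rng is not None else random
        return ["".join(rng.choice(candidates) for candidates in choices)]
    if strategy == "both":
        return ["".join(combo) for combo in product(*choices)]
    raise ValueError(f"unknown strategy: {strategy!r}")


print(kmers_to_add("ABA", "skip"))
print(kmers_to_add("ABA", "random", rng=random.Random(0)))
print(kmers_to_add("AZA", "both"))
```

One thing to keep in mind with "both": a k-mer with n ambiguous residues expands to 2^n k-mers, and "random" makes the index non-deterministic unless the seed is fixed.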

Thoughts? cc @bluegenes
