EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code
Other
46 stars 26 forks source link

Markdown documentation #36

Closed horta closed 5 years ago

horta commented 5 years ago

Hi again,

documentation/codestyle.md states that you wish to migrate from Latex to Markdown (GFM). Since I making PDF was failing here, I decided to write a script that helps that conversion.

Are you interested in pull requests for those cases that case?

If so, how are you thinking in handling equations and cexcerpts?

If the equations are not very complicated, they can be handled by unicode/ampersand itself:

Latex:

% latex
S(x) =  \frac{\sum_{y \in D(x)}  S(y) } { |D(x)| }

Unicode and HTML ampersand (works with GFM):

S(𝑥) = ∑𝑦 ϵ D(𝑥)S(𝑦) /|D(𝑥)|

Here is an example of a converted .tex without manual editing.

esl_alphabet.md example

The alphabet module contains routines for digitizing alphabetic biosequences.

It is convenient to represent nucleotides and amino acids as array indices 0..3 or 0..19, respectively, for efficiency and other reasons. It is also convenient to index the residues in a biosequence in 1..L coordinates instead of the C language's 0..L-1 array representation, partly for human readability, and also because some codes (dynamic programming alignment algorithms, for example) have boundary conditions where initializing a boundary at coordinate 0 is convenient.

Real biosequences do not consist of just four or twenty differen canonical symbols, though. The alphabet module provides mechanisms for dealing with several other biosequence coding issues:

The alphabet module provides standard defaults for protein, RNA, and DNA alphabets which follow both community standards and IUPAC/IUBMB nomenclature for representing sequence residues in one-letter ASCII characters. Additionally, the design of the alphabet module is flexible enough to allow an application to customize its own alphabet, to deal with these issues almost any way it chooses.

Table~\ref{tbl:alphabet_api} lists the functions in the alphabet API. Easel maintains alphabet information in an ESL\_ALPHABET structure. An application usually creates its alphabet once, possibly even as a global variable. A digitized sequence dsq is an unsigned char * array of length L+2, where dsq[1..L] are digitized residues, and dsq[0] and dsq[L+1] are sentinel bytes (of value eslSENTINEL, 127).

‘esl Create alphabet of standard type.
‘esl Create a custom alphabet.
‘esl Define an equivalent symbol.
‘esl Make an alphabet’s input map case insensitive.
‘esl Define degenerate symbol in custom alphabet.
‘esl Frees an alphabet object.
‘esl Digitize a sequence into new ‘dsq‘ space.
‘esl Digitize a sequence into existing ‘dsq‘ space.
‘esl Calculate avg score of degenerate residue.
‘esl Calculate expected score of degenerate residue.
‘esl Convert internal alphabet type code to output string.
‘esl Count a digital symbol towards a countvector.
‘esl Returns digital code for one ASCII character.
‘esl Returns TRUE given code for a degeneracy.
‘esl Returns TRUE given code for a fundamental residue.
‘esl Returns TRUE given code for a gap.
‘esl Returns TRUE given a degenerate character.
‘esl Returns TRUE given a fundamental residue.
‘esl Returns TRUE given a gap character.

The ‘alphabet‘ API.

An example of using the alphabet API

Figure~\ref{fig:alphabet_example} shows an example of creating a DNA alphabet and digitizing a short DNA sequence.

\begin{figure}
\input{cexcerpts/alphabet_example}
\caption{An example of using the `alphabet` module.}
\label{fig:alphabet_example}
\end{figure}

Concepts and terminology

A symbol is a 7-bit ASCII input character, representing a residue, gap, nonresidue, or degeneracy. A code is the digital internal representation of the symbol as an \ccode{unsigned char} in the range $0..127$, suitable for use as an array index. The alphabet module translates input symbols into internal digital codes.

We distinguish between an input alphabet, an internal alphabet, and a canonical alphabet. The input alphabet consists of all the symbols that Easel allows in an input biosequence character string. The internal alphabet is the standardized one-letter alphabet that Easel deals with. The canonical alphabet is the fundamental set of 4 nucleotides or 20 amino acids.

Easel deals with all of the complications of sequence encoding using two concepts, equivalency and degeneracy. Equivalency defines how the input alphabet maps to the internal alphabet. Degeneracy defines how the internal alphabet maps to the canonical alphabet.

Equivalent residues are symbols that are accepted in an input sequence character string and silently translated into an appropriate internal code. Characters in the input alphabet are mapped many-to-one to the internal alphabet using an input map. One use of equivalency is to map both lower and upper case input to the same internal symbol. Another use is to allow several different input characters to mean a gap, 'any' symbol, or 'nonresidue' symbol. Another use is to silently accept and fix nonstandard but common input errors, such as tolerating the use of X to mean N in nucleic acid sequences.

Degenerate residues are codes in the internal alphabet that are mapped one-to-many onto canonical residue codes, using a \esldef{degeneracy map}. In addition to mapping the degeneracy codes onto the canonical alphabet, the degeneracy mechanism is also used to deal with unusual and modified residues. Selenocysteine, for instance, is represented by default as a U, but treated as a degenerate code for C (cysteine). The rationale for this will be described in more detail below.

The internal alphabe

Easel's internal alphabet is a string (a->sym) of length Kp, which contains:

Residues 0..K-1 must be the canonical alphabet. Residue K must be the gap symbol. Residues K+1..Kp-4 must be the degenerate and modified residue symbols (there can be zero of these). Residue Kp-3 must be the completely degenerate symbol (such as X for protein sequence or N for nucleic acid sequence); all alphabets must have such a symbol. Residue Kp-2 must be the not-a-residue symbol. Residue Kp-1 must be the missing data symbol. Because the 'any' symbol, 'not-a-residue' symbol, and the two kinds of gap symbols are mandatory in any alphabet, Kp $\geq$ K+4. Aside from these constraints, symbols may occur in any order.

The digital code used for each residue is then the index of a residue in this string, 0..Kp-1. The only other value that can appear in a digitized sequence is eslSENTINEL (127), the sentinel byte in positions 0 and L+1 of a digitized sequence of length L.

The rationale for the ordering is the following. Most applications will define residue scores in vectors and matrices that are smaller than the full range of the internal alphabet; for instance, it's common to only have K scores for the canonical residues. As much as possible, we want array indices to be the same whether we're accessing the full internal alphabet or a smaller score vector or matrix. So: we expect many applications to have score vectors or matrices that only contain the K canonical residues, so the canonical residues go first. We expect some applications to trea gaps as an extra symbol, and provide K+1 position-specific scores or a K+1 $\times$ K+1 score matrix, so the gap character is next. We expect a few applications to optimize degeneracy scoring by precalculating them in Kp-2 vectors or $Kp-2 \times Kp-2$ matrices, so the degeneracies go next (the gap character at $K$ might then go unused in the score vectors and matrices, but that's a minor inefficiency). The 'any' symbol should be at a predictable position in the degeneracy list, so it's arbitrarily placed at the end, in position Kp-3. The most robust applications will also handle the not-a-residue symbol (they may see translated stop codons), so it's next. Finally, the missing data symbol is expected to always require special handling when it occurs, rather than appearing in a score vector or matrix, so it's put last.

The standard alphabets: DNA, RNA, protein

The three standard internal alphabets are:

Type ‘sym‘ equivs gaps ‘K‘ ‘Kp‘
‘eslRNA‘ ‘ACGU-RYMKSWHBVDN*` T=U; X=N ‘-_.‘ 4 18
‘eslDNA‘ ‘ACGT-RYMKSWHBVDN*` U=T; X=N ‘-_.‘ 4 18
‘eslAMINO‘ ‘ACDEFGHIKLMNPQRSTVWY-BJZOUX*` ‘-_.‘ 20 29

The sym string contains all the symbols that can be handled internally, and all the residues that can be represented when a digitized sequence is converted back to text. An application migh still convert some characters for its own purposes before displaying an alphabetic string; for instance, to use different gap symbols for insertions versus deletions, or to use upper/lower case conventions to represent match/insert positions in a profile HMM alignment.

The standard DNA and RNA alphabets follow published IUBMB recommendations (Nomenclature for incompletely specified bases in nucleic acid \citep{IUBMB85}), with an addition of X as an equivalence for N (acquiescing to the de facto BLAST filter standard of using X's to mask residues), and equivalencing T to U in RNA sequences (and vice versa in DNA).

The one-letter code for amino acids follows section 3AA-21 of the IUPAC recommendations \citep{IUPAC84}. The code is augmented by U for selenocysteine, as recommended in 1999 by the JCBN/NC-IUBMB Newsletter (\url{http://www.chem.qmul.ac.uk/iubmb/newsletter/1999/item3.html}). It is also augmented by O for pyrrolysine and J for a leucine/isoleucine ambiguity (from a mass spectrometry experiment), following usage in the RESID database (\url{http://www.ebi.ac.uk/RESID/}).

Degenerate residues

The symbols from K+1..Kp-4 in the internal alphabet are all treated as degenerate residues.

When creating a custom alphabet, each degenerate symbol is initialized by calling esl_alphabet_SetDegeneracy(alphabet, c, syms) to assign degenerate alphabetic symbol c to the alphabetic symbols in the string syms. For example, esl_alphabet_SetDegeneracy(a, 'R', \"AG\") assigns R (purine) to mean A or G. For the standard biosequence alphabets, this is done automatically to define the proper degeneracy codes.

For amino acid alphabets, the default code is:

  esl_alphabet_SetDegeneracy(a, 'B', "ND");
  esl_alphabet_SetDegeneracy(a, 'J', "IL");
  esl_alphabet_SetDegeneracy(a, 'Z', "QE");

For RNA alphabets, the default code is:

  esl_alphabet_SetDegeneracy(a, 'R', "AG");
  esl_alphabet_SetDegeneracy(a, 'Y', "CU");
  esl_alphabet_SetDegeneracy(a, 'M', "AC");
  esl_alphabet_SetDegeneracy(a, 'K', "GU");
  esl_alphabet_SetDegeneracy(a, 'S', "CG");
  esl_alphabet_SetDegeneracy(a, 'W', "AU");
  esl_alphabet_SetDegeneracy(a, 'H', "ACU");
  esl_alphabet_SetDegeneracy(a, 'B', "CGU");
  esl_alphabet_SetDegeneracy(a, 'V', "ACG");
  esl_alphabet_SetDegeneracy(a, 'D', "AGU");

For DNA alphabets, the calls are is the same as for RNA code, but with T in place of U.

Implementation: the degeneracy map

The alphabet's degeneracy map is implemented in an array a->degen[0..Kp-1][0..K-1] of 1/0 (TRUE/FALSE) flags. a->degen[x][y] == TRUE indicates that the residue set $D(x)$ for degeneracy code x contains base residue y. a->ndegen[x] contains the cardinality $|D(x)|$, how many base residues are represented by degeneracy code x.

For the two kinds of gap symbols, the degeneracy map is empty; all flags are FALSE and the cardinality is 0.

Because character Kp-3 in the internal alphabet is automatically assumed to be an any character (such as 'N' for DNA or RNA, 'X' for protein), a->degen[Kp-3][i] = 1 for all $i=0..K-1$, and a->ndegen[Kp-3] = K.

The storage of the degeneracy map is a little wasteful. We really only need rows a->degen[K+1..Kp-3], but optimizing this would create some index translation hassles, and it doesn't seem worth it.

Equivalent residues

The concept of equivalent residues allows an input symbol to be mapped to a different internal symbol. One use of equivalence is to map both lower and upper case input to the same internal representation. Another use is to allow several different input characters to mean a gap. Another use is to silently accept and fix nonstandard bu common input errors, such as the use of T instead of U in RNA sequences (or vice versa in DNA), or the use of X instead of N as an ambiguity code in nucleic acid sequences.

The call esl_alphabet_SetEquiv(a, 'U', 'T'), for example, makes an alphabet interpret U as a T (encoding both as 3, in the case of the standard DNA and RNA alphabets).

All three standard alphabets accept \_ or . symbols as equivalences for the standard gap symbol -. An application can define additional gap characters, such as ,, by calling esl_alphabet_SetSynonym(a, ',', '-') on one of the standard alphabets to define additional equivalences (that is, you don't have to create a custom alphabet to add new equivalences).

esl_alphabet_SetCaseInsensitive() maps both upper case and lower case input alphabetic characters map to their equivalent in the internal alphabet in a case-insensitive manner. This function works only on residues that have already been declared to be part of the alphabet, so when defining a custom alphabet, it must be called after all individual equivalences have been defined. The standard alphabets are always set to be case insensitive.

Implementation of equivalent residues: the input map

Internally, an \textbf{input map}, a->inmap[0..127], specifies how an input ASCII 7-bit text symbol is converted to digital code. a->inmap['T'] = 3 in the standard DNA alphabet, for example, and the call esl_alphabet_SetSynonym(a, 'U', 'T') sets a->inmap['U'] = a->inmap['T'].

The elements in input maps are of type unsigned char. Legal values are 0..127 (values that can be cast to the \ccode{unsigned char} codes in a digitized sequence) and two additional flags with negative values, eslILLEGAL\_CHAR (255) and eslIGNORED\_CHAR (254).

Unusual or modified residues

In addition to the canonical 4 or 20 residues and their ambiguity codes, there are many unusual and/or modified residues. For instance, there are many posttranscriptional or posttranslational modifications on residues in RNAs and proteins. Some databases try to capture this information in a single-letter alphabetic code, such as the Sprinzl transfer RNA database \cite{Sprinzl98}.

Additionally, and perhaps more importantly, proteins are known to contain at least two additional genetically encoded amino acids, selenocysteine and pyrrolysine. Selenocysteine is represented by a U according to the IUPAC standard, and pyrrolysine is represented by a O in the RESID database at EBI.

Unusual one-letter residue codes pose a tradeoff issue for sequence analysis applications. On the one hand, an application should recognize symbols for unusual or modified residues, and be able to represent them both internally and in any sequence output. For example, no application should read an input selenocysteine residue (U) and output it as a cysteine (C) -- this changes the original sequence and causes data corruption.\footnote{However, a least one the main public protein databases (UniProt) has already chosen to replace all selenocysteines with C and all pyrrolysines with K, for fear of breaking legacy sequence analysis software. So, this data corruption is already a fact of life.} On the other hand, most sequence analysis applications would not want to take the trouble to define a canonical alphabet larger than the usual 4 or 20 residues, and then have to parameterize tha alphabet, just to be able to handle a few rare residues. (Pyrrolysine, for example, has only been found in a handful of proteins in a few Archaea.) It is useful to be able to deal with probability parameters and scores only for the canonical alphabet. However (on yet another hand!) in some cases one would want to write a specialized application that parameterizes unusual residues as part of its canonical alphabet -- for instance, an application for analyzing posttranscriptional tRNA modifications, for example.

Therefore, Easel must not force an input selenocysteine or pyrrolysine (or any other unusual residue) to be recoded as an arbitrary symbol (such as cysteine or lysine). That is, unusual symbols cannot be treated as equivalences, but must be allowed to be part of the internal alphabet. However, Easel can allow unusual symbols to be treated as noncanonical, and score them as some other arbitrary residue, as a reasonable approximation. Thus for mos purposes, unusual symbols are best handled as a special kind of degeneracy, with a one-to-one degeneracy map from the unusual symbol to the closest canonical residue.

Therefore, the default amino acid alphabet accepts selenocysteine (U) and pyrrolysine symbols (O) and represents them in the internal alphabet, and maps them as degeneracies onto cysteine (C) and lysine (K), respectively.

When that behavior is not suitable, an application can also define any custom alphabet it chooses, as described below.

Creating a custom alphabe

Figure~\ref{fig:alphabet_example2} shows an example of creating a customized 22-letter amino acid alphabet that includes the U selenocysteine code and the O pyrrolysine code.

\begin{figure}
\input{cexcerpts/alphabet_example2}
\caption{An example of creating a custom alphabet.}
\label{fig:alphabet_example2}
\end{figure}

Scoring degenerate residues

To score a degenerate residue code $x$, Easel provides two strategies. One set of functions assigns an average score:

  S(x) =  \frac{\sum_{y \in D(x)}  S(y) } { |D(x)| },

where $D(x)$ is the set of residues $y$ represented by degeneracy code $x$ (for example, $D(\mbox{R}) = { \mbox{A,G} }$), $| D(x) |$ is the number of residues that the degeneracy code includes, and $S(y)$ is the score of a base residue $y$. Because scores $S(y)$ are commonly kept as integers, floats, or doubles, depending on the application, three functions are provided that differ only in the storage type of the scores: esl_abc_IAvgScore(a,x,sc), esl_abc_FAvgScore(a,x,sc), and esl_abc_DAvgScore(a,x,sc) calculate and return the average score of residue x in alphabet a given a base score vector sc[0]..sc[K-1] for integers, floats, and doubles, respectively.

A second set of functions assigns an expected score, weighted by an expected occurrence probability $p_y$ of the residues $y$ (often the random background frequencies):

  S(x) =  \frac{\sum_{y \in D(x)}  p_y S(y) } { \sum_{y \in D(x)} p_y },

These three functions are esl_abc_IExpectScore(a,x,sc,p), esl_abc_FExpectScore(a,x,sc,p), and esl_abc_DExpectScore(a,x,sc,p). For the integer and floa versions, the probability vector is in floats; for the double version, the probability vector is in doubles.

For efficiency reasons, an application might choose to preculate scores for all possible degenerate codes it might see. HMMER, for example, turns probability vectors of length K into score residues of length Kp.

An application might also choose to score residues on-the-fly, using score vectors of length K. Each input residue x would then have to be tested to see if it is degenerate, before scoring i appropriately. esl_abc_IsBasic(a, x) returns TRUE if x is in the basic set of K residues in alphabe a, and FALSE otherwise. Similarly, esl_abc_IsGap(a,x) tests whether $x$ is a gap, and esl_abc_IsDegenerate(a,x) tests whether $x$ is a degenerate residue.

cryptogenomicon commented 5 years ago

I would only be interested in this if it could be done with next-to-zero time investment on my part, which I don't think is possible, because the conversion will not be complete. So it's not a high priority right now, since priority is on more substantial issues (HMMER4), but thanks. I'm happy to continue the way I'm going, which is to slowly move toward Markdown and away from .tex.