Merge BioSeq.jl into the Seq module

dcjones commented 10 years ago

I've been doing some hand-wringing over sequence representations. Here's a long rant with my thoughts about this. This might be a controversial plan, so I want to give people a chance to comment before I get too far trying to implement anything. I hope to take as much code as I can from BioSeq.jl, but obviously this is somewhat of a departure.

Sequence types should reflect the difference between sequence patterns, sequence alignments, and sequences. Sequences are strings over alphabets representing DNA, RNA, or AA. Sequence patterns contain ambiguity codes and represent motifs that a sequence might match. A sequence can always be converted to a sequence pattern, but not necessarily vice versa, similar to the relationship between strings and regular expressions. Finally, aligned sequences can contain indications of gaps and deletions.

The conflation of these things increases the complexity of other libraries and leads to programming errors. Separating these concepts will lead to a more Julian design – allowing one to dispatch on a more specific type and throw errors when sequence-like types are used incorrectly together.
Nucleotide sequences should be represented in 2-bit encoding. Indexing becomes slightly more difficult, in exchange for greatly decreased memory use. Furthermore, 2-bit representations allow for extremely efficient reverse-complement operations using lookup tables, equality comparisons, enumeration of k-mers, indexing into substitution score matrices, etc.
Amino acid sequences should be represented using 8-bit encoding. Although we could use 5-bit encoding, the space savings would be small and the indexing would be complicated by symbols spanning byte boundaries.
Amino acid symbols should not use ASCII character representation, but rather the bytes 0x01...0x14 This will also aid in matching against motifs and indexing into substitution score matrices. See point 5.
Sequence patterns should be represented as a bit masks. A symbol in a nucleotide sequence pattern is a bitmask 0b0000, 0b0001, 0b0010, ..., 0b1111. Each bit corresponds to the nucleotide that it matches. Similarly, symbols in amino acid patterns are represented an a 32-bit masks. This will allow for extremely efficient pattern matching code. IUPAC ambiguity codes are translated to bit masks during parsing.
N (is in NA) is allowable in all sequence and pattern types. The semantics of N will depend on the operation.
N is represented in amino acid sequences with 0x15 and in nucleotide sequences using a mask. . 8-bit AA sequences have extra bits so we use a special value. 2-bit sequences have no bits to spare, but more importantly, Ns tend to occur in large blocks in nucleotide sequences, creating an opportunity to implement an efficient N mask using run-length encoding or an interval tree.
Aligned sequences are represented by a sequence paired with a list of edit operations. This makes it trivial to obtain the sequence independent of the alignment. It can also be a more space efficient representation than storing symbols to represent gaps, etc. E.g. aligned RNA-Seq reads often span exons, so thousands of - symbols in a 75nt read is pretty inefficient.

I don't know the full gamut of what people do with software representations of sequences, so let me know if there's a use case that this plan won't handle gracefully.

One issue is unusual alphabets: are there legitimate cases where custom alphabets are needed? Maybe I needs to be represented in an A-to-I editing experiment? I haven't made up my mind whether we should support that.

diegozea commented 10 years ago

At this moment BioSeq.jl have a 2-bit encoding DNA sequence with the type DNA2Seq. But... I don’t know if is always worthy use a 2-bit representation. I see, at least, three diferent kind of problems when you are working with biological data:

Problems where you need to be fast in IO operations. A lot of formats are based on some kind of strings representation (ASCII encoded). This is the reason for BioPerl being popular; Perl is fast and easy for working with strings. 8-bit -> 2-bit -> 8-bit or 8-bit -> other 8-bit representation -> 8-bit don’t look the best option on this kind of problems.
Problems when you need to operate fast on sequences. In this problems, alternative representations can be better. The ape coding scheme or the representation of aminoacid from 1 to 20 in order to get faster indexation of matrices. I try to solve this on Bio seq.jl using the Alphabet type.
Problems where you need to have a lot of data on memory. Here a 2-bit representation of nucleic acids is the best option.

dcjones commented 10 years ago

My suspicion is that converting between 8-bit and 2-bit is not going to be a bottleneck. But I'll put together a benchmark so we actually have some numbers. I think a bigger reason BioPerl doesn't use 2-bit encoding is that it's hard to do efficiently in Perl without dropping into C. In Julia we can make it both fast and convenient.

I agree that we should have a representation like what ape uses, but I want to distinguish it from regular sequences. So I'm thinking we have

DNASequence: two bit encoded regular sequence.
DNAPattern: bit mask encoded pattern, using a an encoding similar to ape
DNAAlignment: a sequence with gap and indel information.

The main reason for this is that I want to be able to write a function.

function foo(x::DNASequence)
    ...
end

and know that when I index into x I'll get a nucleotide and not a gap or an ambiguity code.

Case in point: I did some quick grepping through the the BioPython sources and counted 73 places where they have to explicitly check that a sequence has the right alphabet, usually checking that it doesn't have gaps and is unambiguous. We should try to leverage Julia's type system to avoid that ugliness.

prcastro commented 9 years ago

Seq module is up and running, so I believe it's time to close this issue. What do you think?

BioJulia / Bio.jl

Merge BioSeq.jl into the Seq module #6