dcjones closed this issue 9 years ago
At the moment BioSeq.jl has a 2-bit encoded DNA sequence type, DNA2Seq. But... I don't know if it is always worth using a 2-bit representation. I see at least three different kinds of problems when you are working with biological data:
My suspicion is that converting between 8-bit and 2-bit is not going to be a bottleneck. But I'll put together a benchmark so we actually have some numbers. I think a bigger reason BioPerl doesn't use 2-bit encoding is that it's hard to do efficiently in Perl without dropping into C. In Julia we can make it both fast and convenient.
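For reference while the benchmark is pending, here is a minimal sketch of what 2-bit packing involves. The names `pack2bit`, `getnt`, and `unpack2bit` are hypothetical and are not the BioSeq.jl API; the A/C/G/T to 0..3 assignment is also just an assumption.

```julia
# Hypothetical helpers (not BioSeq.jl API): pack A/C/G/T into 2 bits each.
const CODES = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)
const LETTERS = ('A', 'C', 'G', 'T')

# Pack a string of unambiguous nucleotides, 32 per UInt64 word.
function pack2bit(s::AbstractString)
    n = length(s)
    words = zeros(UInt64, cld(n, 32))
    for (i, c) in enumerate(s)
        code = UInt64(CODES[c])          # KeyError on anything but A/C/G/T
        w, off = divrem(i - 1, 32)       # word index and slot within the word
        words[w + 1] |= code << (2 * off)
    end
    return words, n
end

# Recover the i-th nucleotide (1-based) from the packed words.
function getnt(words::Vector{UInt64}, i::Integer)
    w, off = divrem(i - 1, 32)
    code = Int((words[w + 1] >> (2 * off)) & 0x03)
    return LETTERS[code + 1]
end

# Convert back to an ordinary string.
unpack2bit(words, n) = join(getnt(words, i) for i in 1:n)
```

Each conversion is a shift, a mask, and a table lookup per nucleotide, which is why it seems unlikely to dominate real workloads, though only a benchmark can confirm that.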
I agree that we should have a representation like what ape uses, but I want to distinguish it from regular sequences. So I'm thinking we have
DNASequence: two bit encoded regular sequence.
DNAPattern: bit mask encoded pattern, using an encoding similar to ape.
DNAAlignment: a sequence with gap and indel information.
The main reason for this is that I want to be able to write a function
function foo(x::DNASequence)
...
end
and know that when I index into x I'll get a nucleotide and not a gap or an ambiguity code.
Case in point: I did some quick grepping through the BioPython sources and counted 73 places where they have to explicitly check that a sequence has the right alphabet, usually checking that it doesn't have gaps and is unambiguous. We should try to leverage Julia's type system to avoid that ugliness.
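To make the dispatch argument concrete, here is a rough sketch. The type names come from the list above, but the field layouts, the `Nucleotide` enum, and `gc_content` are assumptions made up for illustration, and a plain vector stands in for a real 2-bit packed buffer.

```julia
# Hypothetical element type: only the four unambiguous nucleotides exist.
@enum Nucleotide::UInt8 DNA_A DNA_C DNA_G DNA_T

struct DNASequence
    data::Vector{Nucleotide}    # stand-in for a real 2-bit packed buffer
end

struct DNAPattern
    masks::Vector{UInt8}        # one 4-bit mask per position
end

Base.getindex(s::DNASequence, i::Integer) = s.data[i]
Base.length(s::DNASequence) = length(s.data)

# A sequence always converts to a pattern: each nucleotide becomes a one-bit mask.
DNAPattern(s::DNASequence) = DNAPattern([0x01 << UInt8(x) for x in s.data])

# Accepts only plain sequences, so no gap or ambiguity checks are needed inside.
function gc_content(x::DNASequence)
    gc = count(i -> x[i] == DNA_C || x[i] == DNA_G, 1:length(x))
    return gc / length(x)
end
```

Calling `gc_content` on a `DNAPattern` raises a `MethodError` at the call site, so the alphabet check that BioPython does by hand is handled by dispatch instead.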
Seq module is up and running, so I believe it's time to close this issue. What do you think?
I've been doing some hand-wringing over sequence representations. Here's a long rant with my thoughts about this. This might be a controversial plan, so I want to give people a chance to comment before I get too far trying to implement anything. I hope to take as much code as I can from BioSeq.jl, but obviously this is somewhat of a departure.
Sequence types should reflect the difference between sequence patterns, sequence alignments, and sequences. Sequences are strings over alphabets representing DNA, RNA, or AA. Sequence patterns contain ambiguity codes and represent motifs that a sequence might match. A sequence can always be converted to a sequence pattern, but not necessarily vice versa, similar to the relationship between strings and regular expressions. Finally, aligned sequences can contain indications of gaps and deletions.
The conflation of these things increases the complexity of other libraries and leads to programming errors. Separating these concepts will lead to a more Julian design – allowing one to dispatch on a more specific type and throw errors when sequence-like types are used incorrectly together.
[...] 0x01...0x14. This will also aid in matching against motifs and indexing into substitution score matrices. See point 5.
[...] 0b0000, 0b0001, 0b0010, ..., 0b1111. Each bit corresponds to the nucleotide that it matches. Similarly, symbols in amino acid patterns are represented as 32-bit masks. This will allow for extremely efficient pattern matching code. IUPAC ambiguity codes are translated to bit masks during parsing.
N (is in NA) is allowable in all sequence and pattern types. The semantics of N will depend on the operation.
N is represented in amino acid sequences with 0x15 and in nucleotide sequences using a mask. 8-bit AA sequences have extra bits so we use a special value. 2-bit sequences have no bits to spare, but more importantly, Ns tend to occur in large blocks in nucleotide sequences, creating an opportunity to implement an efficient N mask using run-length encoding or an interval tree.
[...] - symbols in a 75nt read is pretty inefficient.
I don't know the full gamut of what people do with software representations of sequences, so let me know if there's a use case that this plan won't handle gracefully.
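As an illustration of the bit-mask pattern idea above, here is a small sketch of matching a nucleotide pattern against a sequence. The bit assignment (A=0b0001, C=0b0010, G=0b0100, T=0b1000) is an assumption, only a few IUPAC codes are included, and `parse_pattern` and `matches_at` are hypothetical names.

```julia
# Assumed bit assignment, one bit per nucleotide (not fixed by the plan above).
const NT_BIT = Dict('A' => 0b0001, 'C' => 0b0010, 'G' => 0b0100, 'T' => 0b1000)

# A subset of the IUPAC ambiguity codes, translated to masks at parse time.
const IUPAC_MASK = Dict(
    'A' => 0b0001, 'C' => 0b0010, 'G' => 0b0100, 'T' => 0b1000,
    'R' => 0b0101,  # A or G
    'Y' => 0b1010,  # C or T
    'N' => 0b1111,  # any nucleotide
)

# Translate a pattern string into a vector of 4-bit masks.
parse_pattern(s::AbstractString) = [IUPAC_MASK[c] for c in s]

# True if the pattern matches `seq` starting at position `start` (1-based).
function matches_at(pattern::Vector{UInt8}, seq::AbstractString, start::Integer)
    start + length(pattern) - 1 <= length(seq) || return false
    for (k, mask) in enumerate(pattern)
        # A position matches when the nucleotide's bit is set in the mask.
        (mask & NT_BIT[seq[start + k - 1]]) != 0 || return false
    end
    return true
end
```

With this encoding a per-position match is a single AND and a comparison against zero, whatever the ambiguity code was.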
One issue is unusual alphabets: are there legitimate cases where custom alphabets are needed? Maybe I needs to be represented in an A-to-I editing experiment? I haven't made up my mind whether we should support that.