Alignment - Support the Padding operation.

TransGirlCodes commented 8 years ago

TL;DR - Is anything preventing us from supporting CIGAR Pad operations right now?

In my line of work, alignments with software like Clustal generate reference-less alignments. E.g.

Seq1 CGATCA--GACCGATA Seq2 CGATCAGAGACCGATA Seq3 CGATCA-AGACCGATAC

This kind of alignment is important to phylogenetics and evolutionary study. My perception (anyone may disagree) is that this was common when we study gene models, fasta formatted files output from Clustal in evolutionary study, and when we use programs associated with evolutionary study (MEGA, Clustal, Mr Bayes) they all accepted this kind of "Multiple Alignment Fasta" (I'm told MAF means something slightly different to HT Bioinformaticians) representation of alignments. But with HT Sequencing the reference approach became popular and so when I ask people about multiple alignment fasta files - I get weird looks, even when I say "You know - that thing Clustal outputs!". As an evolutionary biologist this multiple aligned fasta format is natural to me - why should one sequence be a reference to the others? They are all independently evolved! (and people are moving away from reference based thinking as pan-genomics grows).

Recently I tried to experiment and adapt the fasta parser for this kind of gapped fasta file, using the alignment types in the align module, but this required padding operations, and I hit an error saying Padding is not supported yet. Why is it not supported, and what is preventing it to be supported at the moment? What needs to be done (I'm asking because I will do some of it!)?

bicycle1885 commented 8 years ago

I agree that we need to be able to parse and load gapped sequences from a FASTA file. I often use a set of aligned amino acid sequences generated from the BLAST search. If we use 4-bit encoding as I proposed in #122, we can reserve a symbol for a gap character; since IUPAC specifies 15 base codes, we can use the last character for gaps.

Multiple alignment can be represented by picking an arbitrary sequence as a reference, and aligning other sequences to it. But this does not handle all sequences equally, hence I don't like the idea. One idea I have is that introducing a pseudo sequence data structure and aligning all sequence to it. For example, assume that we have three sequence (S1, S2, and S3) and they are aligned by a multiple alignment algorithm:

S1: ACT--GT
S2: -CTTTGT
S3: AGTT-GG

Then introduce a pseudo sequence (P):

 P: -------
S1: ACT--GT
S2: -CTTTGT
S3: AGTT-GG

Now we can represent this multiple alignment as a set of three pairwise alignments by setting P as a reference sequence.

TransGirlCodes commented 8 years ago

An additional question - have the indexing variables changed meaning? The variables seqpos and refpos originally stored the end position of an operation, with reference to the source position. I.e. If a sequence unaligned was:

ATCGATCG

But Aligned looked like

ATCG---ATCG

Then the anchors would be: [viewPos: 4, sourcePos: 4, Op: M] [viewPos: 7, sourcePos: 4, Op: D] [viewPos: 12, sourcePos: 8]

The anchor co-ords map the transformation from one view of the sequence (the raw sequence/read) to another (the view of that read/raw sequence when the ops have been applied). Has this changed? Because if you have a sequence with insertions:

julia> aln = Alignment("6=2I8=")
······--········
················

The anchors are:

julia> aln.anchors
4-element Array{Bio.Align.AlignmentAnchor,1}:
 AlignmentAnchor(0, 0, '0')  
 AlignmentAnchor(6, 6, '=')  
 AlignmentAnchor(8, 6, 'I')  
 AlignmentAnchor(16, 14, '=')

To my mind, if the co-ords mapped the transformation of the unaltered sequence to the view of the sequence when aligned, shouldn't the insertion be AlignmentAnchor(8, 8, 'I')? This is what makes me think maybe the meaning of the anchor co-ords has changed.

TransGirlCodes commented 8 years ago

The idea of a pseudo sequence is intriguing, if it works It would definitely be advantageous. But before we do anything, what are the benefits of say using a pseudo sequence as opposed to the cigar P operations? P was designed to capture information as to how insertions and deletions relative to the reference, mapped against the other reads e.g.

Ref: TCA__GAC 1: TCAGAGAC 2: TCA_AGAC 3: TCA__GAC

In seq 1, GA is an insert, in seq 2 A is an insert, and in seq 3 there is no insert, or deletion to the reference.

The usual CIGARs: 1: 3M2I3M 2: 3M1I3M 3: 6M

Do not capture this, they don't capture how deletions and insertions map to one another, using pads it is achieved:

1:3M2I3M 2:3M1P1I3M 3: 3M2P3M

I'm wondering if for MSA, some hybrid approach is good, where we use a pseudo ref - perhaps the consensus? And then we also use P operations.

bicycle1885 commented 8 years ago

An additional question - have the indexing variables changed meaning? The variables seqpos and refpos originally stored the end position of an operation, with reference to the source position.

No, as far as I'm aware.

For "6=2I8=" CIGAR string, you can visually interpret the alignment like this:

     0        1
     1234567890123456
seq: ACGTACGTACGTACGT
     ||||||  ||||||||
ref: ACGTAC--ACGTACGT
     0          1
     1234566678901234

If the insertion was AlignmentAnchor(8, 8, 'I'), 8th character of seq and 8th character of ref should match. But the 8th character (C) of ref is matching the 10th character (C) of seq.

TransGirlCodes commented 8 years ago

I think the meaning has changed then. For every anchor, one variable marks the position in the reference, and one marks the position in the sequence? So it maps co-ordinates between two sequences?

In the Alignment Anchor representation before the redux, it was different. Rather than map the position between an aligned sequence and a reference sequence. You mapped between the unaltered DNA sequence, and what that DNA sequence "looks like" when aligned, co-ordinates were mapped between what was called the "source sequence/space", and the "alignment view/space" (http://seqan.readthedocs.org/en/master/Tutorial/AlignmentRepresentation.html#gap-space-vs-source-space). In other words, say you had the following read:

Unaligned Read: CDFGDC

You aligned it to this reference: Reference: CDEFGAHGC

The alignment looks like this:

Aligned Read: CD_FG__DC
              || ||   |
Ref:          CDEFGAHGC

Now the Redux has happened, the anchors map a position in the Aligned Read to the Ref?

Before the Redux, the anchors mapped a position in "Unaligned Read", to "AlignedRead", e.g. pos 4 in "Unaligned Read" is pos 5 in the "Aligned Read".

This is why I expected the anchor previously to be AlignmentAnchor(8, 8, 'I'), because there is no difference between the aligned read and the unaligned read. I.e:

Unaligned Read: CGATCAGAGACCGATAC

Unaligned Reference: CGATCAGACCGATAC

Aligned:

Read:      CGATCAGAGACCGATAC

Reference: CGATCA__GACCGATAC

You see the Read aligned, looks exactly the same to the read unaligned. So if anchors mapped source positions to gap positions. I'd expect the following anchors:

4-element Array{Bio.Align.AlignmentAnchor,1}:
 AlignmentAnchor(0, 0, '0')  
 AlignmentAnchor(6, 6, '=')  
 AlignmentAnchor(8, 8, 'I')  
 AlignmentAnchor(16, 16, '=')

But if the anchors maps co-ords between a sequence, and it's reference, you'd expect this:

4-element Array{Bio.Align.AlignmentAnchor,1}:
 AlignmentAnchor(0, 0, '0')  
 AlignmentAnchor(6, 6, '=')  
 AlignmentAnchor(8, 6, 'I')  
 AlignmentAnchor(16, 14 '=')

This is why I think the anchors have changed in what they map, before the redux I designed them to map from a sequence's source, to it's alignment view. Not to map a sequence to it's reference.

@dcjones You've read seqan too and how they do this - do you understand where I'm coming from and why I'm trying to clarify exactly what it is the anchors map now?

bicycle1885 commented 8 years ago

I've implemented position mapping between sequences: https://github.com/BioJulia/Bio.jl/commit/2a67e3b67271a129617e1abfd2c9a59d469bf21f.

Is this (part of?) what you want?

TransGirlCodes commented 8 years ago

@bicycle1885 This is a nice addition, and it actually probably serves the purpose of the remaining functionality in my pre-redux branch. But really what I was talking about was just me grappling with understanding the difference between a). Mapping positions of an ungapped, unaligned sequence, with the gapped and aligned version if itself (which is what SeqAn uses anchors for), and b). This module, which uses the anchors to map positions between two different sequences - a reference, and a sequence aligned to said reference. Eventually I resolved this in my mind and helped me understand what was different about the code and why.

I'm very interested in how you did the findanchor binary search with a generated function. I approached this using Julia's sort api, and created two ordering types, allowing to search by refpos or seqpos, but you've done it slightly differently using generated functions. Does this move much logic to compile time vs. dispatching on oerdering types as I initially did? Either way I'm excited to see this use of '@generated'!

TransGirlCodes commented 8 years ago

@bicycle1885 This is a nice addition, and it actually probably serves the purpose of the remaining functionality in my pre-redux branch. But really what I was talking about was just me grappling with understanding the difference between a). Mapping positions of an ungapped, unaligned sequence, with the gapped and aligned version if itself (which is what SeqAn uses anchors for), and b). This module, which uses the anchors to map positions between two different sequences - a reference, and a sequence aligned to said reference. Eventually I resolved this in my mind and helped me understand what was different about the code and why.

I'm very interested in how you did the findanchor binary search with a generated function. I approached this using Julia's sort api, and created two ordering types, allowing to search by refpos or seqpos, but you've done it slightly differently using generated functions. Does this move much logic to compile time vs. dispatching on ordering types as I initially did? Either way I'm excited to see this use of '@generated'!

bicycle1885 commented 8 years ago

Alternative operations between OP_MATCH and OP_DELETE can be regarded as a gapped string, at least operationally. For example, ACG--TA can be represented as:

julia> using Bio.Align

julia> using Bio.Seq

# intact, unaligned sequence
julia> seq = dna"ACGTA"
5nt DNA Sequence
 ACGTA

# now this is a gapped sequence
julia> anlseq = AlignedSequence(seq, Alignment([AlignmentAnchor(0, 0, OP_START),
                                                AlignmentAnchor(3, 3, OP_MATCH),
                                                AlignmentAnchor(3, 5, OP_DELETE),
                                                AlignmentAnchor(5, 7, OP_MATCH)]))
·······
ACG--TA

Now you can map sequence positions between each other:

# unaligned => aligned
julia> [seq2ref(i, alnseq) for i in 1:5]
5-element Array{Tuple{Int64,Bio.Align.Operation},1}:
 (1,M)
 (2,M)
 (3,M)
 (6,M)
 (7,M)

# aligned => unaligned
julia> [ref2seq(i, alnseq) for i in 1:7]
7-element Array{Tuple{Int64,Bio.Align.Operation},1}:
 (1,M)
 (2,M)
 (3,M)
 (3,D)
 (3,D)
 (4,M)
 (5,M)

~~Though this may be an abuse of OP_DELETE~~(here I should have used the padding operation, right?), I think this can create gapped sequences like SeqAn.

I found @generated functions can remove boilerplate code without sacrificing performance because it can generate any expression just before runtime. I also think this would be a good example of @generated function.

TransGirlCodes commented 8 years ago

@bicycle1885 In the context of MSA, the gap could be justified as a delete operation, but pad is probably more technically accurate given it's intended purpose. Although delete is probably ok to use given that at the moment, constructing alignments with a pad leads to the message saying pads are not supported.

This is very cool :D I don't know if this, or the previous idea about a pseudo-reference is more efficient for certain operations. But I'm willing to start experimenting with the idea of MSA's next week, as it's required for a project I'm helping out on. I can base my work on your branch.

BioJulia / Bio.jl

Alignment - Support the Padding operation. #124