Closed recursion-ninja closed 4 years ago
We'll need to encode an AlignmentContext
value in each of the DynamicCharacter
streams whch replace the current DynamicCharacterElement
values.
Something like this:
data AlignmentContext
= Align x y z
| Delete x z
| Insert y z
However, this representation will need to be "bit-packed" into the dynamic character stream so that it can be represented by a BitMatrix
. This can be done by using 2 prefix bits to uniquely identify which context the element is from, and then 3x the number of bits for a single DynamicCharacterElement
to store x
, y
, & z
. Note that in the Delete
and Insert
value representations, the y
and z
values will be represented by a "gap" character, respectively.
This representation will take up three times the memory for each dynamic character stream. However, by using this representation we can avoid performing any alignment on the pre-order pass of the tree and instead use a stream processing function to generate the final states. Empirical evidence shows that this results in an asymptotic speed up, resulting in several orders of magnitude less work.
Note: We should be able coerce the output from the C code's FFI to this format by zipping over the three aligned sequences and interpreting their results accordingly.
Note: Turns out that we cannot coerce the output from the C code's FFI to this format. There are alignments which return sequences from which it cannot not be inferred which output elements correspond to which input elements.
The implied alignment pre-order code has been added. See this commit: fd49df4f20c2a9cfabc0c1f7ec944b4cdbf494e6.
This will be merged into master
after more work has been done optimizing the dynamic character post-order code. The hope is that the optimized Haskell code will run comparable to the C code, and the issue of coercing the data across the FFI will be side stepped.
As you feared.
I came up with a solution to converting the C results to our Haskell data type. I had to modify the backtrace
code and the get_medians
code on the C side.
In the backtrace
section, instead of adding a gap_char
on the traceback when a leftward or upward arrow was encountered, we instead added a '\0'
value (no bits set). Since gap_char
has one bit set at the gap character index and '\0'
has no bits set, we can distinguish between "input gaps" and "output gaps" across the FFI.
The get_medians
section had to be modified as well, to not look for gap_char
in the alignments but instead look for '\0'
and then correctly generate the median for sigma(x, -)
when (x, '\0')
was encountered.
The C code is now compatible with the Haskell data-types and the C pairwise alignment returns the same results as the Haskell pairwise alignments.
Sweet
Closing as the implied alignment algorithm has been added (with a defect #166).
Implement implied alignment algorithm. It's already implement with the old data types. Some type changes are required and simplifying the logic for a post order & pre-order traversal.