amnh / PCG

𝙋𝙝𝙮𝙡𝙤𝙜𝙚𝙣𝙚𝙩𝙞𝙘 𝘾𝙤𝙢𝙥𝙤𝙣𝙚𝙣𝙩 𝙂𝙧𝙖𝙥𝙝 ⸺ Haskell program and libraries for general phylogenetic graph search
28 stars 1 forks source link

Implied alignment implementation #64

Closed recursion-ninja closed 4 years ago

recursion-ninja commented 6 years ago

Implement implied alignment algorithm. It's already implement with the old data types. Some type changes are required and simplifying the logic for a post order & pre-order traversal.

recursion-ninja commented 5 years ago

We'll need to encode an AlignmentContext value in each of the DynamicCharacter streams whch replace the current DynamicCharacterElement values.

Something like this:

data  AlignmentContext
    = Align  x y z
    | Delete x   z
    | Insert   y z

However, this representation will need to be "bit-packed" into the dynamic character stream so that it can be represented by a BitMatrix. This can be done by using 2 prefix bits to uniquely identify which context the element is from, and then 3x the number of bits for a single DynamicCharacterElement to store x, y, & z. Note that in the Delete and Insert value representations, the y and z values will be represented by a "gap" character, respectively.

This representation will take up three times the memory for each dynamic character stream. However, by using this representation we can avoid performing any alignment on the pre-order pass of the tree and instead use a stream processing function to generate the final states. Empirical evidence shows that this results in an asymptotic speed up, resulting in several orders of magnitude less work.

Note: We should be able coerce the output from the C code's FFI to this format by zipping over the three aligned sequences and interpreting their results accordingly.

recursion-ninja commented 4 years ago

Note: Turns out that we cannot coerce the output from the C code's FFI to this format. There are alignments which return sequences from which it cannot not be inferred which output elements correspond to which input elements.

The implied alignment pre-order code has been added. See this commit: fd49df4f20c2a9cfabc0c1f7ec944b4cdbf494e6.

This will be merged into master after more work has been done optimizing the dynamic character post-order code. The hope is that the optimized Haskell code will run comparable to the C code, and the issue of coercing the data across the FFI will be side stepped.

wardwheeler commented 4 years ago

As you feared.

recursion-ninja commented 4 years ago

I came up with a solution to converting the C results to our Haskell data type. I had to modify the backtrace code and the get_medians code on the C side.

In the backtrace section, instead of adding a gap_char on the traceback when a leftward or upward arrow was encountered, we instead added a '\0' value (no bits set). Since gap_char has one bit set at the gap character index and '\0' has no bits set, we can distinguish between "input gaps" and "output gaps" across the FFI.

The get_medians section had to be modified as well, to not look for gap_char in the alignments but instead look for '\0' and then correctly generate the median for sigma(x, -) when (x, '\0') was encountered.

The C code is now compatible with the Haskell data-types and the C pairwise alignment returns the same results as the Haskell pairwise alignments.

wardwheeler commented 4 years ago

Sweet

recursion-ninja commented 4 years ago

Closing as the implied alignment algorithm has been added (with a defect #166).