Open recursion-ninja opened 5 years ago
See this commit 9271b8ed47fffcfe663058811d71d6b8e50344d6 where I removed the old, unsafe IO version.
There is a branch haskell-function-memoize
in which I added an initial HashTable
, ST
, and TVar
implementation. It currently doesn't compile because of a type error involving the s
state type in the ST
monad. I'm hoping @Boarders can help me look at this and see if it can be rectified.
I switched to an IO
based implementation. It kinda seems to work... in a finicky way. I'll try shoving this in our codebase and measure the performance against the FFI-based memoized TCM.
The experiment was inconclusive. There is a fold over the bit vectors (DynamicCharacterElement
) to perform the pairwise or threeway Sankoff algorithm for the cost and median that takes up all the time when using the native Haskell HashTable
embedded in IO
and a TVar
. Optimizing the fold over the bit vector to allocate far less memory will be required for this to be viable.
We are de-prioritizing this fold optimization for the time being. This should be revisited at a later date. See the overlap
, overlap2
, & overlap3
functions in 971d8cf74e584ebd80ab9084f84074e09423b4f0.
After completing #141, we can use the benchmarks to quantify if this Haskell memoized TCM is better for usage with string-alignment median lookups.
We have moderate confidence that the memoize
function is working as intended and is thread safe.
However, the memoized function is not being retained and reapplied between string alignment functions. The consequence is that the memoization work is discarded between each string alignment.
There may be a technique to remedy this. The hypothesis is that because we define the pairwise "overlap" function though a Getter
that special cases a MetricRepresentation
value, it re-creates the memoized function on each call to the getter even though the input and output are the same.
We are considering embedding the memoized function into the MetricRepresentation
like so:
data DynamicCharacterMetadataDec c
= DynamicCharacterMetadataDec
{ optimalTraversalFoci :: !(Maybe TraversalFoci)
, structuralRepresentationTCM :: !(Either
(DenseTransitionCostMatrix, MetricRepresentation ())
(MetricRepresentation (c -> c -> (c,Word))) -- ^ Changes here
)
, metadata :: {-# UNPACK #-} !DiscreteCharacterMetadataDec
} deriving (Generic, NFData)
Because this would involve having a function in the "metadata sequence," we could no longer utilize compact regions. This is a potentially large downside.
The upside is that, if successful, we would not require a C++ FFI binding, could simplify the Exportable
type-class, and further modularize the sub-library build architecture.
This hypothesis would constitute a fairly decent effort to test.
We should also look into using concurrent-hashtable
.
I have added the memoized functions to the metadata and excised the usage of compact region from the code. However, this has broken the SAVE
and LOAD
commands, due to these using compact regions to generate the save states.
We are switching to Binary
instances from the binary
package to substitute the functionality. Unfortunately this requires writing a lot of Binary
instances for almost every data-type in our codebase (either explicitly or through a deriving
mechanism). It constitutes to a fair amount of mechanical, straightforward work.
Experiment with ST, TVars, and HashTable libraries to implement a Haskell-side memoization library for Hashable hashable structures. See if this is more efficient than the FFI binding to C++.
We want a type signature that looks like this:
(Eq a, Hashable a) => (a -> b) -> a -> b