amnh / PCG

𝙋𝙝𝙮𝙡𝙤𝙜𝙚𝙣𝙚𝙩𝙞𝙘 𝘾𝙤𝙢𝙥𝙤𝙣𝙚𝙣𝙩 𝙂𝙧𝙖𝙥𝙝 ⸺ Haskell program and libraries for general phylogenetic graph search
28 stars 1 forks source link

Experiment with Haskell-side, hashable memoized function(s) #100

Open recursion-ninja opened 5 years ago

recursion-ninja commented 5 years ago

Experiment with ST, TVars, and HashTable libraries to implement a Haskell-side memoization library for Hashable hashable structures. See if this is more efficient than the FFI binding to C++.

We want a type signature that looks like this: (Eq a, Hashable a) => (a -> b) -> a -> b

recursion-ninja commented 5 years ago

See this commit 9271b8ed47fffcfe663058811d71d6b8e50344d6 where I removed the old, unsafe IO version.

recursion-ninja commented 5 years ago

There is a branch haskell-function-memoize in which I added an initial HashTable, ST, and TVar implementation. It currently doesn't compile because of a type error involving the s state type in the ST monad. I'm hoping @Boarders can help me look at this and see if it can be rectified.

recursion-ninja commented 5 years ago

I switched to an IO based implementation. It kinda seems to work... in a finicky way. I'll try shoving this in our codebase and measure the performance against the FFI-based memoized TCM.

recursion-ninja commented 5 years ago

The experiment was inconclusive. There is a fold over the bit vectors (DynamicCharacterElement) to perform the pairwise or threeway Sankoff algorithm for the cost and median that takes up all the time when using the native Haskell HashTable embedded in IO and a TVar. Optimizing the fold over the bit vector to allocate far less memory will be required for this to be viable.

We are de-prioritizing this fold optimization for the time being. This should be revisited at a later date. See the overlap, overlap2, & overlap3 functions in 971d8cf74e584ebd80ab9084f84074e09423b4f0.

recursion-ninja commented 5 years ago

After completing #141, we can use the benchmarks to quantify if this Haskell memoized TCM is better for usage with string-alignment median lookups.

recursion-ninja commented 5 years ago

We have moderate confidence that the memoize function is working as intended and is thread safe.

However, the memoized function is not being retained and reapplied between string alignment functions. The consequence is that the memoization work is discarded between each string alignment.

There may be a technique to remedy this. The hypothesis is that because we define the pairwise "overlap" function though a Getter that special cases a MetricRepresentation value, it re-creates the memoized function on each call to the getter even though the input and output are the same. We are considering embedding the memoized function into the MetricRepresentation like so:

data DynamicCharacterMetadataDec c                                                                       
   = DynamicCharacterMetadataDec                                                                         
   { optimalTraversalFoci        :: !(Maybe TraversalFoci)                                               
   , structuralRepresentationTCM :: !(Either                                                             
                                        (DenseTransitionCostMatrix, MetricRepresentation ())             
                                        (MetricRepresentation (c -> c -> (c,Word))) -- ^ Changes here                       
                                     )                                                                   
   , metadata                    :: {-# UNPACK #-} !DiscreteCharacterMetadataDec                         
   } deriving (Generic, NFData) 

Because this would involve having a function in the "metadata sequence," we could no longer utilize compact regions. This is a potentially large downside.

The upside is that, if successful, we would not require a C++ FFI binding, could simplify the Exportable type-class, and further modularize the sub-library build architecture.

This hypothesis would constitute a fairly decent effort to test.

recursion-ninja commented 4 years ago

We should also look into using concurrent-hashtable.

recursion-ninja commented 4 years ago

I have added the memoized functions to the metadata and excised the usage of compact region from the code. However, this has broken the SAVE and LOAD commands, due to these using compact regions to generate the save states.

We are switching to Binary instances from the binary package to substitute the functionality. Unfortunately this requires writing a lot of Binary instances for almost every data-type in our codebase (either explicitly or through a deriving mechanism). It constitutes to a fair amount of mechanical, straightforward work.