amnh / PCG

๐™‹๐™๐™ฎ๐™ก๐™ค๐™œ๐™š๐™ฃ๐™š๐™ฉ๐™ž๐™˜ ๐˜พ๐™ค๐™ข๐™ฅ๐™ค๐™ฃ๐™š๐™ฃ๐™ฉ ๐™‚๐™ง๐™–๐™ฅ๐™ โธบ Haskell program and libraries for general phylogenetic graph search
28 stars 1 forks source link

Reimplement topological representation #86

Open recursion-ninja opened 5 years ago

recursion-ninja commented 5 years ago

The topological representation should be updated to better represent the phylogenetic forest. Currently we represent phylogenetic DAGs in a newtyped non-empty list to represent a phylogenetic forest. This make it difficult to represent data structure invariant that apply to a phylogenetic forest. Specifically, that each leaf should appear in the phylogenetic forest exactly once. Instead we should make the phylogenetic forest the "primitive" representation.

We should have the leaf nodes places in indices [0..n-1] of the vector with internal nodes occupying indices [n..len-1]. Once allocated, the leaf nodes should never have their indices altered. This will allow for more efficient bit-vector representation in the resolution cache and extracting the leaf set, amongst other things.

This representation, where the phylogenetic forest is the primitive representation, will allow for more natural implementation of breaking a phylogenetic forest "component" on an edge, since the result will necessarily be a forest. It will also allow for more natural calculation of multi rooting cost.

A lot of traversal code will need to be re-implemented during this change. Afterwards, the interface should be stable and highlevel. There might be more opportunities for efficiency gains by using the foldl or monad-memo libraries in the tree traversals to build the updated vector and character matrix when using this representation.

Boarders commented 5 years ago

One thing that came up when writing the code to generate tests for candidate network edges (in this issue: https://github.com/amnh/PCG/issues/105) is some of the pain in re-indexing problems. If we had two networks N0 and N1 and wished to form the new network

                                o
                               /  \
                              /    \
                            N0     N1

Then in the above indexing scheme we would need to re-index the leaf nodes in N1 by the offset of the number of leaf nodes in N0 and then to re-index the internal nodes in N1 by number of total nodes in N0 (or via some other scheme). Given that very often we will wish to take a subtree/subgraph and perform work on it and then re-insert it into a larger graph it is worth thinking about a good way to represent this situation. For instance we could have something like:

        ReferenceVector
          { references :: {-# UNPACK #-} !(Vector (IndexData e n))
          , offset     :: Int
          }

[ or even:

data ReferenceVector = 
        ReferenceVector
          { internalReferences :: {-# UNPACK #-} !(Vector (IndexData e n))
          , internalOffset     :: Int
          , leafReferences     :: {-# UNPACK #-} !(Vector (IndexData e n))
          , leafOffset         :: Int
          }

] Then we would need a custom Indexable instance which returns for instance a leaf node if we index into an offset vector by something in [0...n-1]. I'm not sure if this is quite the correct approach but something like this is worth thinking about.

recursion-ninja commented 5 years ago

After a preorder pass, we assign "resolution cache" values to each node, and each resolution cache has a single character sequences in it. We were doing this because PhylogeneticNode requires a resolution cache and we didn't want to create a different node and DAG type for the post order and preorder passes.

However, @Boarders had a great idea of useing a Functor parameter for the DAG. The Functor on the postorder pass can be a NonEmpty and on the preorder pass it can be collapsed to Identity.

We should consider this technique in conjunction with TypeFamilies to simplify the type machinery surrounding the traversals over the graph.

recursion-ninja commented 5 years ago

Lets try and pull Bio.Graph.Constructions out of pcg-core:pcg-data-strcutures and into it's own sub-library or, even better, only defined and consumed in app/pcg/.

recursion-ninja commented 5 years ago

We should strongly consider representing the leaf nodes and the internal nodes in separate vectors with separate type parameters. This will allow us to express in the type system that the internal nodes may have different data annotations than the leaf nodes. This is often the case.

If the internal nodes are represented independently of the leaf nodes, we could use TypeFamily application and a parameterized Functor to represent the variance of the internal nodes. For example, Void for a graph that was just read in, ResolutionCache for the postorder result, and Identity for the preorder result.

recursion-ninja commented 5 years ago

Consider leveraging the Ix typeclass and Array the package when interacting with leaf, vertex, (root?) vectors as separate or combined objects.

recursion-ninja commented 5 years ago

I believe that we want to have separate internal node data and leaf data values.

This would mean making the graph object a Functor, Applicative, and Monad with respect to the leaf data and it would be a Bifunctor and Biappliacative in the internal nodes.

We should explore BiFoldable and Bitraversable a well.

G m f e n t

We want G to be a monad transformer, with the monad m.

We will use the "structural functor" f to handle the "resolution caches" internally to the graph so that these structures don't pollute the data variables e, n, or t. @Boarders and I discussed Identity for initial graphs and result graphs, and NonEmpty for the post order & pre-order to represent the resolution caches.

e The edge data wouldn't be tied to any core Haskell type-class.

We want n to represent the internal node data. This includes the root nodes, "tree nodes," and "network nodes." It excludes the terminal (leaf) nodes. It should be part of Bifunctor, Biapplicative, Bifoldable, and Bitraversable if at all possible.

We want t to represent the terminal (leaf) node data. It should be part of the standard Functor, Applicative, Monad, and maybe MonadZip or MonadFix.

A rough idea of how the parse result, post-order, pre-order, and final results might look:

parseResult :: G m Identity () LeafDec

postOrderResult :: G m NonEmpty PostDec LeafDec

finalResult :: G m Identity FinalDec FinalLeaf

postorder 
  :: (a -> g b) 
  -> (b -> b -> b) 
  -> G m f c a
  -> G m g b a

preorder 
  :: Comonad f
  => (b -> d) 
  -> (d -> b -> d)
  -> (d -> a -> c)
  -> G m f b a
  -> G m Identity d c