Open recursion-ninja opened 5 years ago
One thing that came up when writing the code to generate tests for candidate network edges (in this issue: https://github.com/amnh/PCG/issues/105) is some of the pain in re-indexing problems. If we had two networks N0 and N1 and wished to form the new network
o
/ \
/ \
N0 N1
Then in the above indexing scheme we would need to re-index the leaf nodes in N1 by the offset of the number of leaf nodes in N0 and then to re-index the internal nodes in N1 by number of total nodes in N0 (or via some other scheme). Given that very often we will wish to take a subtree/subgraph and perform work on it and then re-insert it into a larger graph it is worth thinking about a good way to represent this situation. For instance we could have something like:
ReferenceVector
{ references :: {-# UNPACK #-} !(Vector (IndexData e n))
, offset :: Int
}
[ or even:
data ReferenceVector =
ReferenceVector
{ internalReferences :: {-# UNPACK #-} !(Vector (IndexData e n))
, internalOffset :: Int
, leafReferences :: {-# UNPACK #-} !(Vector (IndexData e n))
, leafOffset :: Int
}
]
Then we would need a custom Indexable
instance which returns for instance a leaf node if we index into an offset vector by something in [0...n-1]
. I'm not sure if this is quite the correct approach but something like this is worth thinking about.
After a preorder pass, we assign "resolution cache" values to each node, and each resolution cache has a single character sequences in it. We were doing this because PhylogeneticNode
requires a resolution cache and we didn't want to create a different node and DAG type for the post order and preorder passes.
However, @Boarders had a great idea of useing a Functor
parameter for the DAG. The Functor
on the postorder pass can be a NonEmpty
and on the preorder pass it can be collapsed to Identity
.
We should consider this technique in conjunction with TypeFamilies
to simplify the type machinery surrounding the traversals over the graph.
Lets try and pull Bio.Graph.Constructions
out of pcg-core:pcg-data-strcutures
and into it's own sub-library or, even better, only defined and consumed in app/pcg/
.
We should strongly consider representing the leaf nodes and the internal nodes in separate vectors with separate type parameters. This will allow us to express in the type system that the internal nodes may have different data annotations than the leaf nodes. This is often the case.
Upon reading data in from files, the internal nodes have no data annotations (type: ()
) and the leaf nodes do have data derived from the file inputs.
After a postorder pass, the internal nodes will contain ResolutionCache
values and the leaf nodes, while they could be fitted with singleton ResolutionCache
values, they could also be represented as being just the leaf node decoration with no cache annotations.
After a preorder pass, the internal nodes no longer have ResolutionCache
values. These values are collapsed based on the optimal display tree for each block. The will have "raw" node decorations.
If the internal nodes are represented independently of the leaf nodes, we could use TypeFamily
application and a parameterized Functor
to represent the variance of the internal nodes. For example, Void
for a graph that was just read in, ResolutionCache
for the postorder result, and Identity
for the preorder result.
Consider leveraging the Ix
typeclass and Array
the package when interacting with leaf, vertex, (root?) vectors as separate or combined objects.
I believe that we want to have separate internal node data and leaf data values.
This would mean making the graph object a Functor
, Applicative
, and Monad
with respect to the leaf data and it would be a Bifunctor
and Biappliacative
in the internal nodes.
We should explore BiFoldable
and Bitraversable
a well.
G m f e n t
G
is the typem
is the embedded monad for monad transformerf
is the "structural functor" of the graphe
is the edge datumn
is the internal node data, this is the Bifunctor
pametert
is the terminal node data, this is the Functor
parameterWe want G
to be a monad transformer, with the monad m
.
We will use the "structural functor" f
to handle the "resolution caches" internally to the graph so that these structures don't pollute the data variables e
, n
, or t
. @Boarders and I discussed Identity
for initial graphs and result graphs, and NonEmpty
for the post order & pre-order to represent the resolution caches.
e
The edge data wouldn't be tied to any core Haskell type-class.
We want n
to represent the internal node data. This includes the root nodes, "tree nodes," and "network nodes." It excludes the terminal (leaf) nodes. It should be part of Bifunctor
, Biapplicative
, Bifoldable
, and Bitraversable
if at all possible.
We want t
to represent the terminal (leaf) node data. It should be part of the standard Functor
, Applicative
, Monad
, and maybe MonadZip
or MonadFix
.
A rough idea of how the parse result, post-order, pre-order, and final results might look:
parseResult :: G m Identity () LeafDec
postOrderResult :: G m NonEmpty PostDec LeafDec
finalResult :: G m Identity FinalDec FinalLeaf
postorder
:: (a -> g b)
-> (b -> b -> b)
-> G m f c a
-> G m g b a
preorder
:: Comonad f
=> (b -> d)
-> (d -> b -> d)
-> (d -> a -> c)
-> G m f b a
-> G m Identity d c
The topological representation should be updated to better represent the phylogenetic forest. Currently we represent phylogenetic DAGs in a newtyped non-empty list to represent a phylogenetic forest. This make it difficult to represent data structure invariant that apply to a phylogenetic forest. Specifically, that each leaf should appear in the phylogenetic forest exactly once. Instead we should make the phylogenetic forest the "primitive" representation.
We should have the leaf nodes places in indices
[0..n-1]
of the vector with internal nodes occupying indices[n..len-1]
. Once allocated, the leaf nodes should never have their indices altered. This will allow for more efficient bit-vector representation in the resolution cache and extracting the leaf set, amongst other things.This representation, where the phylogenetic forest is the primitive representation, will allow for more natural implementation of breaking a phylogenetic forest "component" on an edge, since the result will necessarily be a forest. It will also allow for more natural calculation of multi rooting cost.
A lot of traversal code will need to be re-implemented during this change. Afterwards, the interface should be stable and highlevel. There might be more opportunities for efficiency gains by using the
foldl
ormonad-memo
libraries in the tree traversals to build the updated vector and character matrix when using this representation.