Closed colganwi closed 1 year ago
Benchmarking results
Oh and one more comment -- we should make sure to merge in the latest changes to the master
branch, as the pandas typing/ILP error appears to still be a problem in running the tests.
Patch coverage: 39.74% and project coverage change: -0.44% :warning:
Comparison is base (f895301) 79.05% compared to head (7f96e86) 78.62%.
Unfortunately, I implemented @mattjones315's suggestion of removing the CCPhyloSolver class (c0c5ca926624067026e26d5483e49a18b3bc3f80) before seeing @sprillo's comments. This commit addresses the issue of a future "even_faster_implementation" by changing CCPhyloSolver._fast_solve to DistanceSolver._ccphylo_solve. The fast solver is then selected in DistanceSolver.__init__, which would allow other fast solver implementations.
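For anyone following along, here is a minimal sketch of what selecting the fast solver in the constructor could look like. SketchDistanceSolver, the fast and implementation arguments, and the placeholder return values are all hypothetical and only illustrate the dispatch; this is not the actual Cassiopeia code.

```python
class SketchDistanceSolver:
    """Hypothetical stand-in for DistanceSolver, used only to show dispatch."""

    def __init__(self, fast=False, implementation="ccphylo"):
        # Registry of fast back ends; a future implementation would be
        # added here without touching solve().
        fast_solvers = {"ccphylo": self._ccphylo_solve}
        if not fast:
            self._solve_impl = self._generic_solve
        elif implementation in fast_solvers:
            self._solve_impl = fast_solvers[implementation]
        else:
            raise ValueError(f"Unknown fast implementation: {implementation!r}")

    def solve(self, dissimilarity_map):
        # Callers always go through solve(); the back end was fixed in __init__.
        return self._solve_impl(dissimilarity_map)

    def _generic_solve(self, dissimilarity_map):
        # Placeholder for the existing pure-Python solve path.
        return ("generic", dissimilarity_map)

    def _ccphylo_solve(self, dissimilarity_map):
        # Placeholder for the path that shells out to CCPhylo.
        return ("ccphylo", dissimilarity_map)
```

Because the choice is made once in the constructor, adding another fast back end would only mean adding an entry to the registry.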
But now that I've read @sprillo's comment, I think his solution is cleaner since it doesn't require modifications to the DistanceSolver class and would allow for CCPhylo methods such as K-means Closest First to be run without creating a dedicated KMeansClosestFirstSolver subclass of the DistanceSolver. However, one issue with this approach is rooting the tree. Currently, the CCPhyloSolver versions of UPGMA and NJ use different rooting strategies to ensure that the results are the same as the existing implementations. It would be redundant to also put these rooting strategies into the CCPhyloSolver class. Of course, the obvious solution to this is to make the rooting functions compositional, but this will require some work/reorganization. Do we want to make this change to how the solvers are organized?
@sprillo I completely agree that the current architecture of the solvers, particularly the DistanceSolver, is not ideal. The main reason I built the CCPhyloSolver on top of the DistanceSolver was to take advantage of the logic for rooting trees and calculating dissimilarity maps. Let's plan to address this when we refactor the solvers. We could either create a more generic DistanceSolver or, with a sufficiently compositional framework, write a CCPhyloSolver that does not inherit from DistanceSolver. When @mattjones315 opens the issue we can have a more in-depth conversation there.
@mattjones315 thanks for the additional comments. I'm going to add two more commits, one addressing the comments and adding tests, and one with a few new features including multithreading and subclasses for the other solvers implemented by CCPhylo. Once those commits are reviewed we can merge.
Hi @colganwi, sounds like a great plan. Looking forward to those commits.
I've now opened an issue referencing our thoughts on refactoring the DistanceSolver and the solver module in general, which you can find in #214. William, once you accept the invitation to join the Cassiopeia repo as a collaborator I'll add you to the list of assignees (currently just me and Sebastian).
Should be good to go. Since we are planning to refactor, I've minimized changes to the solver class structure. The DNJ and HNJ classes have been removed in favor of an implementation parameter in the NeighborJoiningSolver constructor, and the DistanceSolver constructor is now unchanged. One advantage of this approach is that CCPhylo DNJ is now the default fast implementation of NeighborJoining.
I've also decided to table adding multithreading and classes for other CCPhylo solvers for now to minimize work when we refactor. If we bring back the CCPhyloSolver class these options can be built into it.
The C implementations of UPGMA and NJ in CCPhylo are significantly faster than the current Cassiopeia implementations of these algorithms. This update adds a CCPhyloSolver subclass of DistanceSolver which writes the dissimilarity matrix to a temp file and then calls ccphylo tree to solve the tree with the specified method. NeighborJoiningSolver and UPGMASolver are now subclasses of CCPhyloSolver, so when fast is set to true they use CCPhyloSolver._fast_solve() instead of DistanceSolver.solve(). CCPhylo also includes two new algorithms, Dynamic and Heuristic NJ, which are significantly faster than standard NJ. DynamicNeighborJoiningSolver and HeuristicNeighborJoiningSolver are subclasses of NeighborJoiningSolver which implement these algorithms. In most cases DNJ should be used since it is guaranteed to generate an exact NJ tree.
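The temp-file round trip described above can be sketched roughly as follows. This is not the code in the PR: the function names are hypothetical, the PHYLIP-style layout is an assumption about what the tool accepts, and the -i/-o/-m flags in ccphylo_command are assumed and should be checked against ccphylo tree -h.

```python
import os
import subprocess
import tempfile


def write_phylip(names, matrix, path):
    """Write a square dissimilarity matrix in a PHYLIP-style layout (assumed)."""
    with open(path, "w") as handle:
        handle.write(f"{len(names)}\n")
        for name, row in zip(names, matrix):
            values = "\t".join(f"{value:.6f}" for value in row)
            handle.write(f"{name}\t{values}\n")


def solve_with_external_tool(names, matrix, build_command):
    """Write the matrix to a temp dir, run the tool, and return its Newick output.

    `build_command(matrix_path, tree_path)` must return the argument list to run.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        matrix_path = os.path.join(tmpdir, "dissimilarity.phylip")
        tree_path = os.path.join(tmpdir, "tree.nwk")
        write_phylip(names, matrix, matrix_path)
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(build_command(matrix_path, tree_path), check=True)
        with open(tree_path) as handle:
            return handle.read().strip()


def ccphylo_command(method):
    """Build a `ccphylo tree` invocation; the flags here are an assumption."""

    def build(matrix_path, tree_path):
        return ["ccphylo", "tree", "-i", matrix_path, "-o", tree_path, "-m", method]

    return build
```

Passing the command builder in as an argument keeps the file handling testable without CCPhylo installed, and the temporary directory is removed once the Newick string has been read.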