Fast UPGMA and NJ with CCPhylo

colganwi commented 1 year ago

The C implementations of UPGMA and NJ in CCPhylo are significantly faster than the current Cassiopeia implementations of these algorithms. This update adds a CCPhyloSolver subclass of DistanceSolver which writes the dissimilarity matrix to a temp file and then calls ccphlyo tree to solve the tree with the specified method. NeighborJoiningSolver and UPGMASolver are now subclasses of CCPhyloSolver so when fast is set to true they use CCPhyloSolver._fast_solve() instead of DistanceSolver.solve(). CCPhylo also includes two new algorithms Dynamic and Heuristic NJ which are significantly faster than standard NJ. DynamicNeighborJoiningSolver and HeuristicNeighborJoiningSolver are subclasses of NeighborJoiningSolver which implement these algorithms. In most cases DNJ should be used since it is guaranteed to generate an exact NJ tree.

colganwi commented 1 year ago

Benchmarking results

benchmarking

mattjones315 commented 1 year ago

Oh and one more comment -- we should make sure to merge in the latest changes to the master branch, as the pandas typing/ILP error appears to still be a problem in running the tests.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 39.74% and project coverage change: -0.44% :warning:

Comparison is base (f895301) 79.05% compared to head (7f96e86) 78.62%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #205 +/- ## ========================================== - Coverage 79.05% 78.62% -0.44% ========================================== Files 85 85 Lines 7080 7157 +77 ========================================== + Hits 5597 5627 +30 - Misses 1483 1530 +47 ``` | [Files Changed](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab) | Coverage Δ | | |---|---|---| | [cassiopeia/solver/DistanceSolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvRGlzdGFuY2VTb2x2ZXIucHk=) | `60.52% <25.45%> (-32.81%)` | :arrow_down: | | [cassiopeia/solver/NeighborJoiningSolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvTmVpZ2hib3JKb2luaW5nU29sdmVyLnB5) | `70.88% <50.00%> (-1.72%)` | :arrow_down: | | [cassiopeia/solver/UPGMASolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvVVBHTUFTb2x2ZXIucHk=) | `73.21% <50.00%> (-2.79%)` | :arrow_down: | | [cassiopeia/solver/SpectralNeighborJoiningSolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvU3BlY3RyYWxOZWlnaGJvckpvaW5pbmdTb2x2ZXIucHk=) | `98.92% <100.00%> (+0.01%)` | :arrow_up: | | [cassiopeia/solver/solver\_utilities.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvc29sdmVyX3V0aWxpdGllcy5weQ==) | `86.00% <100.00%> (+3.50%)` | :arrow_up: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

colganwi commented 1 year ago

Unfortunately, I implemented @mattjones315's suggestion of removing the CCPhyloSolver class (c0c5ca926624067026e26d5483e49a18b3bc3f80) before seeing @sprillo's comments. This commit addresses the issue of a future "even_faster_implementation" by changing CCPhyloSolver._fast_solve to DistanceSolver._ccphylo_solve. The fast solver is then selected in DistanceSolver.__init__ which would allow other fast solver implementations.

But now that I've read @sprillo's comment I think his solution is cleaner since it doesn't require modifications to the DistanceSolver class and would allow for CCPhylo methods such as K-means Closest First to be run without creating a dedicated KMeansClosestFirstSolver subclass of the DistanceSolver. However, one issue with this approach is rooting the tree. Currently, the CCPhyloSolver versions of UPGMA and NJ use different rooting strategies to ensure that the results are the same as the existing implementations. It would be redundant to also put these rooting strategies into the CCPhyloSolver class. Of course, the obvious solution to this is to make the rooting functions compositional but this will require some work/reorganization. Do we want to make this change to how the solvers are organized?

colganwi commented 1 year ago

@sprillo I completely agree that the current architecture of the solvers, particularly the DistanceSolver, is not ideal. The main reason I built the CCPhyloSolver on top the DistanceSolver was to take advantage of the logic for rooting trees and calculating dissimilarity maps. Let's plan to address this when we refactor the solvers. We could either create a more generic distance DistanceSolver, or with a sufficiently compositional framework, write a CCPhyloSolver that does not inherent from DistanceSolver. When @mattjones315 opens the issue we can have a more in depth conversation there.

@mattjones315 thanks for the additional comments. I'm going to add two more commits, one addressing the comments and adding tests, and one with a few new features including multithreading and subclasses for the other solvers implemented by CCPhylo. Once those commits are reviewed we can merge.

mattjones315 commented 1 year ago

Hi @colganwi, sounds like a great plan. Looking forward to those commits.

I've now opened an issue referencing our thoughts on refactoring the DistanceSolver and the solver module in general which you can find in #214. William, once you accept the invitation to join the Cassiopeia repo as a collaborator I'll add you to the list of assignees (currently just me and Sebastian).

colganwi commented 1 year ago

Should be good to go. Since we are planning to refactor I've minimized changes to the solver class structure. The DNJ and HNJ classes have been removed in favor of an implementation parameter in the NeighborJoiningSolver constructor and the DistanceSolver constructor is now unchanged. One advantage of this approach is that CCPhylo DNJ is now the default fast implementation of NeighborJoining.

I've also decided to table adding multithreading and classes for other CCPhylo solvers for now to minimize work when we refactor. If we bring back the CCPhyloSolver class these options can be built into it.

YosefLab / Cassiopeia

Fast UPGMA and NJ with CCPhylo #205

Codecov Report