Ambiguous greedy & parallel dissimilarity computation

mattjones315 commented 11 months ago

This PR implements two important changes.

Supporting ambiguous alleles in Cassiopeia Greedy algorithm.

Specific changes:

Enabled greedy splits to be found when states are ambiguous (by flattening ambiguous states)
Enabled missing data assignment in greedy splits when ambiguous states are present
A few new utilities around handling ambiguous states, implemented in cassiopeia.mixins.utilities
A few miscellaneous changes, not related to the greedy algorithm, especially around running tests when software is not installed.

Supporting parallel dissimilarity matrix

We implement a parallel dissimilarity matrix computation. Due to compatibility issues with numba, I introduce a wrapper function around the main bones of the dissimilarity map computation, and allow this to operate on batches. I notice a slight slow down for computations that would be numba jit-compatible (on the order of seconds) but I find drastic runtime improvements for computations that are not jit-compatible. This becomes particularly important for cases dealing with ambiguous alleles as currently the cluster dissimilarity function is not able to be compiled in nopython mode. Thus, dramatic speedups - roughly proportional the number of threads, ~10x speedup with 10 threads (as one would expect).

I also find that implementing more prescriptive cluster dissimilarity functions (e.g., specific function for linkage=np.min and dissimilarity_function=weighted_hamming_distance) allows the function to be compiled with nopython=False, forceobj=True, which does speed up the computation noticeably. I retain the original cluster computation to keep it backwards compatible and to allow users to experiment with various linkages and dissimilarity functions on cases where performance is not such an issue.

codecov[bot] commented 11 months ago

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (9f272fc) 79.54% compared to head (60a914e) 79.49%. Report is 2 commits behind head on master.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #226 +/- ## ========================================== - Coverage 79.54% 79.49% -0.06% ========================================== Files 89 89 Lines 7948 8050 +102 ========================================== + Hits 6322 6399 +77 - Misses 1626 1651 +25 ``` | [Files](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab) | Coverage Δ | | |---|---|---| | [cassiopeia/data/CassiopeiaTree.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9kYXRhL0Nhc3Npb3BlaWFUcmVlLnB5) | `91.78% <100.00%> (+0.02%)` | :arrow_up: | | [cassiopeia/solver/GreedySolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvR3JlZWR5U29sdmVyLnB5) | `98.64% <100.00%> (+0.07%)` | :arrow_up: | | [cassiopeia/solver/HybridSolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvSHlicmlkU29sdmVyLnB5) | `74.78% <100.00%> (-0.22%)` | :arrow_down: | | [cassiopeia/solver/NeighborJoiningSolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvTmVpZ2hib3JKb2luaW5nU29sdmVyLnB5) | `70.88% <ø> (ø)` | | | [cassiopeia/solver/UPGMASolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvVVBHTUFTb2x2ZXIucHk=) | `73.21% <ø> (ø)` | | | [cassiopeia/solver/VanillaGreedySolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvVmFuaWxsYUdyZWVkeVNvbHZlci5weQ==) | `100.00% <100.00%> (ø)` | | | [cassiopeia/mixins/utilities.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9taXhpbnMvdXRpbGl0aWVzLnB5) | `91.66% <93.75%> (+11.66%)` | :arrow_up: | | [cassiopeia/solver/DistanceSolver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvRGlzdGFuY2VTb2x2ZXIucHk=) | `59.32% <16.66%> (-1.21%)` | :arrow_down: | | [cassiopeia/solver/dissimilarity\_functions.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvZGlzc2ltaWxhcml0eV9mdW5jdGlvbnMucHk=) | `84.82% <84.37%> (-0.13%)` | :arrow_down: | | [cassiopeia/plotting/local.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9wbG90dGluZy9sb2NhbC5weQ==) | `88.01% <40.00%> (-1.95%)` | :arrow_down: | | ... and [2 more](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/226?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab) | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mattjones315 commented 11 months ago

Thanks @colganwi for a great review! I've made several of your requested changes, which were very insightful, and I believe I'm now ready for a second review.

One small comment is on the config.ini issue in .gitignore -- I found that if you specify it in the .gitignore, it doesn't get packaged. It's quite a tricky problem. So I removed it from tracking, added a dummy version to ./data and readded a cassiopeia/config.ini to my own personal distribution. Let me know if you have a better idea.

YosefLab / Cassiopeia