Closed mattjones315 closed 11 months ago
Attention: 36 lines in your changes are missing coverage. Please review.
Comparison is base (9f272fc) 79.54% compared to head (60a914e) 79.49%. Report is 2 commits behind head on master.
Thanks @colganwi for a great review! I've made several of your requested changes, which were very insightful, and I believe I'm now ready for a second review.
One small comment on the `config.ini` issue in `.gitignore`: I found that if you specify it in the `.gitignore`, it doesn't get packaged. It's quite a tricky problem, so I removed it from tracking, added a dummy version to `./data`, and re-added a `cassiopeia/config.ini` to my own personal distribution. Let me know if you have a better idea.
This PR implements two important changes.
Supporting ambiguous alleles in Cassiopeia Greedy algorithm.
Specific changes:
cassiopeia.mixins.utilities
Supporting parallel dissimilarity matrix
We implement a parallel dissimilarity matrix computation. Due to compatibility issues with `numba`, I introduce a wrapper function around the core of the dissimilarity map computation and allow it to operate on batches. I notice a slight slowdown (on the order of seconds) for computations that would be numba jit-compatible, but I find drastic runtime improvements for computations that are not jit-compatible. This becomes particularly important for cases dealing with ambiguous alleles, as the cluster dissimilarity function currently cannot be compiled in `nopython` mode. For these cases the speedup is dramatic and roughly proportional to the number of threads: ~10x with 10 threads, as one would expect.

I also find that implementing more prescriptive cluster dissimilarity functions (e.g., a specific function for `linkage=np.min` and `dissimilarity_function=weighted_hamming_distance`) allows the function to be compiled with `nopython=False, forceobj=True`, which does speed up the computation noticeably. I retain the original cluster computation to keep it backwards compatible and to allow users to experiment with various linkages and dissimilarity functions in cases where performance is not such an issue.
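To illustrate the batching idea described above, here is a minimal sketch of a wrapper that evaluates an arbitrary (not necessarily jit-compatible) dissimilarity function over batches of sample pairs in parallel. It is not the code in this PR: the names `hamming_distance`, `_compute_batch`, and `parallel_dissimilarity_map`, the `missing_state=-1` convention, and the round-robin batching are all assumptions made for illustration. The sketch defaults to a thread pool so it runs anywhere; for a pure-Python, CPU-bound dissimilarity function, a process pool (for example `concurrent.futures.ProcessPoolExecutor`) would likely be needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def hamming_distance(s1, s2, missing_state=-1):
    """Hypothetical dissimilarity: fraction of mismatches over non-missing sites."""
    compared = [
        (a, b) for a, b in zip(s1, s2)
        if a != missing_state and b != missing_state
    ]
    if not compared:
        return 0.0
    return sum(a != b for a, b in compared) / len(compared)


def _compute_batch(character_matrix, pairs, delta):
    """Wrapper that evaluates `delta` on one batch of (i, j) index pairs."""
    return [delta(character_matrix[i], character_matrix[j]) for i, j in pairs]


def parallel_dissimilarity_map(
    character_matrix, delta, threads=2, executor_cls=ThreadPoolExecutor
):
    """Build a symmetric dissimilarity matrix by farming out batches of pairs.

    `executor_cls` is an assumption of this sketch; swap in a process-based
    executor when `delta` is CPU-bound pure Python.
    """
    n = len(character_matrix)
    pairs = list(combinations(range(n), 2))

    # Round-robin split of all pairs into at most `threads` non-empty batches.
    batches = [pairs[k::threads] for k in range(threads)]
    batches = [batch for batch in batches if batch]

    dissimilarity = [[0.0] * n for _ in range(n)]
    with executor_cls(max_workers=threads) as pool:
        futures = [
            pool.submit(_compute_batch, character_matrix, batch, delta)
            for batch in batches
        ]
        # Results come back in the same order as the pairs in each batch.
        for batch, future in zip(batches, futures):
            for (i, j), value in zip(batch, future.result()):
                dissimilarity[i][j] = value
                dissimilarity[j][i] = value
    return dissimilarity


if __name__ == "__main__":
    matrix = [[1, 2, 0], [1, 0, 0], [-1, 2, 1]]
    for row in parallel_dissimilarity_map(matrix, hamming_distance, threads=3):
        print(row)
```

Because each batch only needs the character matrix, its list of index pairs, and the dissimilarity function, the per-pair function itself stays untouched, which is what lets non-jit-compatible functions benefit from the parallelism.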