How to speed up evolutionary coupling computation

Marius1311 commented 1 year ago

Hi Cassiopeia Team,

I was wondering how I could speed up the computation of cassiopeia.tl.compute_evolutionary_coupling. I'm trying to run this on a set of 20k lineage related cells, and I don't really get anywhere. The number of categories I use as meta_variable is like 7, so that one is not too bad.

Would it make sense to subsample cells from each category in meta_variable to speed up computations? I saw that the method has quadratic complexity in cell number. If so, how would I do that, i.e., how can I subsample Cassiopeia trees? I would appreciate any input, thanks in advance!

mattjones315 commented 1 year ago

Hi @Marius1311,

Thanks for your inquiry! Yes, it could make sense to downsample your tree if you are encountering terrible runtimes. Fortunately, we have tools for subsampling cells from an existing tree. There are two approaches that might be of interest to you:

First, you can use the remove_leaves_and_prune_lineages method implemented for the CassiopeiaTree class. This method takes in a list of leaves you'd like to remove from an existing tree. It will then remove these leaves and then prune edges up from the removed leaves until you hit an internal node that connects to an extant lineage. This process of pruning specifically can take a while, and there might be more efficient implementations, but I think with 20k cells it will not be an issue. Here's a code snippet from the UniformTreeSubsampler class that should work for you:
```
leaf_remove = np.random.choice(
tree.leaves, n_remove, replace=False
)
```

tree.remove_leaves_and_prune_lineages(leaf_remove) tree.collapse_unifurcations()



* Second, you might be interested in speeding up this procedure by constraining the analysis to a subclade of the tree (you might even get a great speedup by constraining analysis to clades induced by the first set of internal nodes splitting off of the root). You can get a clade induced by an internal node using the [`subset_clade`](https://cassiopeia-lineage.readthedocs.io/en/latest/api/reference/cassiopeia.data.CassiopeiaTree.html#cassiopeia.data.CassiopeiaTree.subset_clade) method implemented for the `CassiopeiaTree` class. 

I'd also like to ask how you are using `cas.tl.compute_evolutionary_coupling` in this case, as we were able to use this function for a tree of ~32k leaves without too much issue (though of course it did take a while). I wonder if you are re-computing the phylogenetic distance matrix for each iteration of the permutation test? If this is not the case and truly the quadratic run time here is impeding you progress, I think that the idea of having a subsampler option embedded in `cas.tl.compute_evolutionary_coupling` is quite a nice idea!

Marius1311 commented 1 year ago

Hi @mattjones315! Thanks a lot for your quick reply! Before going into subsampling, just a quick question regarding your second point: what would be the best way to ensure that the phylogenetic distance matrix is not re-computed for each iteration of the permutation test? Should I just pre-compute it and pass it via dissimilarity_map=X?

I'm just trying out this possibility, I've pre-computed the dissimilarity map, and it's a lot faster now.

mattjones315 commented 1 year ago

Hi @Marius1311 -- no problem and glad to help!

That's exactly right - if you pass in the dissimilarity_map explicitly, a new one won't be computed. This is because the permutation test just shuffles the labels, it doesn't actually change the relationships between leaves.

But this issue points out a potential bug - we should only be computing the dissimilarity_map a single time within this function, it shouldn't be computed for every permutation. I'll flag this for a bugfix.

Also - if you are working with CRISPR/Cas9-based lineages, I would pay special attention to the dissimilarity function you use. I've personally found that node distances (i.e., the number of nodes separating two leaves) is more effective than mutation distances (the number of mutations along edges separating two leaves). Happy to provide other pointers if you are seeing anything odd.

I'm going to go ahead and close this issue since it sounds like we've solved the original question, but please don't hesitate to re-open the issue if other questions emerge.

Marius1311 commented 1 year ago

Thanks @mattjones315! Are node distances equivalent to computing the shortest path on the tree between two given leaf nodes?

The way I have solved this for me now is as follows: as I want to use the trees within moslin (moscot-lineage) anyways, I set up a lineage problem, then by calling the problem.prepare method, I compute lineage distances within each time point, where each time point corresponds to one group of lineage-related cells, i.e. one lineage tree. See https://moscot.readthedocs.io/en/latest/notebooks/examples/problems/600_leaf_distance.html

I suppose that these lineage distances, which we compute as shortest paths along the tree, are equivalent to node distances in Cassiopeia. I then extract the pre-computed lineage distance matrices from the prepared moscot problem object (one per time point/tree), and pass these into cassiopeia.tl.compute_evolutionary_coupling method via dissimilarity_map=lineage_distance. It seems to me that moscot computes the node distances faster than Cassiopeia, but I did not benchmark this.

YosefLab / Cassiopeia

How to speed up evolutionary coupling computation #209