Closed Marius1311 closed 1 year ago
Hi @Marius1311,
Thanks for your inquiry! Yes, it could make sense to downsample your tree if you are encountering terrible runtimes. Fortunately, we have tools for subsampling cells from an existing tree. There are two approaches that might be of interest to you:
remove_leaves_and_prune_lineages
method implemented for the CassiopeiaTree
class. This method takes in a list of leaves you'd like to remove from an existing tree. It will then remove these leaves and then prune edges up from the removed leaves until you hit an internal node that connects to an extant lineage. This process of pruning specifically can take a while, and there might be more efficient implementations, but I think with 20k cells it will not be an issue. Here's a code snippet from the UniformTreeSubsampler
class that should work for you:
leaf_remove = np.random.choice(
tree.leaves, n_remove, replace=False
)
tree.remove_leaves_and_prune_lineages(leaf_remove) tree.collapse_unifurcations()
* Second, you might be interested in speeding up this procedure by constraining the analysis to a subclade of the tree (you might even get a great speedup by constraining analysis to clades induced by the first set of internal nodes splitting off of the root). You can get a clade induced by an internal node using the [`subset_clade`](https://cassiopeia-lineage.readthedocs.io/en/latest/api/reference/cassiopeia.data.CassiopeiaTree.html#cassiopeia.data.CassiopeiaTree.subset_clade) method implemented for the `CassiopeiaTree` class.
I'd also like to ask how you are using `cas.tl.compute_evolutionary_coupling` in this case, as we were able to use this function for a tree of ~32k leaves without too much issue (though of course it did take a while). I wonder if you are re-computing the phylogenetic distance matrix for each iteration of the permutation test? If this is not the case and truly the quadratic run time here is impeding you progress, I think that the idea of having a subsampler option embedded in `cas.tl.compute_evolutionary_coupling` is quite a nice idea!
Hi @mattjones315! Thanks a lot for your quick reply! Before going into subsampling, just a quick question regarding your second point: what would be the best way to ensure that the phylogenetic distance matrix is not re-computed for each iteration of the permutation test? Should I just pre-compute it and pass it via dissimilarity_map=X
?
I'm just trying out this possibility, I've pre-computed the dissimilarity map, and it's a lot faster now.
Hi @Marius1311 -- no problem and glad to help!
That's exactly right - if you pass in the dissimilarity_map explicitly, a new one won't be computed. This is because the permutation test just shuffles the labels, it doesn't actually change the relationships between leaves.
But this issue points out a potential bug - we should only be computing the dissimilarity_map a single time within this function, it shouldn't be computed for every permutation. I'll flag this for a bugfix.
Also - if you are working with CRISPR/Cas9-based lineages, I would pay special attention to the dissimilarity function you use. I've personally found that node distances (i.e., the number of nodes separating two leaves) is more effective than mutation distances (the number of mutations along edges separating two leaves). Happy to provide other pointers if you are seeing anything odd.
I'm going to go ahead and close this issue since it sounds like we've solved the original question, but please don't hesitate to re-open the issue if other questions emerge.
Thanks @mattjones315! Are node distances equivalent to computing the shortest path on the tree between two given leaf nodes?
The way I have solved this for me now is as follows: as I want to use the trees within moslin (moscot-lineage) anyways, I set up a lineage problem, then by calling the problem.prepare
method, I compute lineage distances within each time point, where each time point corresponds to one group of lineage-related cells, i.e. one lineage tree. See https://moscot.readthedocs.io/en/latest/notebooks/examples/problems/600_leaf_distance.html
I suppose that these lineage distances, which we compute as shortest paths along the tree, are equivalent to node distances in Cassiopeia. I then extract the pre-computed lineage distance matrices from the prepared moscot problem object (one per time point/tree), and pass these into cassiopeia.tl.compute_evolutionary_coupling
method via dissimilarity_map=lineage_distance
. It seems to me that moscot computes the node distances faster than Cassiopeia, but I did not benchmark this.
Hi Cassiopeia Team,
I was wondering how I could speed up the computation of
cassiopeia.tl.compute_evolutionary_coupling
. I'm trying to run this on a set of 20k lineage related cells, and I don't really get anywhere. The number of categories I use asmeta_variable
is like 7, so that one is not too bad.Would it make sense to subsample cells from each category in
meta_variable
to speed up computations? I saw that the method has quadratic complexity in cell number. If so, how would I do that, i.e., how can I subsample Cassiopeia trees? I would appreciate any input, thanks in advance!