broadinstitute / wot

A software package for analyzing snapshots of developmental processes
https://broadinstitute.github.io/wot/
BSD 3-Clause "New" or "Revised" License
140 stars 34 forks source link

WOT integration as an external kernel in CellRank #91

Closed Marius1311 closed 3 years ago

Marius1311 commented 3 years ago

Hi all! As discussed recently with @geoffschieb, I believe WOT and CellRank have complementary strengths that could really benefit from each other. WOT is a fantastic resource to compute transport maps based on time-course scRNA-seq data, and CellRank is a framework to both compute and analyze transition matrices in a Markov-chain based framework, see Fig. 1 below. @geoffschieb and I agreed it would be very cool to interface to WOT from within CellRank through an external kernel - similar to how external packages are interfaced from scanpy through scanpy external. We've done this for StationaryOT already, in collab with @zsteve see here.

Computationally, this means we write a shallow wrapper in CellRank which internally calls WOT to compute a transition matrix based on data in a passed AnnData object. The outputs are then streamlined so that they can use in subsequent CellRank analysis though estimators.

@michalk8 and myself would love to hear your comments/ideas/feedback before we proceed with the actual implementation. In particular, we were wondering

Looking forward to your input!

Fig. 1

image Fig. 1 | CellRank is composed of kernels and estimators. In CellRank, we call everything that computes a transition matrix a "kernel" and everything that does inference based on a transition matrix an "estimator". Kernels are not limited in the input data they can accept - we currently support kernels for transcriptomic, velocity-based and spatial data, but envisage to extend this to other data modalities in the future. Every kernel class inherits from a base kernel class, equipping it with the basic functionality to communicate with AnnData objects and with some basic tools to visualize transition matrices (i.e. though drawing random walks or projections into low dimensional embeddings). Estimators eat kernels and usually spit out initial, terminal and intermediate states as well as fate probabilities, however, we can imagine more general estimator in the future that do other things with transition matrices as well. The only requirement for estimators is really that they work within the framework of Markov chains on (sparse) transition matrices. See CellRank's kernels and estimators tutorial for more info.

zsteve commented 3 years ago

Hi @Marius1311,

Since others have not responded, here are some of my thoughts:

how you would recommend to build a single transition matrix from the transport maps. Would you just row-normalize them and stick them together to form one block matrix with 0s for non-adjacent time-points?

Yes, from my understanding of Cellrank this seems to be the most straightforward option (and the time-series format means transitions in non-adjacent timepoints is impossible). I guess the main issue here would be that the resulting Markov chain is very much non-reversible (and might not be suitable for some downstream computations I suspect). At least for computing absorption probabilities this sounds fine.

Perhaps it may be beneficial to have a subclass for time series experiments (as opposed to single-snapshot experiments), since different analyses might be relevant to time-series.

why do you re-compute PCA for every pair of adjacent time points to compute the cost matrix? Could we use a global PCA for all time points here? I'm asking because that's usually already contained in the AnnData object, so no computational overhead. Also, could we use the KNN graph connectivities for the cost matrix? These would be sparse.

From my understanding it is probably fine to use a global PCA coordinates, at least nothing at the conceptual level is lost. And anyway, except for constructing the cost matrix the actual choice of embedding is not used elsewhere. For the cost matrix, what exactly do you mean by kNN connectivities? If you mean a "sparse" cost matrix (with c(x, y) = +inf for x, y not connected by an edge -- this would correspond to exp(-C/eps) being sparse), unfortunately this doesn't really work (at least for reasonable k). The reason is because there's no guarantee the kNN is connected or allows a feasible solution of the OT problem (and anecdotally this seems to often be the case).

One possibility (which we use quite often) is to use the KeOps library (https://github.com/getkeops/keops) which allows you to do computations with the cost matrix (in terms of mat-vec products) without evaluating explicitly the full matrix.

Perhaps it would be easiest to try adapting WOT to CellRank in the most straightforward manner (i.e. just make a block matrix from the individual transport maps) and see how things go. If that looks promising, I'm happy to help with optimising the implementation :)

Let me know if you have any further questions :)

Stephen

geoffschieb commented 3 years ago

I agree that making a sub-class for time-series sounds like the best option. But then you wouldn’t necessarily need to construct a full transition matrix for this sub-class would you?

On Tue, Apr 27, 2021 at 10:09 PM Stephen Zhang @.***> wrote:

Hi @Marius1311 https://github.com/Marius1311,

Since others have not responded, here are some of my thoughts:

how you would recommend to build a single transition matrix from the transport maps. Would you just row-normalize them and stick them together to form one block matrix with 0s for non-adjacent time-points?

Yes, from my understanding of Cellrank this seems to be the most straightforward option (and the time-series format means transitions in non-adjacent timepoints is impossible). I guess the main issue here would be that the resulting Markov chain is very much non-reversible (and might not be suitable for some downstream computations I suspect). At least for computing absorption probabilities this sounds fine.

Perhaps it may be beneficial to have a subclass for time series experiments (as opposed to single-snapshot experiments), since different analyses might be relevant to time-series.

why do you re-compute PCA for every pair of adjacent time points to compute the cost matrix? Could we use a global PCA for all time points here? I'm asking because that's usually already contained in the AnnData object, so no computational overhead. Also, could we use the KNN graph connectivities for the cost matrix? These would be sparse.

From my understanding it is probably fine to use a global PCA coordinates, at least nothing at the conceptual level is lost. And anyway, except for constructing the cost matrix the actual choice of embedding is not used elsewhere. For the cost matrix, what exactly do you mean by kNN connectivities? If you mean a "sparse" cost matrix (with c(x, y) = +inf for x, y not connected by an edge -- this would correspond to exp(-C/eps) being sparse), unfortunately this doesn't really work (at least for reasonable k). The reason is because there's no guarantee the kNN is connected or allows a feasible solution of the OT problem (and anecdotally this seems to often be the case).

One possibility (which we use quite often) is to use the KeOps library ( https://github.com/getkeops/keops) which allows you to do computations with the cost matrix (in terms of mat-vec products) without evaluating explicitly the full matrix.

Perhaps it would be easiest to try adapting WOT to CellRank in the most straightforward manner (i.e. just make a block matrix from the individual transport maps) and see how things go. If that looks promising, I'm happy to help with optimising the implementation :)

Let me know if you have any further questions :)

Stephen

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/91#issuecomment-828147755, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJCQYWDRTN2FNG5PWPZXSDTK6KCHANCNFSM43ILM56Q .

michalk8 commented 3 years ago

Thanks a lot for the comments!

I guess the main issue here would be that the resulting Markov chain is very much non-reversible (and might not be suitable for some downstream computations I suspect). At least for computing absorption probabilities this sounds fine.

You're right, it causes some stability issues with GPCCA. As for absorption probabilities, I haven't tried to compute it yet, but since we're relying on macrotates to determine the terminal states. Currently, this seems to be solved by not having 0 block diagonal, e.g. by using:

0.1 * <transition matrix with transition only within individual timepoints> + 
0.9 * <transition matrix from WOT's transport maps between consecutive timepoints>

The reason is because there's no guarantee the kNN is connected or allows a feasible solution of the OT problem (and anecdotally this seems to often be the case).

I see, thanks for the explanation. I will definitely experiment with the KeOps, looks really nice.

But then you wouldn’t necessarily need to construct a full transition matrix for this sub-class would you?

Yes, might be also interesting to consider subsets of timepoints, will discuss with @Marius1311 .

michalk8 commented 3 years ago

As we're nearly done with the integration (see https://github.com/theislab/cellrank/pull/595, comments/feedback welcome), I was wondering whether you guys are planning a new release on PyPI. It's because of the addition of passing cost matrices when computing transport maps done here.

Marius1311 commented 3 years ago

@zsteve, do you actually expect the growth rate estimates to converge with increasing iteration numbers?

geoffschieb commented 3 years ago

In many cases they converge, but I’m not sure this is always guaranteed?

On Tue, Jun 1, 2021 at 2:44 AM Marius Lange @.***> wrote:

@zsteve https://github.com/zsteve, do you actually expect the growth rate estimates to converge with increasing iteration numbers?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/91#issuecomment-851987550, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJCQYV6IOGSCLG5KJN6BJ3TQSTYZANCNFSM43ILM56Q .

Marius1311 commented 3 years ago

Would it make sense to add an option to iterate until convergence? At least in case we start with uniform estimates?

geoffschieb commented 3 years ago

Perhaps as an advanced option. I would caution the user to inspect and understand the growth rates that are produced by that approach.

On Tue, Jun 1, 2021 at 4:15 AM Marius Lange @.***> wrote:

Would it make sense to add an option to iterate until convergence? At least in case we start with uniform estimates?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/91#issuecomment-852041661, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJCQYR2XM2RTIOXRTW52WDTQS6MRANCNFSM43ILM56Q .

zsteve commented 3 years ago

My suggestion would be that it's would be a good idea to have a some functionality to plot how the growth rates evolve over the repeated iterations (similar to in the WOT tutorial, where old/new rates are on the x/y axes respectively).

Marius1311 commented 3 years ago

That's relatively simple, see e.g. this post here: https://github.com/theislab/cellrank/pull/595#issuecomment-852139711

That would be a qualitative way to look at it. BTW, we merged the PR to CellRank, WOT is now included as an external kernel on the dev branch and will be part of our upcoming 1.4 release! Feel free to check this out and to leave comments either here or in issues in the CellRank repo.

michalk8 commented 3 years ago

I think this can be closed now, thanks @geoffschieb and @zsteve for valuable comments.

Marius1311 commented 3 years ago

Right, this is done. Thanks everyone! Check out the real time tutorial to see the integration in action.