dynverse / dynmethods

A collection of 50+ trajectory inference methods within a common interface 📥📤
https://dynverse.org

PAGA #15

Closed zouter closed 5 years ago

zouter commented 6 years ago

Hello @falexwolf and @theislab

This issue is for discussing the wrapper for your trajectory inference method, PAGA, which we wrapped for our benchmarking study (10.1101/276907). In our dynmethods framework, we collected some meta information about your method, and created a docker wrapper so that all methods can be easily run and compared. The code for this wrapper is located in docker containers[1][2]. The way this container is structured is described in this vignette.

We created 2 separate wrappers:

We are creating this issue to ensure your method is being evaluated in the way it was designed for. The checklist below contains some important questions for you to have a look at.

The most convenient way for you to test and adapt the wrapper is to install dyno, download and modify these files, and run your method on a dataset of interest or one of our synthetic toy datasets. This is further described in this vignette. Once finished, we prefer that you fork the dynmethods repository, make the necessary changes, and send us a pull request. Alternatively, you can also send us the files and we will make the necessary changes.

If you have any further questions or remarks, feel free to reply to this issue.

Kind regards, @zouter and @rcannood

falexwolf commented 6 years ago

Hi! Wow, thank you! And sorry for the late response. I was offline on holidays for a week. I will go through this today or tomorrow.

falexwolf commented 6 years ago

This is already doing a very good job but misses a few important things, probably because we were unable to properly communicate our updated preprint and repository - the journal doesn't allow us to revise preprints...

Should I simply go and make a pull request here and add the few things that I think are missing? Or would you prefer to read this here in the issues and then go ahead and implement it yourself?

The revised PAGA preprint is here: https://rawgit.com/falexwolf/paga_paper/master/paga.pdf The revised and final PAGA repository is here: https://github.com/theislab/paga - please link to this repository...

zouter commented 6 years ago

Hi @falexwolf

Had a look at the updated preprint, looks great! Indeed too bad that the journal doesn't allow preprints, we'll run into that issue as well...

If you have the time, feel free to do a pull request! Specifically for PAGA we found it kind of hard to extract a continuous trajectory from the data with pseudotimes, so for the first wrapper we simply used the graph between clusters. We also included a second wrapper which will project the cells within the dimensionality reduction on the edges between the clusters. But as both these approaches will probably work suboptimally, it would be great if you could help us out with this!
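As a rough illustration of the second wrapper's idea (projecting cells onto the edges between clusters in a dimensionality reduction), here is a minimal sketch. It is not the code of the actual wrapper: the function names, the use of cluster centroids as edge endpoints, and the toy coordinates are all assumptions made for this example.

```python
from math import dist

def project_onto_segment(p, a, b):
    """Return (t, q): the position t in [0, 1] along segment a->b and the projected point q."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    if denom == 0.0:  # degenerate edge: both endpoints coincide
        return 0.0, tuple(a)
    t = sum(x * y for x, y in zip(ap, ab)) / denom
    t = max(0.0, min(1.0, t))  # clamp so the projection stays on the segment
    q = tuple(ai + t * x for ai, x in zip(a, ab))
    return t, q

def assign_cells_to_edges(cells, centroids, edges):
    """For each cell, pick the closest edge and report (edge, position along that edge)."""
    placements = []
    for p in cells:
        best = None
        for u, v in edges:
            t, q = project_onto_segment(p, centroids[u], centroids[v])
            d = dist(p, q)
            if best is None or d < best[0]:
                best = (d, (u, v), t)
        placements.append((best[1], best[2]))
    return placements

# Hypothetical 2D embedding: three cluster centroids and two edges of the cluster graph.
centroids = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (2.0, 2.0)}
edges = [("A", "B"), ("B", "C")]
cells = [(1.0, 0.3), (2.2, 1.0), (-1.0, 0.0)]
placements = assign_cells_to_edges(cells, centroids, edges)
```

Each cell ends up with an edge and a fraction along that edge, which is one possible way to turn a cluster graph into per-cell positions. Cells that lie beyond an endpoint are clamped to it, which is one reason this kind of projection can work suboptimally.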

Indeed, we didn't know about the updated repository, feel free to also change it in the pull request.

Soon, we will also add functionality to dynverse/dynwrap so that it will no longer have to save the data to disk for transfer to the docker, which will hopefully make the docker also work well on large datasets.

Wouter

falexwolf commented 6 years ago

Hi Wouter! Thanks!

I'll go ahead and make a PR today or tomorrow.

> We also included a second wrapper which will project the cells within the dimensionality reduction on the edges between the clusters.

Ah... I see. I wrote an extensive comment in the QC worksheet (seems to have already disappeared) essentially stating that PAGA doesn't give you single-cell orderings: for that, you need to combine it with the extended DPT in Scanpy. I'll make sure that this standard workflow is reflected in the pull request. Evidently, PAGA for a single line topology is not of much use... one can directly use DPT for this... Initially, we had a PAGA tool that would internally call DPT for the cell orderings. But we thought it's more transparent to disentangle this and call the two independent tools (DPT, PAGA) subsequently.

zouter commented 6 years ago

Hi @falexwolf

I managed to get the pseudotime working, thanks!

I added the following to the pull request

Based on this I got some sensible results on some datasets I tried:

A toy (!) linear dataset image

A toy (!) bifurcating dataset image

A (small) real dataset image

If you want I can also test it out with more complex and larger datasets.

falexwolf commented 6 years ago

Dear @zouter,

impressive how fast you're doing these things. The pictures above look sensible to me. They also look very nice. Kudos on your awesome project/package/environment.

Regarding your comments:

I'm pasting a few examples that hopefully illustrate what I write above...

image

Very happy to discuss further...

rcannood commented 6 years ago

Since @falexwolf and @zouter seemed to agree on the current wrapper, I merged PR #77 into devel :)

zouter commented 6 years ago

Hi Alex

Thanks for all the nice feedback, and if you want anything changed, feel free to ask / make a pull request!

falexwolf commented 6 years ago

Hi Wouter,

incredible how fast time passes... thank you for the nice discussion!

As you're writing a review and have insight into a lot of methods, I'd be interested in your opinion on the fundamental reason for what I explained above:

> the logic for obtaining "PAGA paths" is to simply follow adjacent nodes in the PAGA graph and, within each node, to follow DPT, that is, a line topology
>
> this is a "fancy" way of projecting to lines within each node in the PAGA graph
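The quoted logic can be sketched in a few lines: order cells first by the position of their cluster along a path through the cluster graph, then by pseudotime within each cluster. This is only an illustration under stated assumptions; the function name, the cluster labels, and the pseudotime values are hypothetical stand-ins for a PAGA path and DPT output, not Scanpy's implementation.

```python
def order_cells_along_path(path, cluster_of, pseudotime):
    """Order cells by the rank of their cluster in `path`, then by pseudotime within the cluster."""
    rank = {cluster: i for i, cluster in enumerate(path)}
    # Cells whose cluster is not on the path are left out of the ordering.
    on_path = [cell for cell in cluster_of if cluster_of[cell] in rank]
    return sorted(on_path, key=lambda cell: (rank[cluster_of[cell]], pseudotime[cell]))

# Hypothetical cluster assignments and pseudotime values (stand-ins for DPT output).
cluster_of = {"c1": "stem", "c2": "stem", "c3": "prog", "c4": "prog", "c5": "fateA", "c6": "fateB"}
pseudotime = {"c1": 0.10, "c2": 0.05, "c3": 0.40, "c4": 0.35, "c5": 0.90, "c6": 0.80}

# A path through adjacent nodes of the (hypothetical) cluster graph.
path = ["stem", "prog", "fateA"]
ordering = order_cells_along_path(path, cluster_of, pseudotime)
```

Within each cluster the ordering is a simple line, which matches the description of a line topology inside each node.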

When thinking about the TI problem, we thought that the right thing to do would be to look at the distribution of an "agnostic" ("diffusion") random walk on the graph. However, such a walk is much too undirected to be a meaningful model for a biological trajectory. Hence, we thought that we should condition the distribution of these walks on observed fates, which would make them directed. Then you can compute the mean or mode of this distribution together with some variance and obtain a "band" through the graph that contains realizations of walks that should be a reasonable proxy for biological trajectories.

However, we struggled with evaluating this distribution. We also thought about sampling from the distribution - simulating the walks - but thought that this wouldn't yield convergent estimates, as the distribution is so crazily combinatorial and high-dimensional. Recently though, a Science paper came out in which the authors have done exactly that and seem to get reasonable results (the method is referred to as URD), at least on their zebrafish embryo dataset. We, instead, thought that we could at least roughly approximate the "band of walks" around the average walk using a sequence of graph partitions. This thought gave rise to the mentioned logic and has the advantage that it uses a crucial piece of the standard analyses around: the Louvain clusters...
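To make the idea of "conditioning walks on observed fates" concrete, here is a toy sketch that simulates undirected random walks on a small graph and keeps only those that end in a chosen fate (plain rejection sampling). It is neither URD nor PAGA; the graph, the function name, and all parameters are assumptions chosen for illustration.

```python
import random
from collections import Counter

def simulate_conditioned_walks(neighbors, start, fate, terminal, n_walks, max_steps, seed=0):
    """Simulate random walks from `start`; keep only walks that stop in `fate`.

    Returns the number of kept walks and, per node, the fraction of kept walks visiting it.
    """
    rng = random.Random(seed)
    visits = Counter()
    kept = 0
    for _ in range(n_walks):
        node, walk = start, [start]
        for _ in range(max_steps):
            if node in terminal:
                break
            node = rng.choice(neighbors[node])  # unbiased ("agnostic") step
            walk.append(node)
        if node == fate:  # condition on the fate: reject every other walk
            kept += 1
            visits.update(set(walk))
    if kept == 0:
        return 0, {}
    return kept, {n: c / kept for n, c in visits.items()}

# Hypothetical bifurcating graph: root branches into a and b, each leading to one fate.
neighbors = {
    "root": ["a", "b"],
    "a": ["root", "fateA"],
    "b": ["root", "fateB"],
}
kept, freq = simulate_conditioned_walks(
    neighbors, start="root", fate="fateA", terminal={"fateA", "fateB"},
    n_walks=2000, max_steps=50,
)
```

The visit fractions give a crude "band" of nodes that conditioned walks pass through. On a graph this small the estimate is stable, but the number of possible walks grows very quickly with graph size, which is the convergence concern raised above.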

Anyways, I'd be very interested in seeing whether URD is robust and able to approximate the path distribution reasonably well in general settings.

Do you already have it in your review?

rcannood commented 5 years ago

If we had answered this issue 20 days ago, the answer would have been no. But since I'm answering today: I added an URD container yesterday! :) We will rerun the benchmarks soon, so we'll then be able to find out how URD performs.

Since you seem to be satisfied with the files we created for PAGA, I'm closing this issue. If you have any further comments or remarks, feel free to create a new issue :)

zouter commented 5 years ago

Hi @falexwolf

Just a quick question, but does using the connectivities_tree from the PAGA output always return a connected tree? The reason we are wondering this is because we have a hard time getting some disconnected graphs from PAGA, even on very simple toy data:

image

falexwolf commented 5 years ago

Both connectivities (the graph) and connectivities_tree, which is just the maximal connectivity-spanning-subtree of this graph, are "never disconnected". But you'll see very tiny values. The weights in these graphs correspond to the ratio of the observed number of inter-edges versus the number of edges expected under random edge-assignment between any two graph partitions/clusters/branches. The way in which disconnected graphs or trees are obtained is by thresholding the graph weights.
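A small sketch of such an observed-versus-expected ratio follows. Note the assumption: the expected count here uses a simple degree-based null model (product of the clusters' total degrees divided by twice the number of edges), chosen only for illustration. It is not claimed to be the exact statistic PAGA computes, and the function name and toy graph are hypothetical.

```python
from collections import Counter
from itertools import combinations

def connectivity_ratios(edges, cluster_of):
    """Ratio of observed to expected inter-cluster edges for every pair of clusters."""
    degree_sum = Counter()  # total degree of the cells in each cluster
    observed = Counter()    # number of edges between each pair of distinct clusters
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu != cv:
            observed[frozenset((cu, cv))] += 1
    two_m = 2 * len(edges)
    ratios = {}
    for a, b in combinations(sorted(degree_sum), 2):
        # Assumed null model: edges assigned at random, proportional to cluster degrees.
        expected = degree_sum[a] * degree_sum[b] / two_m
        ratios[(a, b)] = observed[frozenset((a, b))] / expected
    return ratios

# Hypothetical neighborhood graph on 8 cells in 3 clusters.
cluster_of = {0: "X", 1: "X", 2: "X", 3: "Y", 4: "Y", 5: "Y", 6: "Z", 7: "Z"}
edges = [
    (0, 1), (1, 2), (0, 2),  # within X
    (3, 4), (4, 5), (3, 5),  # within Y
    (6, 7),                  # within Z
    (2, 3), (1, 4),          # between X and Y
    (5, 6),                  # between Y and Z
]
ratios = connectivity_ratios(edges, cluster_of)
```

In this toy graph the X-Z pair has no inter-edges at all and so gets a ratio of exactly zero. The single Y-Z edge yields a larger ratio than the two X-Y edges because Z is small, so fewer edges are expected by chance.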

As this should be possible interactively, it's part of the plotting function pl.paga, where we set this threshold to 0.01 by default, which is a bit conservative. Often we also use 0.05. The threshold on this ratio is essentially the only parameter of PAGA itself. You could translate the ratio into a p-value, but then you are overly confident in the degree of meaningfulness of the null (random assignment of edges between partitions). In essence, you'd never get something connected, as you'd always reject the null because the partitioning algorithm itself already introduces a lot of bias. See Suppl. Figure 2 of https://rawgit.com/falexwolf/paga_paper/master/paga.pdf.
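The effect of the threshold can be sketched as follows: edges with a weight below the threshold are dropped, and the remaining graph is split into connected components. The weights and the function name are hypothetical, and keeping edges with weight greater than or equal to the threshold is an assumption of this sketch.

```python
def components_after_threshold(weights, threshold):
    """Drop edges with weight below `threshold` and return the connected components."""
    nodes = {n for pair in weights for n in pair}
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), w in weights.items():
        if w >= threshold:  # assumption: edges at or above the threshold are kept
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical connectivity weights: fully connected, but some values are tiny.
weights = {("A", "B"): 0.8, ("B", "C"): 0.03, ("A", "C"): 0.002}

connected = components_after_threshold(weights, 0.01)  # B-C survives, one component
split = components_after_threshold(weights, 0.05)      # B-C is dropped, C is isolated
```

With these toy weights the graph stays connected at 0.01 and splits into two components at 0.05, which mirrors how a fully connected weighted graph only becomes disconnected through the choice of threshold.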

zouter commented 5 years ago

Thanks Alex for your response! Everything makes sense now :+1: . We updated the wrapper to reflect this: https://github.com/dynverse/ti_paga/blob/master/run.py#L110

We set the default of this parameter to 0.05, given that it is the value used most often in your examples. But we can also set the default to 0.01 if that makes more sense to you. The user can change it if they expect the trajectories to be more disconnected.

PS: I really like that you are honest about this p-value! :+1: Just an idea, but could this value also depend on the density between two clusters (or milestones)?

falexwolf commented 5 years ago

Thank you for updating the wrapper! :smile:

I'd say 0.05 is a meaningful default.

Regarding your PS: Yes, this value depends on all sorts of things. It's really just the zeroth-order approximation: judge two things as disconnected if they have fewer inter-connections than expected under random assignment. It's a rough estimate. But I simply couldn't come up with a more sophisticated model that would be as generally applicable...