PAGA - Githubissues

zouter commented 6 years ago

Hello @falexwolf and @theislab

This issue is for discussing the wrapper for your trajectory inference method, PAGA, which we wrapped for our benchmarking study (10.1101/276907). In our dynmethods framework, we collected some meta information about your method, and created a docker wrapper so that all methods can be easily run and compared. The code for this wrapper is located in docker containers[1][2]. The way this container is structured is described in this vignette.

We created 2 separate wrappers:

PAGA: The regular version in which cells are assigned to clusters, and the network between these clusters is used as milestone network.
projected PAGA: Similar to regular PAGA, but cells are projected onto the edges between milestones (in the space of the dimensionality reduction).

We are creating this issue to ensure your method is being evaluated in the way it was designed for. The checklist below contains some important questions for you to have a look at.

[x] Parameters, defined in definition.yml (more info)
- Are all important parameters described in this file?
- For each parameter, is the proposed default value reasonable?
- For each parameter, is the proposed parameter space reasonable (e.g. lower and upper boundaries)?
- Is the description of the parameters correct and up-to-date?
[x] Input, defined in definition.yml and loaded in run.py (more info)
- Is the correct type of expression requested (raw counts or normalised expression)?
- Is all prior information (required or optional) requested?
- Would some other type of prior information help the method?
[x] Output, defined in definition.yml and saved in run.py (more info)
- Is the output correctly processed towards the common trajectory model? Would some other postprocessing method make more sense?
- Is all relevant output saved (dimensionality reduction, clustering/grouping, pseudotime, ...)
[x] Wrapper script, see run.py (more info)
- This is a script that is executed upon starting the docker container. It will receive several input files as defined by definition.yml, and is expected to produce certain output files, also as defined by definition.yml.
- Is the script a correct representation of the general workflow a user is expected to follow when they want to apply your method to their data?
[x] Quality control, see the qc worksheet
- We also evaluated the implementation of a method based on a large checklist of good software development practices.
- Are the answers we wrote down for your method correct and up to date? Do you disagree with certain answers? (Feel free to leave a comment in the worksheet)
- You can improve the QC score of your method by implementing the required changes and letting us know. Do not gloss over this, as it is the easiest way to improve the overall ranking of your TI method in our study!

The most convenient way for you to test and adapt the wrapper is to install dyno, download and modify these files, and run your method on a dataset of interest or one of our synthetic toy datasets. This is further described in this vignette. Once finished, we prefer that you fork the dynmethods repository, make the necessary changes, and send us a pull request. Alternatively, you can also send us the files and we will make the necessary changes.

If you have any further questions or remarks, feel free to reply to this issue.

Kind regards, @zouter and @rcannood

falexwolf commented 6 years ago

Hi! Wow, thank you! And sorry for the late response. I was offline on holidays for a week. I will go through this today or tomorrow.

falexwolf commented 6 years ago

This is already doing a very good job but misses a few important things, probably because we were unable to properly communicate our updated preprint and repository - the journal doesn't allow revising preprints...

Should I simply go and make a pull request here and add the few things that I think are missing? Or would you prefer to read this here in the issues and then go ahead and implement it yourself?

The revised PAGA preprint is here: https://rawgit.com/falexwolf/paga_paper/master/paga.pdf
The revised and final PAGA repository is here: https://github.com/theislab/paga - please link to this repository...

zouter commented 6 years ago

Hi @falexwolf

Had a look at the updated preprint, looks great! Indeed too bad that the journal doesn't allow preprints, we'll run into that issue as well...

If you have the time, feel free to do a pull request! Specifically for PAGA we found it kind of hard to extract a continuous trajectory from the data with pseudotimes, so for the first wrapper we simply used the graph between clusters. We also included a second wrapper which will project the cells within the dimensionality reduction on the edges between the clusters. But as both these approaches will probably work suboptimally, it would be great if you could help us out with this!

Indeed, we didn't know about the updated repository, feel free to also change it in the pull request.

Soon, we will also add functionality to dynverse/dynwrap so that it will no longer have to save the data to disk for transfer to the docker, which will hopefully make the docker also work well on large datasets.

Wouter

falexwolf commented 6 years ago

Hi Wouter! Thanks!

I'll go ahead and make a PR today or tomorrow.

> We also included a second wrapper which will project the cells within the dimensionality reduction on the edges between the clusters.

Ah... I see. I wrote an extensive comment in the QC worksheet (seems to have already disappeared) essentially stating that PAGA doesn't give you single-cell orderings: for that, you need to combine it with the extended DPT in Scanpy. I'll make sure that this standard workflow is reflected in the pull request. Evidently, PAGA for a single line topology is not of much use... one can directly use DPT for this... Initially, we had a PAGA tool that would internally call DPT for the cell orderings. But we thought it's more transparent to disentangle this and call the two independent tools (DPT, PAGA) subsequently.

zouter commented 6 years ago

Hi @falexwolf

I managed to get the pseudotime working, thanks!

I added the following to the pull request

Required root cell
The wrapper will now use the raw counts instead of the (log) normalised expression
The method now returns a "branch trajectory", which needs three things:
- Branch progressions: for each cell an assigned branch (= louvain cluster) + the pseudotime (percentage) within that branch. I had to transform the pseudotime vector a little bit to extract this: for each louvain cluster I scaled the pseudotime between 0 and 1.
- Branch network: How the branches are connected. Although extracting the network is easy, I had to write some extra lines to extract the directionality of the edges. I inferred the directionality of the edges by averaging the pseudotime within every louvain cluster, so that if A -> B then the average pseudotime of cells within A should be lower than B
- Branches: Which tells you the length of each branch. I used the difference between the max and min dpt_pseudotime for this
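
The three pieces listed above can be sketched in plain Python. This is only a minimal illustration of the described transformations, not the code in run.py: the function names, the toy pseudotime values and the cluster labels are all assumptions made for this example.

```python
from collections import defaultdict


def _pseudotime_by_cluster(pseudotime, cluster):
    """Collect the pseudotime values of the cells in each cluster."""
    grouped = defaultdict(list)
    for cell, label in cluster.items():
        grouped[label].append(pseudotime[cell])
    return grouped


def branch_progressions(pseudotime, cluster):
    """Rescale pseudotime to [0, 1] separately within each cluster."""
    grouped = _pseudotime_by_cluster(pseudotime, cluster)
    progressions = {}
    for cell, label in cluster.items():
        low, high = min(grouped[label]), max(grouped[label])
        span = high - low
        # A cluster whose cells all share one pseudotime gets percentage 0.
        percentage = (pseudotime[cell] - low) / span if span > 0 else 0.0
        progressions[cell] = (label, percentage)
    return progressions


def branch_lengths(pseudotime, cluster):
    """Branch length = max minus min pseudotime of the cells in the cluster."""
    grouped = _pseudotime_by_cluster(pseudotime, cluster)
    return {label: max(v) - min(v) for label, v in grouped.items()}


def orient_edges(undirected_edges, pseudotime, cluster):
    """Point each edge from the cluster with the lower mean pseudotime
    to the cluster with the higher mean pseudotime."""
    grouped = _pseudotime_by_cluster(pseudotime, cluster)
    mean = {label: sum(v) / len(v) for label, v in grouped.items()}
    return [(a, b) if mean[a] <= mean[b] else (b, a) for a, b in undirected_edges]


# Toy data (hypothetical): five cells in two louvain clusters.
pseudotime = {"c1": 0.0, "c2": 0.1, "c3": 0.2, "c4": 0.6, "c5": 1.0}
cluster = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B"}

progressions = branch_progressions(pseudotime, cluster)
lengths = branch_lengths(pseudotime, cluster)
network = orient_edges([("B", "A")], pseudotime, cluster)
```

In this toy case the edge given as B-A comes out as A -> B, because the mean pseudotime in A is lower than in B.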

Based on this I got some sensible results on some datasets I tried:

A toy (!) linear dataset

A toy (!) bifurcating dataset

A (small) real dataset

If you want I can also test it out with more complex and larger datasets.

falexwolf commented 6 years ago

Dear @zouter,

Impressive how fast you're doing these things. The pictures above look sensible to me. They also look very nice. Kudos on your awesome project/package/environment.

Regarding your comments:

required root cell is good!
using raw counts and then a preprocessing recipe is good
it can definitely be that multiple louvain clusters appear in a line topology; hence I'd call each louvain cluster a segment of the line and not a branch... maybe this is how you're understanding the notion branch anyways...
you're right, the PAGA graph is not directed... it's directed when you infer it based on RNA velocity (see Figure 3 of the updated preprint here)... still, even though the PAGA graph for plain gene expression data is not directed, directions are often implicit as the topology is so clear... the logic for obtaining "PAGA paths" is to simply follow adjacent nodes in the PAGA graph and within each node, following DPT, that is, a line topology - this is a "fancy" way of projecting to lines within each node in the PAGA graph... it's very different from projecting on edges between nodes in a graph (as done, e.g., by StemID)... I think it's a good assumption as within each node, there is a dense distribution of cells which is - for a given resolution - meaningfully approximated as an object with trivial topology: a dot or a line. By contrast, the distribution of cells between clusters or nodes in the PAGA graph, is very sparse and it's hard to be confident about how cells exactly are distributed... hence, we only do the connectivity test which states that some clusters should be connected and others shouldn't
the way in which you extract directionality from the undirected PAGA graph ("A -> B then the average pseudotime of cells within A should be lower than B") as you describe above sounds fine to me, this will never be violated; an alternative is to just use the shortest path on the graph weighted by inverse connectivity from starting node to end node... to do this unsupervised, one needs to figure out all degree-one nodes as end nodes and compute all shortest paths in the PAGA graph... all of this is just a few lines of code and I've done this but I didn't put this in the tutorial, only in the robustness notebooks for PAGA... let me know if you need this / the whole thing is robust as the PAGA graph is essentially noise-free
the convention that you normalize pseudotime from 0 to 1 within each branch is, of course, also fine
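
The shortest-path alternative mentioned above could look roughly like the following. This is a sketch under assumptions, not the code from the PAGA robustness notebooks: the function names and the toy connectivity values are made up, and the graph is stored as a plain dictionary of edge weights rather than the matrix Scanpy produces.

```python
import heapq


def shortest_path(weights, start, end):
    """Dijkstra on an undirected graph where edge cost = 1 / connectivity."""
    adjacency = {}
    for (a, b), w in weights.items():
        if w <= 0:
            continue  # zero connectivity means there is no edge
        adjacency.setdefault(a, []).append((b, 1.0 / w))
        adjacency.setdefault(b, []).append((a, 1.0 / w))

    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == end:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, step in adjacency.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))

    if end not in best:
        return None  # end node is not reachable from start
    path = [end]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]


def degree_one_nodes(weights):
    """Nodes with exactly one incident edge, used here as candidate end nodes."""
    degree = {}
    for (a, b), w in weights.items():
        if w > 0:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
    return sorted(node for node, d in degree.items() if d == 1)


# Toy PAGA-like graph (hypothetical weights): a root, a middle node, two fates.
weights = {
    ("root", "mid"): 0.9,
    ("mid", "fate1"): 0.8,
    ("mid", "fate2"): 0.7,
}
ends = [node for node in degree_one_nodes(weights) if node != "root"]
paths = {end: shortest_path(weights, "root", end) for end in ends}
```

Using the inverse connectivity as edge cost means that strongly connected clusters count as close together, so a path prefers well-supported edges over weak shortcuts.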

I'm pasting a few examples that hopefully illustrate what I write above...

Very happy to discuss further...

rcannood commented 6 years ago

Since @falexwolf and @zouter seemed to agree on the current wrapper, I merged PR #77 into devel :)

zouter commented 6 years ago

Hi Alex

Indeed you're right that edge would be a better name than branch, as a branch can consist of several edges.
We will probably try to add the ability to include RNA velocity information later, but it's a really powerful feature that PAGA can use this to infer directionality! In the long term we will probably also include some other directionality measures (e.g. based on stemness), but that's long term ;). In principle, the only way to add directionality now is to root the trajectory based on some root cell or milestone.
Thank you for the DPT-PAGA explanation, it is clear to me now. Sounds like a good idea!
If the directionality inference based on average pseudotimes is fine for you, then I would keep it as it is now. The directionality of an edge will, in the current state, also not influence the evaluation scores in any way.

Thanks for all the nice feedback, and if you want anything changed, feel free to ask / make a pull request!

falexwolf commented 6 years ago

Hi Wouter,

incredible how fast time passes... thank you for the nice discussion!

As you're writing a review and have insight over a lot of methods, I'd be interested in your opinion on the fundamental reason for what I explained above:

> the logic for obtaining "PAGA paths" is to simply follow adjacent nodes in the PAGA graph and within each node, following DPT, that is, a line topology - this is a "fancy" way of projecting to lines within each node in the PAGA graph

When thinking about the TI problem, we thought that the right thing to do would be to look at the distribution of an "agnostic" ("diffusion") random walk on the graph. However, such a walk is much too undirected to be a meaningful model for a biological trajectory. Hence, we thought that we should condition the distribution of these walks on observed fates, which would make them directed. Then you can compute the mean or mode of this distribution together with some variance and obtain a "band" through the graph that contains realizations of walks that should be a reasonable proxy for biological trajectories.

However, we struggled with evaluating this distribution. We also thought about sampling from the distribution - simulating the walks - but thought that this wouldn't yield convergent estimates as the distribution is so crazy combinatorial high-dimensional. Recently though, a Science paper came out in which the authors have done exactly that and seem to get reasonable results (the method is referred to as URD), at least on their zebrafish embryo dataset. We, instead, thought that we could at least roughly approximate the "band of walks" around the average walk using a sequence of graph partitions. This thought gave rise to the mentioned logic and has the advantage that it uses a crucial piece of the standard analyses around: the louvain clusters...

Anyways, I'd be very interested in seeing whether URD is robust and able to approximate the path distribution reasonably well in general settings.

Do you already have it in your review?

rcannood commented 5 years ago

If we had answered this issue 20 days ago, the answer would have been no. But since I'm answering today: I added a URD container yesterday! :) We will rerun the benchmarks soon, so we'll then be able to find out how URD performs.

Since you seem to be satisfied with the files we created for PAGA, I'm closing this issue. If you have any further comments or remarks, feel free to create a new issue :)

zouter commented 5 years ago

Hi @falexwolf

Just a quick question, but does using the connectivities_tree from the PAGA output always return a connected tree? The reason we are wondering this is because we have a hard time getting some disconnected graphs from PAGA, even on very simple toy data:

falexwolf commented 5 years ago

Both connectivities (the graph) and connectivities_tree, which is just the maximal connectivity-spanning subtree of this graph, are "never disconnected". But you'll see very tiny values. The weights in these graphs correspond to the ratio of the observed number of inter-edges versus the number of edges expected under random edge-assignment between any two graph partitions/clusters/branches. The way in which disconnected graphs or trees are obtained is by thresholding the graph weights.
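
To illustrate why a maximal spanning subtree of such a graph always stays connected, even when some weights are tiny, here is a sketch using Kruskal's algorithm on the strongest edges first. It is an assumption-laden toy: the function name and the weights are invented for this example, and it is not how Scanpy computes the tree internally.

```python
def maximum_spanning_tree(weights):
    """Kruskal's algorithm, taking the strongest edges first."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = {}
    for (a, b), w in sorted(weights.items(), key=lambda item: -item[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree[(a, b)] = w
    return tree


# Hypothetical connectivities between four clusters; note the tiny values.
connectivities = {
    ("A", "B"): 0.90,
    ("B", "C"): 0.60,
    ("A", "C"): 0.30,
    ("C", "D"): 0.002,
    ("A", "D"): 0.001,
}
tree = maximum_spanning_tree(connectivities)
```

Cluster D is still attached to the tree through its strongest edge (C-D), although that weight is only 0.002. It becomes disconnected only once the weights are thresholded.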

As this should be possible interactively, it's part of the plotting function pl.paga, where we set this threshold to 0.01 by default, which is a bit conservative. Often we also use 0.05. The threshold on this ratio is essentially the only parameter of PAGA itself. You could translate the ratio into a p-value, but then you are overly confident in the degree of meaningfulness of the null (random assignment of edges between partitions). In essence, you'd never get something connected as you'd always reject the null as already the partitioning algorithm itself introduces a lot of bias. See Suppl. Figure 2 of https://rawgit.com/falexwolf/paga_paper/master/paga.pdf.
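
The effect of the threshold can be sketched without Scanpy. The helper names and toy weights below are assumptions for illustration only, and whether an edge exactly at the threshold is kept is also an assumption of this sketch.

```python
def threshold_graph(weights, threshold):
    """Keep only edges whose connectivity is at least `threshold`."""
    return {edge: w for edge, w in weights.items() if w >= threshold}


def connected_components(nodes, edges):
    """Group nodes into components with a simple depth-first search."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical connectivities: cluster D is only weakly linked to the rest.
connectivities = {("A", "B"): 0.9, ("B", "C"): 0.6, ("C", "D"): 0.03}
nodes = {"A", "B", "C", "D"}

components_001 = connected_components(nodes, threshold_graph(connectivities, 0.01))
components_005 = connected_components(nodes, threshold_graph(connectivities, 0.05))
```

With these toy weights, a threshold of 0.01 keeps all four clusters in one component, while 0.05 drops the C-D edge and leaves D on its own.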

zouter commented 5 years ago

Thanks Alex for your response! Everything makes sense now :+1: . We updated the wrapper to reflect this: https://github.com/dynverse/ti_paga/blob/master/run.py#L110

We set the default of this parameter to 0.05, given that it is the value used most often in your examples. But we can also set the default to 0.01 if that makes more sense to you. The user can change it if they expect the trajectories to be more disconnected.

PS: I really like that you are honest about this p-value! :+1: Just an idea, but could this value also depend on the density between two clusters (or milestones)?

falexwolf commented 5 years ago

Thank you for updating the wrapper! :smile:

I'd say 0.05 is a meaningful default.

Regarding your PS: Yes, this value depends on all sorts of things. It's really just the zeroth order approximation to judge two things as disconnected if they have fewer inter-connections than expected under random assignment. It's a rough estimate. But I simply couldn't come up with a more sophisticated model that would be as generally applicable...

dynverse / dynmethods

PAGA #15