farrellja / URD

URD - Reconstruction of Branching Developmental Trajectories
GNU General Public License v3.0

cells' pseudotime assigned as NA #30

Closed MichaelPeibo closed 5 years ago

MichaelPeibo commented 5 years ago

Hi URD team, Thanks for providing such a great analysis tool!

I am running into some trouble with my own data set. After running calcDM(urd, knn = 100, sigma=16) and floodPseudotime(urd, root.cells=root.cells, n=150, minimum.cells.flooded=2, verbose=T), some cells are not assigned pseudotime values, especially at the terminal stage (see below). [image]

I checked the running log, and 80% of cells are visited (14597 in total). When I increase sigma from 16 to 20, about 90% of cells are visited. I don't yet know what the results look like for sigma=20 (this step takes a long time), but it looks like increasing sigma increases the percentage of cells visited.

However, are there any drawbacks to only increasing the sigma value?

I have read this detailed issue, but I suspect my problem is related to poor pseudotime calculation rather than to too few random walk simulations (see below). [image]

MichaelPeibo commented 5 years ago

An UPDATE with results for sigma=28: [image] Now ~98% of cells have been visited. Still, I am wondering whether there are any drawbacks to simply increasing sigma so that most cells get visited.

farrellja commented 5 years ago

Hi Michael,

The weight of links between cells for pseudotime & random walk calculations is determined based on their distance (in terms of gene expression), then passed through a Gaussian function (whose width is determined by sigma). Thus, increasing the sigma allows cells that are more dissimilar from each other in terms of gene expression to remain connected. The pseudotime calculation function uses those links as a biased coin flip, and it will terminate if it stops visiting new cells. That's why at smaller sigmas, some of your cells do not get visited. You can override this behavior by setting minimum.cells.flooded=0 in the floodPseudotime command. However, you may find that the function takes a very long time to run (since it will now keep going until every cell is visited, even if that cell is very poorly connected and hard to get to). Since cell connections are based on their distance in gene expression, the best sigma value is highly dependent on the data set. For instance, it will be affected by how many variable genes are considered, how complex the transcriptomes are, and simply how dramatically gene expression changes during your developmental process.
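To make the effect of sigma concrete, here is a minimal base R sketch. It is a toy model, not URD's implementation: it assumes a kernel of the form exp(-d^2 / (2 * sigma^2)) without any normalization, the function names gaussian_weight and toy_flood are hypothetical, and the data are made up. It only mimics the behavior described above: links act as biased coin flips, and the flood stops once fewer than minimum.cells.flooded new cells are reached, leaving the remaining cells as NA.

```r
# Toy model only; URD/destiny may use a different kernel and normalization.
gaussian_weight <- function(d, sigma) exp(-d^2 / (2 * sigma^2))

# Hypothetical flood: returns the step at which each cell was first visited,
# or NA for cells that were never reached.
toy_flood <- function(x, root, sigma, minimum.cells.flooded = 2,
                      max.steps = 1000) {
  w <- gaussian_weight(as.matrix(dist(x)), sigma)  # pairwise link weights
  diag(w) <- 0                                     # no self-links
  visited <- rep(FALSE, length(x))
  visited[root] <- TRUE
  step.visited <- rep(NA_integer_, length(x))
  step.visited[root] <- 0L
  for (step in seq_len(max.steps)) {
    if (all(visited)) break
    # Chance that a cell is reached from at least one already-visited cell
    p <- 1 - apply(1 - w[visited, , drop = FALSE], 2, prod)
    newly <- !visited & (runif(length(x)) < p)
    # Stop when too few new cells are reached; the rest stay NA
    if (sum(newly) < minimum.cells.flooded) break
    visited[newly] <- TRUE
    step.visited[newly] <- step
  }
  step.visited
}

# Made-up 1-D "expression" values: two groups separated by a large gap,
# standing in for two timepoints with very different gene expression.
set.seed(1)
x <- c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 12, sd = 1))
for (s in c(1, 3, 6)) {
  steps <- toy_flood(x, root = 1, sigma = s)
  cat("sigma =", s, ": visited", sum(!is.na(steps)), "of", length(x), "cells\n")
}
```

In this toy, passing minimum.cells.flooded = 0 disables the early stop, so the loop continues until every cell is visited or max.steps is reached, which is why that setting can be slow when some cells are very weakly connected.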

Based on the plots you posted, it looks like you have a few discrete timepoints with big differences in gene expression between them. My guess is that all those cells that didn't get visited comprise a particular timepoint or sample? In that case, you might find that increasing sigma enables you to connect the disparate data points to each other, but possibly at the cost of too much connectivity within samples/timepoints, such that walks can bleed between trajectories.

One command that can help you see whether that's happening is plotDists(urd, "pseudotime", "stage", plot.title="Pseudotime by stage"). You'll have to replace "stage" with some group ID that's useful for you, but it will show whether adjacent stages overlap in pseudotime or are totally discrete. You can also look at the connections from the diffusion map on your tSNE by using plotDim with the transitions.plot and transitions.alpha parameters; that might let you see whether you have lots of connectivity within stages but very sparse connections between stages.

Additionally, you might try using 'local' sigma for your diffusion map. That bases the sigma for each cell on the distance to its nearest neighbors, so it might help keep connectivity sparse within timepoints but increase it at the boundary between timepoints. If none of those things help, it may point to a situation where generating some intervening data is a good idea, or where it makes sense to use an analysis method that considers clusters at each stage and connects them to each other, rather than trying to build continuous trajectories (if your data is really not continuous because there are large gaps between the timepoints).
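The idea behind a 'local' sigma can be sketched in base R as follows. This is a conceptual illustration under stated assumptions, not what calcDM or destiny actually computes: each cell's sigma is taken to be the distance to its k-th nearest neighbor, the pairwise weight is assumed to be exp(-d^2 / (2 * sigma_i * sigma_j)), and local_sigma_weights is a hypothetical helper. In URD itself, the local option is selected through the sigma argument of calcDM; see ?calcDM for the exact argument name and accepted values.

```r
# Conceptual sketch of a locally scaled Gaussian kernel (not URD's or
# destiny's actual implementation).
local_sigma_weights <- function(x, k = 3) {
  d <- as.matrix(dist(x))
  # Sorted row: element 1 is the cell itself (distance 0), so the k-th
  # nearest neighbor sits at position k + 1.
  sigma <- apply(d, 1, function(row) sort(row)[k + 1])
  # Each pair of cells is scaled by the product of the two local sigmas
  w <- exp(-d^2 / (2 * outer(sigma, sigma)))
  diag(w) <- 0
  list(sigma = sigma, weights = w)
}

# Made-up 1-D data: a densely sampled group and a sparsely sampled group
x <- c(seq(0, 1, length.out = 20), seq(5, 15, length.out = 6))
res <- local_sigma_weights(x, k = 3)
print(round(range(res$sigma[1:20]), 2))   # local sigmas in the dense group
print(round(range(res$sigma[21:26]), 2))  # local sigmas in the sparse group
```

Cells in the densely sampled group get a small sigma, so they only link to close neighbors, while cells in the sparsely sampled group get a larger sigma and can link across bigger distances.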

MichaelPeibo commented 5 years ago

Hi @farrellja Thanks for your very detailed reply! Your guess is correct--

all those cells that didn't get visited comprise a particular timepoint or sample

They are indeed mostly from the terminal stage. For now, I have removed the terminal-stage cells to carry on with the analysis.
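A quick way to confirm this kind of pattern is to tabulate the fraction of unassigned cells per stage. The sketch below uses a made-up data frame; with a real URD object, the pseudotime and stage columns would have to be pulled from wherever they are stored in the object, which is left as an assumption here.

```r
# Made-up example: pseudotime with NA for unvisited cells, plus a stage label
cells <- data.frame(
  stage = c("early", "early", "mid", "mid", "late", "late", "late"),
  pseudotime = c(0.05, 0.10, 0.40, 0.55, NA, NA, 0.90)
)

# Fraction of cells in each stage that never received a pseudotime
na.frac <- tapply(is.na(cells$pseudotime), cells$stage, mean)
print(round(na.frac, 2))

# Keep only the cells that were assigned a pseudotime
assigned <- cells[!is.na(cells$pseudotime), ]
```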