Closed MichaelPeibo closed 5 years ago
Update: results for sigma=28.
Now ~98% of cells have been visited. Still, I am wondering whether there are any drawbacks to simply increasing the sigma value so that most cells get visited.
Hi Michael,
The weight of links between cells for pseudotime & random walk calculations is determined based on their distance (in terms of gene expression), then passed through a Gaussian function (whose width is determined by sigma). Thus, increasing sigma allows cells that are more dissimilar from each other in terms of gene expression to remain connected. The pseudotime calculation function uses those links as a biased coin flip, and it will terminate if it stops visiting new cells. That's why, at smaller sigmas, some of your cells do not get visited. You can override this behavior by setting minimum.cells.flooded=0 in the floodPseudotime command. However, you may find that the function takes a very long time to run (since it will now keep going until every cell is visited, even if that cell is very poorly connected and hard to get to). Since cell connections are based on their distance in gene expression, the best sigma value is highly dependent on the data set. For instance, it will be affected by how many variable genes are considered, how complex the transcriptomes are, and simply how dramatically gene expression changes during your developmental process.
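To make the effect of sigma concrete, here is a toy sketch in base R. It is not URD's internal code: it assumes a standard Gaussian kernel of the form exp(-d^2 / (2 * sigma^2)), which may differ in normalization from what the package actually computes. The helper name gaussian_weight is hypothetical, and the two distances (10 within a timepoint, 40 across a gap between timepoints) are made-up values chosen only for illustration.

```r
# Toy illustration only (not URD's implementation): Gaussian kernel weight
# between two cells as a function of their gene-expression distance d.
gaussian_weight <- function(d, sigma) exp(-d^2 / (2 * sigma^2))

# Hypothetical distances: two cells in the same timepoint vs. two cells
# separated by a gap between timepoints.
d <- c(within = 10, across = 40)

# Evaluate the weights at the sigma values discussed in this thread.
sigmas <- c(16, 20, 28)
weights <- sapply(sigmas, function(s) gaussian_weight(d, s))
colnames(weights) <- paste0("sigma=", sigmas)
print(signif(weights, 3))

# Relative strength of the across-gap link compared to the within-timepoint link.
print(signif(weights["across", ] / weights["within", ], 3))
```

In this toy setup, raising sigma strengthens the across-gap link much more, in relative terms, than the within-timepoint link. That is consistent with more cells becoming reachable at larger sigma, and it also hints at why connectivity can become too dense once sigma is large.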
Based on the plots you posted, it looks like you have a few discrete timepoints with big differences in gene expression between them. My guess is that all those cells that didn't get visited comprise a particular timepoint or sample? In that case, you might find that increasing sigma enables you to connect the disparate data points to each other, but it may be at the cost of having too much connectivity within samples/timepoints, such that walks can bleed between trajectories.
One command that can help you see if that's happening is plotDists(urd, "pseudotime", "stage", plot.title="Pseudotime by stage"). You'll have to replace "stage" with some group ID that's useful for you, but it will help you see whether adjacent stages have some overlap in pseudotime, or whether they are totally discrete. You can also look at the connections from the diffusion map on your tSNE by using plotDim with the transitions.plot and transitions.alpha parameters. That might let you see whether you have lots of connectivity within stages, with very sparse connections between stages. Additionally, you might try using 'local' sigma for your diffusion map. That will base the sigma for each cell on the distance to its nearest neighbors, so it might help keep connectivity sparse within timepoints but increase it at the boundary between timepoints.
If none of those things help, it may point to a situation where generating some intervening data is a good idea, or using an analysis method that considers clusters at each stage and connects them to each other, rather than trying to build continuous trajectories (if your data is really not continuous because there are large gaps between the timepoints).
Hi @farrellja, thanks for your very detailed reply! Your guess is correct: "all those cells that didn't get visited comprise a particular timepoint or sample". They are indeed mostly in the terminal stage. For now, I am removing the terminal-stage cells to carry on with the analysis.
Hi URD team, Thanks for providing such a great analysis tool!
I am running into some trouble with my own data set. After calcDM(urd, knn = 100, sigma=16) and floodPseudotime(urd, root.cells=root.cells, n=150, minimum.cells.flooded=2, verbose=T), some cells are not assigned pseudotime values, especially at a terminal stage (see below). I checked the run log, and 80% of cells are visited (14597 in total). When I increase the sigma value from 16 to 20, about 90% of cells are visited. I do not yet have the final results for sigma=20 (this step takes a long time), but it looks like increasing the sigma value increases the percentage of cells visited. However, are there any drawbacks to only increasing the sigma value?
Even though I have read this detailed issue, I guess my issue is related to poor pseudotime calculation rather than to not enough random walk simulations (see below).