More than 70% of Cells Not Getting Walked

@stephenchea

It's true that if some populations that exist are not selected as tips, portions of the data (i.e. the progenitors of those cell types) might not be walked. But it sounds like you've looked into this. Earlier choices in the pipeline that could affect this include: major batch effects that lead to poor linkages between some samples; a variable gene list that includes many genes that are not biologically important, which might cause cells to connect primarily according to technical features rather than cell type features; a group of cells that creates major short circuits in the data (e.g. a recurrent population of cell doublets, or a strong cell state signature like a DNA damage response that connects many cell types); or the wrong choice of sigma in the diffusion map, creating portions with no or low connectivity.

My guess is that at this point, it would be productive to spend some time looking at the visitation paths from some of your tips and figuring out which cells are not getting visited, or whether there are commonalities in the walks that shouldn't exist. This will help you troubleshoot URD, and it may also inform the caveats and strengths/weaknesses of your data set. If you have a dimensionality reduction calculated (tSNE, for instance, or you could copy UMAP projection coordinates into the object@tsne.y slot), you can inspect where the walks are going using plotDim(..., label="visitfreq.log.3"), where 3 refers to tip number 3. This plots, for each cell, the log of its visitation by walks from tip 3. You can also use plotDim with the transitions.plot and transitions.df parameters to visualize diffusion map connectivity on your projection, which could show whether there are regions that are not connected to each other and therefore cannot be walked.
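To complement the plots, the same question ("which cells are never visited, and do they have something in common?") can be asked numerically. The following is only a minimal sketch in base R on a toy table: the data frame `visits`, its column names, and the `stage` vector are hypothetical stand-ins, not URD output, and this thread does not state how the per-tip visitation values are best extracted from the object.

```r
# Toy stand-in for per-tip visitation counts (rows = cells, columns = tips).
# All names and values here are hypothetical, not produced by URD.
visits <- data.frame(
  tip1 = c(120, 0, 35, 0, 0, 0),
  tip2 = c(0, 0, 80, 4, 0, 0),
  tip3 = c(15, 0, 0, 60, 0, 0),
  row.names = paste0("cell", 1:6)
)
# Hypothetical per-cell metadata, in the same order as the rows of `visits`.
stage <- c("early", "early", "mid", "mid", "late", "late")

# A cell counts as walked if walks from at least one tip visited it.
walked <- rowSums(visits > 0) > 0

# Overall fraction of cells that no walk ever reached.
frac_unwalked <- mean(!walked)

# Cross-tabulate by stage to see whether unwalked cells pile up in one sample.
unwalked_by_stage <- table(stage = stage, walked = walked)

# Number of cells reached from each tip; a tip that reaches very few cells
# may be worth inspecting first.
cells_per_tip <- colSums(visits > 0)

print(frac_unwalked)
print(unwalked_by_stage)
print(cells_per_tip)
```

If the unwalked cells concentrate in one stage or sample, that points toward the batch or timepoint-spacing explanations rather than a tip-selection problem.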

You might see, for example, that there's a group of cells from a later stage that most walks pass through (and thereby end up avoiding a bunch of the data). In that case it would be instructive to figure out what's up with them (particularly high or low library complexity, a specific signature of cell cycle or another cellular response, doublet signatures, etc). You might see that some portions are not connected. If this falls along sample/stage lines, it could indicate a batch issue (that might be solved by batch correction or by excluding batch-related genes from the variable gene list), or potentially that the timepoints are too far apart (such that changes in gene expression between timepoints are larger than changes in gene expression between different cell types). Or you might see some other issues that suggest a solution or further discussion.
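Whether disconnected portions fall along sample/stage lines can also be checked without a plot. The sketch below is an assumption-laden illustration in base R: `edges` is a hypothetical edge list between cells (for example, pairs of cells with a non-negligible transition probability), and `cells` and `stage` are made-up metadata. It labels connected components with a breadth-first search and tabulates them against stage.

```r
# Hypothetical undirected edge list between cells; values are illustrative only.
edges <- data.frame(
  from = c("a", "b", "d", "e"),
  to   = c("b", "c", "e", "f"),
  stringsAsFactors = FALSE
)
cells <- c("a", "b", "c", "d", "e", "f", "g")
stage <- c(a = "early", b = "early", c = "early",
           d = "late", e = "late", f = "late", g = "late")

# Component label per cell; NA means "not assigned yet".
component <- setNames(rep(NA_integer_, length(cells)), cells)
next_id <- 0L

for (start in cells) {
  if (!is.na(component[[start]])) next   # already reached from an earlier start
  next_id <- next_id + 1L
  component[[start]] <- next_id
  queue <- start
  # Breadth-first search over the edge list, treating edges as undirected.
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    neighbors <- unique(c(edges$to[edges$from == current],
                          edges$from[edges$to == current]))
    unseen <- neighbors[is.na(component[neighbors])]
    component[unseen] <- next_id
    queue <- c(queue, unseen)
  }
}

# Components that contain cells from only one stage hint at a batch or
# timepoint-spacing problem rather than a biological separation.
components_by_stage <- table(stage = stage[cells], component = component)
print(components_by_stage)
```

In this toy case the early cells form one component and the late cells form two others (one of them a single isolated cell), which is the pattern that would suggest looking at batch effects or the spacing between timepoints.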

farrellja / URD

More than 70% of Cells Not Getting Walked #42