rmathieu25 opened this issue 2 years ago
Hi Remi, Thank you very much for your interest! Yes, the computational time is a problem. Here are my suggestions:
Best, Dongyuan
Thank you very much for your quick answer.
It would be great if you could have an updated version in a few days!
Thank you again.
Best
Hi again!
Just to follow up on this: without taking the uncertainty into account, it took 12.5 h to run.
Regarding your previous message, for filtering based on fixed p-values, you meant fix.pvalue < 0.01, right?
Best
Hi, I have just updated the package. The fix.pvalue route (not taking uncertainty into account) should now be much faster than before.
I also added a new parameter, usebam, which might be faster for large sample sizes (but it won't be faster for small sample sizes). If you would like to try it, I would greatly appreciate it. Also, please use nb, since zinb is much slower and not useful in most cases (UMI data).
Back to your question: yes, if the computational time is still a problem, my suggestion is to use fix.pv < 0.05 to select some genes and then compute para.pv for those genes to get more reliable p-values. Another thing is that for genes with too many zeros (e.g., > 90%), the model will converge poorly. It would be better to filter out genes that are almost all zeros.
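As a rough illustration of that filter, here is a minimal base R sketch on a genes-by-cells count matrix (the object name counts and the 90% cutoff are just placeholders for this example):

```r
# counts: genes-by-cells raw count matrix (rows = genes, columns = cells)
zero_frac <- rowMeans(counts == 0)          # fraction of zero counts per gene

# keep genes with fewer than 90% zeros (i.e., expressed in >= 10% of cells)
keep_genes <- rownames(counts)[zero_frac < 0.9]
length(keep_genes)                          # how many genes survive the filter
```

The surviving genes can then be tested in two passes, as suggested above: a fast run without subsamples to get fix.pv, then a second run with subsamples (sub.tbl), restricted to the genes with fix.pv < 0.05, to obtain more reliable p-values.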
Thanks!
Best, Dongyuan
Sorry to unearth this a year later, but it feels pertinent to my query. Namely, I've got a dataset even larger than what was brought into the issue, with 20,000+ cells and 20,000+ genes. I think that this is quite representative of scenarios where people would like to use the tool, as it's common to have large datasets these days. In fact, I can see wanting to use this on even more cells!
Based on the discussion here, along with information in the tutorial, I'd go with log-transformed counts and model = "gaussian", and not pass sub.tbl. However, there's some mention of features in development in this issue, so I figured I'd check what the recommended best practice for handling larger input is now. Would you recommend any other options? Thanks, and sorry for the trouble.
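For completeness, here is roughly what I had in mind (a sketch only; counts is assumed to be a plain genes-by-cells count matrix, and the gene.vec/ori.tbl argument names are taken from the tutorial, so please correct me if they have changed):

```r
# Sketch of the setup described above: log-transformed counts, gaussian model,
# and no subsample table (pseudotime uncertainty ignored).
library(PseudotimeDE)

log_mat <- log1p(counts)   # log-transform the raw counts

res <- runPseudotimeDE(
  gene.vec = rownames(log_mat),   # genes to test
  ori.tbl  = ori.tbl,             # tibble of cells and their fitted pseudotime
  sub.tbl  = NULL,                # skip the subsample-based uncertainty step
  mat      = log_mat,
  model    = "gaussian"
)
```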
Hi Krzysztof, Thank you for your question! I agree with you that datasets nowadays are larger. I will definitely refine the package later. Here are some current solutions (see the sketch after this list):

1. Set usebam = TRUE in runPseudotimeDE if you have > 10,000 cells. This is a faster version of each gene's model fitting. More details can be found here: https://www.rdocumentation.org/packages/mgcv/versions/1.8-41/topics/bam. The speed is often ~5 times faster.
2. Set sub.tbl = NULL. This will ignore the pseudotime uncertainty. If you have a very large dataset, conceptually your pseudotime uncertainty should be lower (though this is not guaranteed, since different pseudotime inference algorithms have different properties). This reduces the computational time the most, since you no longer run permutation tests.
3. Set mc.cores as large as your computer can afford; mc.cores = 10 will be 10 times faster than mc.cores = 1.
4. model = "gaussian" can be faster than nb, but the speed-up is less significant.
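Putting these together, a minimal sketch of such a call might look like the following (the gene.vec, ori.tbl, and mat argument names follow the tutorial; adapt them to your own objects and treat this as an illustration rather than the exact API):

```r
# Faster configuration for a large dataset, combining the points above.
library(PseudotimeDE)

res <- runPseudotimeDE(
  gene.vec = rownames(mat),   # or a pre-filtered subset of genes
  ori.tbl  = ori.tbl,         # original pseudotime table (cell, pseudotime)
  sub.tbl  = NULL,            # no subsamples: pseudotime uncertainty ignored
  mat      = mat,             # expression matrix (or SingleCellExperiment)
  model    = "nb",            # "gaussian" on log counts is faster still; avoid "zinb"
  usebam   = TRUE,            # mgcv::bam backend, helpful for > 10,000 cells
  mc.cores = 10               # parallelize across genes
)

# With sub.tbl = NULL, the p-values to use are the fixed-pseudotime ones (fix.pv).
```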
Best regards, Dongyuan
Hi!
Thank you for this great package.
Even using 100 subsamples, it takes a very long time (more than 30 h at the moment, and it is still running) to run the function runPseudotimeDE with about 2,000 cells and 5,000 genes on 18 cores (and I have many more trajectories to test).
Therefore, I was wondering: in your opinion, to reduce the computational time, is it better to ignore the pseudotime uncertainty and just take the fix.pvalue? Or is it still better to take the uncertainty into account with a very small number of subsamples (10, 5, or even 2)?
Thank you very much in advance.