SONGDONGYUAN1994 / PseudotimeDE

A robust DE test method that accounts for the uncertainty in pseudotime inference
MIT License

Reduce computation time with a very small number of subsamples #10

Open · rmathieu25 opened this issue 2 years ago

rmathieu25 commented 2 years ago

Hi!

Thank you for this great package.

Even with 100 subsamples, it takes a very long time (more than 30 h so far, and it is still running) to run the function runPseudotimeDE on about 2,000 cells and 5,000 genes with 18 cores (and I have many more trajectories to test).

Therefore I was wondering: to reduce the computational time, is it better, in your opinion, to ignore the pseudotime uncertainty and just take the fix.p-value, or is it still better to take the uncertainty into account with a very small number of subsamples (10, 5, or even 2)?

Thank you very much in advance.

SONGDONGYUAN1994 commented 2 years ago

Hi Remi, Thank you very much for your interest! Yes, the computational time is a problem. Here are my suggestions:

  1. Yes, you can use the fixed p-value. It will cause some problems in FDR control, but it usually will not alter the ranking of significance. For instance, if you want to select the top 500 genes, using fixed p-values should give results similar to those from the correct p-values. You can also do some filtering based on fixed p-values (e.g., remove genes with fixed p-value > 0.1); see the short sketch after this list.
  2. A very small number of subsamples will not work in the current version, but we are trying to apply our p-value-free FDR control method Clipper to this idea. If you would like to wait a few days, we can provide an updated version.
  3. We are also working on a faster regression model; we may update it in the coming days. Thank you again for your patience!
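
As a concrete example of the filtering in point 1, here is a minimal sketch, assuming res is the result tibble from runPseudotimeDE and contains a fix.pv column (the exact output columns may differ between versions):

```r
library(dplyr)

## res: result tibble from runPseudotimeDE; fix.pv is the fixed p-value column.
top_genes <- res %>%
  filter(fix.pv <= 0.1) %>%   # drop clearly non-significant genes first
  arrange(fix.pv) %>%         # rank the remaining genes by fixed p-value
  slice_head(n = 500)         # e.g., keep the top 500 genes
```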

Best, Dongyuan

rmathieu25 commented 2 years ago

Thank you very much for your quick answer.

It would be great if you can have an updated version in a few days!

Thank you again.

Best

rmathieu25 commented 2 years ago

Hi again!

Just to follow up on this: without taking the uncertainty into account, it took 12.5 h to run.

Regarding your previous message, for filtering based on fixed p-values, you meant fix.p-value < 0.01, right?

Best

SONGDONGYUAN1994 commented 2 years ago

Hi, I have just updated the package. Computing the fixed p-value (fix.pv) without taking the uncertainty into account should now be much faster than before.

I also added a new parameter, usebam, which might be faster for large sample sizes (it will not help for small sample sizes). If you would like to try it, I would greatly appreciate it. Also, please use nb rather than zinb, since zinb is much slower and not useful for most cases (UMI data).

Back to your question: yes, if the computational time is still a problem, my suggestion is to use fix.pv < 0.05 to select a set of genes and then fit para.pv on that subset to get more reliable p-values. Another thing is that for genes with too many zeros (e.g., > 90%), the model will converge poorly, so it would be better to filter out genes that are almost all zeros. A rough sketch of this two-pass workflow is below.
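
Something like the following; it is only an illustration, so please check the argument and column names against your installed version. Here sce stands in for your SingleCellExperiment with raw counts, and ori_tbl and sub_tbl for your pseudotime tibble and list of subsample tibbles:

```r
library(PseudotimeDE)
library(SingleCellExperiment)

## 0. Filter out genes with too many zeros (e.g., > 90% zero counts).
zero_frac <- Matrix::rowMeans(counts(sce) == 0)   # works for dense or sparse counts
genes_use <- rownames(sce)[zero_frac <= 0.9]

## 1. Fast pass: no subsamples, so only the fixed p-value (fix.pv) is computed.
res_fix <- runPseudotimeDE(
  gene.vec = genes_use,
  ori.tbl  = ori_tbl,   # tibble of cells and pseudotime from the original data
  sub.tbl  = NULL,      # ignore pseudotime uncertainty in this pass
  mat      = sce,
  model    = "nb",      # nb, not zinb, for UMI data
  usebam   = TRUE,      # the new, faster fitting option
  mc.cores = 18
)

## 2. Second pass: only the pre-selected genes, with subsamples, to get para.pv.
genes_sel <- res_fix$gene[res_fix$fix.pv < 0.05]  # gene column name may vary by version
res_para <- runPseudotimeDE(
  gene.vec = genes_sel,
  ori.tbl  = ori_tbl,
  sub.tbl  = sub_tbl,   # list of tibbles from the pseudotime subsamples
  mat      = sce,
  model    = "nb",
  usebam   = TRUE,
  mc.cores = 18
)
```

The second pass then only runs the subsampling on the pre-selected genes, which should cut the running time considerably.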

Thanks!

Best, Dongyuan

ktpolanski commented 1 year ago

Sorry to unearth this a year later, but it feels pertinent to my query. Namely, I've got a dataset even larger than what was brought up in this issue, with 20,000+ cells and 20,000+ genes. I think this is quite representative of scenarios where people would like to use the tool, as large datasets are common these days. In fact, I can see wanting to use this on even more cells!

Based on the discussion here, along with information in the tutorial, I'd go with log-transformed counts and model="gaussian", along with not passing sub.tbl. However, there's some mention of features in development in this issue, so I figured I'd check what the recommended best practice for handling larger input would be now. Would you recommend any other options? Thanks, and sorry for the trouble.

SONGDONGYUAN1994 commented 1 year ago

Hi Krzysztof, Thank you for your question! I agree that datasets nowadays are larger, and I will definitely refine the package later. Here are some current solutions:

  1. Set usebam = TRUE in runPseudotimeDE if you have > 10,000 cells. This uses a faster fitting routine for each gene's model; more details are here: https://www.rdocumentation.org/packages/mgcv/versions/1.8-41/topics/bam. It is often ~5 times faster.
  2. Set sub.tbl = NULL. This ignores the pseudotime uncertainty. If you have a very large dataset, conceptually your pseudotime uncertainty should be lower (though this is not guaranteed, since different pseudotime inference algorithms have different properties). This reduces the computational time the most, since the permutation tests are no longer needed.
  3. Set mc.cores as large as your machine can afford; mc.cores = 10 will be roughly 10 times faster than mc.cores = 1.
  4. Using gaussian can be faster than nb, but the speedup is less significant.
  5. Again, gene filtering is a practical step for reducing the computational burden. Usually, among your 20,000+ genes, at least a few thousand are very lowly expressed (e.g., > 95% zeros). That does not mean they are non-DE; it means that regression models may not have enough power to test them. A sketch combining points 1-5 follows this list.
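
Roughly like this; the same caveats as in my earlier comment apply (argument names follow the tutorial, sce and ori_tbl stand in for your own objects, and I assume the expression values being modeled are log-transformed when using model = "gaussian"):

```r
library(PseudotimeDE)
library(SingleCellExperiment)

## Point 5: drop genes that are almost entirely zero (e.g., > 95% zeros).
zero_frac <- Matrix::rowMeans(counts(sce) == 0)   # works for dense or sparse counts
genes_use <- rownames(sce)[zero_frac <= 0.95]

res <- runPseudotimeDE(
  gene.vec = genes_use,
  ori.tbl  = ori_tbl,       # pseudotime inferred from the full dataset
  sub.tbl  = NULL,          # point 2: skip the subsample-based uncertainty
  mat      = sce,           # make sure the values modeled are log-transformed for gaussian
  model    = "gaussian",    # point 4: gaussian on log counts is faster than nb
  usebam   = TRUE,          # point 1: mgcv::bam-based fitting, often ~5x faster
  mc.cores = 10             # point 3: scale with the cores you can afford
)
```

If you later want more reliable p-values for a shortlist of genes, you can rerun with sub.tbl on that subset, as described in my earlier comment.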

Best regards, Dongyuan