Best Practice for imputing big data-sets

KlugerLab / ALRA

Imputation method for scRNA-seq based on low-rank approximation

MIT License


Best Practice for imputing big data-sets #1

Closed: AMA111 closed this issue 5 years ago

AMA111 commented 5 years ago

Hey ALRA team, I would like to ask for your recommendations on imputing a big dataset (~250k cells, 10x RNA-seq data). I am using Seurat v3, and I would like to know at which step I should start the imputation, both in the standard workflow and in the new Seurat v3 integration workflow. Also, do you recommend certain parameters for such a big dataset? Finally, do you have an estimate of the computation time and resource requirements for the imputation step?

Thanks in advance for your time.

Best, Abdelrahman

linqiaozhi commented 5 years ago

Wow, apologies for the delay; I'm not sure how this issue escaped us.

I have not tried it on such a large dataset, but ALRA should scale nicely. The most time-consuming step is the randomized SVD, which is itself very fast. I would be happy to hear whether you ended up trying it.
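To make the scaling remark concrete, here is a minimal sketch of a generic randomized SVD (random projection, a few power iterations, then an exact SVD of a small matrix). It is written in Python with NumPy purely for illustration and is not ALRA's own code; the function name `randomized_svd` and the parameters `n_oversamples` and `n_power_iter` are assumptions made for this example.

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_power_iter=2, seed=0):
    """Approximate rank-k SVD of A (hypothetical helper, not ALRA's API)."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]

    # Project A onto a small random subspace of dimension k + n_oversamples.
    omega = rng.standard_normal((n_cols, k + n_oversamples))
    Q, _ = np.linalg.qr(A @ omega)

    # Power iterations sharpen the subspace; QR keeps it numerically stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # The exact SVD is only computed on the small matrix B.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]

# Usage on a synthetic matrix of exact rank 8.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 200))
U, s, Vt = randomized_svd(A, k=8)
A_k = (U * s) @ Vt  # rank-8 reconstruction
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
```

The expensive operations are products of the full matrix with a thin matrix of `k + n_oversamples` columns, so the cost grows roughly linearly with the number of cells for a fixed `k`.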

As for parameters, ALRA's only parameter is k, which is actually chosen automatically. So I would go with what it chooses, unless you get unfavorable results.

As for using it with Seurat, you would typically run NormalizeData() first.

ghost commented 8 months ago

@AMA111 Hello, AMA111. I, too, work with large datasets in my analyses. I've encountered an issue related to sparseMatrix. Have you faced a similar challenge by any chance?

[Error occurred] Error in .m2sparse(from, paste0(kind, "g", repr), NULL, NULL): attempt to construct sparseMatrix with more than 2^31-1 nonzero entries
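The limit in the error message can be checked with simple arithmetic. The sketch below assumes the worst case, in which every entry of the matrix is nonzero, and uses a hypothetical gene count of 20,000 together with the ~250k cells mentioned at the top of this thread. The helper names are made up for this illustration, and the actual cause of the error above may differ.

```python
# Largest nonzero count allowed according to the error message: 2^31 - 1.
MAX_NONZERO = 2**31 - 1

def fits_in_sparse_matrix(n_cells: int, n_genes: int) -> bool:
    """True if a fully dense n_cells x n_genes matrix stays within the limit."""
    return n_cells * n_genes <= MAX_NONZERO

def max_cells_per_chunk(n_genes: int) -> int:
    """Largest number of cells whose fully dense block stays within the limit."""
    return MAX_NONZERO // n_genes

n_cells = 250_000  # dataset size from the original question
n_genes = 20_000   # hypothetical gene count, an assumption for this example

whole_matrix_ok = fits_in_sparse_matrix(n_cells, n_genes)
chunk_size = max_cells_per_chunk(n_genes)
```

Under these assumptions the full matrix would hold 5,000,000,000 entries, well above 2,147,483,647, while a block of at most `chunk_size` cells would stay below the limit.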