General questions about this package

wflynny commented 6 years ago

I was reading through the preprint and a few questions came up. Any answers would be appreciated.

[ ] Can this package operate on datasets with more than 1 million cells? There is this 1.3M cell dataset available from 10X.
[ ] The clustering benchmarks don't seem to include the popular community detection algorithms used by Seurat, 10X CellRanger, or SCANPY. Any reason why?
[ ] Speaking of clustering benchmarks, where did the idea of using DBSCAN on a t-SNE projection come from? I've experimented with it extensively myself [example], but I haven't seen it anywhere in the literature except for allusions to it from the Kluger lab at Yale.
[ ] The timings seem pretty slow (figure 1 shows timings for 3000 cells). Have you done a head to head comparison time wise between Seurat, SCANPY, and this package?
[ ] Speaking of Seurat and SCANPY, I find the fact that they aren't mentioned anywhere in the manuscript strange.
[ ] You use boosted trees for marker gene identification. How does it compare to the myriad one-vs-rest tests (Wilcoxon, t-test, AUROC, etc.) or standard logistic regression as suggested by Lior Pachter?

Thanks!

logstar commented 6 years ago

@wflynny Thank you for your questions. I will try to answer them.

Can this package operate on datasets with more than 1 million cells? There is this 1.3M cell dataset available from 10X.

Depends on your objective:

Clustering: No for t-SNE based clustering. No for raw data based clustering. Maybe for UMAP based clustering, depending on whether UMAP scales to the 1.3M cell dataset on your machine.
Filtering and imputation: No, not currently. We are going to provide approximate nearest neighbor implementations, which would scale to the 1.3M cell dataset.
Identifying cluster separating genes: Yes, because it uses XGBoost.
Visualization: Yes for scatter plots and read count heatmaps. No for pairwise distance heatmaps, because they take too much memory.
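
As a rough illustration of the memory point above, the size of a dense pairwise distance matrix can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes one 8-byte float per cell pair and a full square matrix; the function name is hypothetical, and the package's actual storage may differ.

```python
def pairwise_distance_bytes(n_cells, bytes_per_value=8):
    """Bytes needed for a dense n_cells x n_cells distance matrix.

    Assumes one value of bytes_per_value bytes per cell pair (8 = float64).
    """
    return n_cells * n_cells * bytes_per_value


for n_cells in (3_000, 1_300_000):
    gigabytes = pairwise_distance_bytes(n_cells) / 1e9
    print(f"{n_cells:,} cells: {gigabytes:,.3f} GB")
```

Under these assumptions, 3,000 cells need about 0.072 GB, while 1.3M cells need about 13,520 GB, which is why the pairwise distance heatmap does not scale even though scatter plots do.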

The clustering benchmarks don't seem to include the popular community detection algorithms used by Seurat, 10X CellRanger, or SCANPY. Any reason why?

SC3 has already been comprehensively compared to Seurat, so I decided not to repeat that work. As for 10x CellRanger and SCANPY, I decided not to include them because they did not provide benchmarks against other methods.

Speaking of clustering benchmarks, where did the idea of using DBSCAN on a t-SNE projection come from? I've experimented with it extensively myself [example], but I haven't seen it anywhere in the literature except for allusions to it from the Kluger lab at Yale.

It's from E. Z. Macosko et al., “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets,” Cell, vol. 161, no. 5, pp. 1202–1214, May 2015. In their supplemental subsection "Density Clustering to Identify Cell-Types":

To identify putative cell types on the tSNE map, we used a density clustering approach implemented in the DBSCAN R package (Ester et al., 1996), initially setting the reachability distance parameter (eps) to 1.0, and removing clusters less than 20 cells, then setting eps to 1.9, and removing clusters less than 50 cells.
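
The building block described in that passage, density clustering on 2D coordinates followed by removal of small clusters, can be sketched as below. This is a minimal pure-Python illustration, not the code used by Macosko et al. or by this package. The function names, the `min_samples` value, and the toy coordinates are assumptions, and the sketch does not show how the two passes (eps 1.0 and eps 1.9) are combined, because the quoted text does not specify it. In practice a library implementation such as scikit-learn's `DBSCAN` would be the usual choice.

```python
import math
from collections import Counter

NOISE = -1


def dbscan(points, eps, min_samples=5):
    """Minimal O(n^2) DBSCAN on 2D points; returns one label per point."""
    n = len(points)
    # eps-neighborhood of every point (a point is its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def drop_small_clusters(labels, min_cells):
    """Relabel clusters with fewer than min_cells members as noise."""
    sizes = Counter(label for label in labels if label != NOISE)
    return [
        label if label != NOISE and sizes[label] >= min_cells else NOISE
        for label in labels
    ]


# Toy stand-in for a t-SNE embedding: a 25-point grid plus a 3-point group.
points = [(0.5 * x, 0.5 * y) for x in range(5) for y in range(5)]
points += [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]

raw = dbscan(points, eps=1.0, min_samples=3)
kept = drop_small_clusters(raw, min_cells=20)
print(Counter(raw), Counter(kept))
```

With these toy inputs the 3-point group first forms its own cluster and is then relabeled as noise by the size filter, while the 25-point grid is kept.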

Thank you for the reference to the interesting paper by the Kluger lab.

The timings seem pretty slow (figure 1 shows timings for 3000 cells). Have you done a head to head comparison time wise between Seurat, SCANPY, and this package?

No. Whether application packages, like Scedar/Seurat/SCANPY, are slow or fast depends on the underlying implementations provided by library packages, like numpy/matplotlib/scikit-learn. Therefore, simply comparing the speed of two application packages is actually comparing the underlying library packages. In the paper, we benchmarked runtime to approximately compare the asymptotic time complexity of different algorithms.
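
One common way to read an approximate time complexity off a runtime benchmark is to fit the slope of log(runtime) against log(number of cells). The sketch below illustrates that idea only; it is not the procedure used in the paper, the function name is hypothetical, and the runtimes are synthetic values generated from a quadratic formula rather than measurements.

```python
import math


def scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) versus log(size).

    A slope near 1 suggests roughly linear scaling, near 2 roughly quadratic.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


sizes = [500, 1000, 2000, 4000]
# Synthetic runtimes from t = 1e-6 * n^2, made up for illustration only.
synthetic_runtimes = [1e-6 * n ** 2 for n in sizes]
print(round(scaling_exponent(sizes, synthetic_runtimes), 3))
```

An exponent estimated this way depends much less on constant factors from the underlying libraries than a single wall-clock comparison does.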

I agree it is not very fast on 3,000 cells. If you would like to speed up your analysis, try lowering the number of t-SNE iterations and turning off optimal ordering in MIRAC.

Speaking of Seurat and SCANPY, I find the fact that they aren't mentioned anywhere in the manuscript strange.

We added references to them in our revision on bioRxiv, as suggested by SCANPY's author Alex Wolf. I had noticed and looked into the update of Seurat and the publication of SCANPY for scalable scRNA-seq data analysis, but I forgot to add references to them during the revision of our draft.

You use boosted trees for marker gene identification. How does it compare to the myriad one-vs-rest tests (Wilcoxon, t-test, AUROC, etc.) or standard logistic regression as suggested by Lior Pachter?

We have not made these comparisons, because we do not intend to surpass the various statistical methods for finding marker genes. We developed the boosted tree method for data exploration rather than statistical testing.
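
For reference, one of the one-vs-rest scores mentioned in the question, AUROC, can be computed per gene as in the sketch below. This is a generic illustration of that score and is not part of this package; the function name and the example expression values are hypothetical.

```python
def one_vs_rest_auroc(in_cluster, rest):
    """AUROC of one gene for separating a cluster from all other cells.

    Equals the Mann-Whitney U statistic divided by the number of
    (in-cluster, rest) pairs; ties count as half a win.
    """
    wins = 0.0
    for a in in_cluster:
        for b in rest:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(in_cluster) * len(rest))


# Made-up expression values of one gene in a cluster and in all other cells.
print(one_vs_rest_auroc([3, 4], [1, 3]))
```

A score near 1 means the gene is consistently higher inside the cluster, and a score near 0.5 means it does not separate the cluster from the rest.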

I will close this issue for now, but feel free to follow up if you have further questions.

wflynny commented 6 years ago

@logstar Thanks for the answers. Good luck with the review process. I look forward to reading the finished paper and trying out your package.

logstar commented 6 years ago

@wflynny Thank you again for your questions and interest. We are still actively working on extending this package. Once it fully scales to millions of cells, I will make a major release and provide a tutorial notebook on the 1.3M cell dataset you mentioned.

TaylorResearchLab / scedar