Open ismailelshimy opened 2 years ago
Hi Ismail,
Thank you for your interest in our simulator. Yes, scDesign2 can be used for benchmarking DE gene detection methods. Briefly, suppose we have real gene expression data from two cell types. We can first use scDesign2 to fit two sets of models, one for each cell type. Then we can tweak the parameters in these two sets of fitted models. For example, we can manually set the fitted mean parameters to be the same for some genes, while keeping them different for other genes. (You can also tweak other parameters if those are of interest.) Next, we can use scDesign2 to generate synthetic data from these parameter-adjusted models, and then run DE gene detection methods on the synthetic data to evaluate their performance against the known ground truth.
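To make the steps above concrete, here is a minimal sketch in plain Python (standard library only). It is not the scDesign2 API: scDesign2 is an R package, and the method-of-moments negative binomial fit below is only a simple stand-in for its model fitting, with each gene treated independently. All function names (`fit_nb`, `sample_nb`, `equalize_means`, `error_rates`), the toy "real" data, and the threshold-based DE caller are assumptions made for illustration.

```python
import math
import random
from statistics import mean, variance


def fit_nb(counts):
    """Method-of-moments negative binomial fit for one gene: (mean, size)."""
    m = mean(counts)
    v = variance(counts)
    # NB variance is m + m^2 / size; treat non-overdispersed genes as near-Poisson.
    size = m * m / (v - m) if v > m else 1e6
    return m, size


def sample_nb(m, size, n, rng):
    """Draw n counts from NB(mean=m, size) as a gamma-Poisson mixture."""
    if m == 0:
        return [0] * n
    draws = []
    for _ in range(n):
        lam = rng.gammavariate(size, m / size)
        # Knuth's Poisson sampler; adequate only for the small means used here.
        limit, k, p = math.exp(-lam), 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        draws.append(k)
    return draws


def equalize_means(params_a, params_b, null_genes):
    """Copy of params_b in which null genes take group A's fitted mean."""
    return {
        g: (params_a[g][0], size) if g in null_genes else (m, size)
        for g, (m, size) in params_b.items()
    }


def error_rates(called_de, true_de, all_genes):
    """Empirical type I and type II error rates for a set of DE calls."""
    null = set(all_genes) - set(true_de)
    false_pos = len(set(called_de) & null)
    false_neg = len(set(true_de) - set(called_de))
    return false_pos / len(null), false_neg / len(true_de)


rng = random.Random(0)
genes = ["g1", "g2", "g3", "g4"]

# Toy stand-in for real data from two groups (hypothetical means).
real_a = {g: sample_nb(mu, 2.0, 200, rng) for g, mu in zip(genes, [2, 5, 1, 8])}
real_b = {g: sample_nb(mu, 2.0, 200, rng) for g, mu in zip(genes, [6, 5, 4, 8])}

# Step 1: fit one set of per-gene models for each group.
params_a = {g: fit_nb(real_a[g]) for g in genes}
params_b = {g: fit_nb(real_b[g]) for g in genes}

# Step 2: force equal means for the genes chosen to be non-DE (ground truth).
null_genes = {"g2", "g4"}
true_de = set(genes) - null_genes
params_b_adj = equalize_means(params_a, params_b, null_genes)

# Step 3: simulate synthetic data from the parameter-adjusted models.
synth_a = {g: sample_nb(*params_a[g], 200, rng) for g in genes}
synth_b = {g: sample_nb(*params_b_adj[g], 200, rng) for g in genes}

# Step 4: run a DE method (here a toy log-fold-change threshold) and score it.
called = {
    g for g in genes
    if abs(math.log2((mean(synth_b[g]) + 1) / (mean(synth_a[g]) + 1))) > 0.5
}
type1, type2 = error_rates(called, true_de, genes)
print("type I error rate:", type1, "type II error rate:", type2)
```

In a real benchmark, the toy caller in step 4 would be replaced by the DE methods being compared, and the fit and simulation steps would be done with scDesign2 itself. Only the mean is equalized here, so a null gene can still differ between groups in dispersion.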
Above is the general idea, and I'm actually paraphrasing an analysis we performed in this paper. You can look at the benchmarking in the section "Input data: observed vs. imputed vs. binarized counts". For more detail, you can check the code in the "Data simulation" folder of this accompanying Zenodo repository.
Best regards, Tianyi
Hi Tianyi,
Thank you very much for your very informative reply and for drawing my attention to the review "Statistics or biology: the zero-inflation controversy about scRNA-seq data" which I found very interesting to read.
I just want to point out that I am interested in performing DE analysis of genes in cells belonging to the same cluster from two groups of study subjects: cases and controls. I assume that it is possible to use the same procedure you kindly suggested in this context, i.e., fit two sets of models to the two populations of cells (belonging to a single cluster), cases vs. controls, instead of fitting models to two distinct cell types. Am I correct?
Also, since I am not a statistician and I do not have a strong computational background, I have a question for you: how would you recommend that I deal with the problem of pseudoreplication, i.e., cells coming from different patients? Some analysts (https://www.nature.com/articles/s41467-021-21038-1) recommend using a mixed-effects model such as the MAST model (where patients are added as a random effect). I actually tried such a model to test for DE, but I got very few significant genes at an adjusted p-value of less than 0.05. Do you have other recommendations for dealing with this issue?
Thank you again for your time.
Best regards, Ismail
Hi Ismail, For your question, "perform DE analysis of genes in cells belonging to the same cluster from two groups of study subjects: cases and controls," you may try our new version, scDesign3. Please check our tutorial: https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3-conditionEffect-vignette.html . The manuscript is still in preparation, and we plan to upload it to bioRxiv this week. Let me know if you have any questions!
Best, Dongyuan
@ismailelshimy You are right. You can replace the two distinct cell types with cases vs. controls. I'm not familiar with the pseudoreplication problem, so I'm afraid I can't help you with that. You may want to consult people with more experience in single-cell DE analysis.
@SONGDONGYUAN1994 Sorry for my late reply. Thank you very much for your kind reply; I will check scDesign3 and come back to you if I have questions.
@sunty17 I see, thanks again for your kind help. I really appreciate it!
Sure! Let me know if you have any questions. Our bioRxiv version is: https://www.biorxiv.org/content/10.1101/2022.09.20.508796v2
Hello JSB-UCLA,
I have a question for you. I am looking for synthetic scRNA-seq data that resembles my own scRNA-seq data sets and that I could use to test the performance of two differential gene expression methods. I was wondering whether synthetic data generated by scDesign2 can be used for this purpose. More specifically, I want to compare the two methods in terms of their type I and type II error rates using a synthetic data set in which the ground truth is known.
You mentioned in your paper that:
ScDesign2 should retain every gene’s distribution of expression levels in its synthetic data without deleting genes in real data. This property is essential for benchmarking differential gene expression analysis.
scDesign2 can assist differential gene expression analysis. Its estimated marginal distributions of individual genes in different cell types can be used to investigate more general patterns of differential expression (such as different variances and different zero proportions), in addition to comparing gene expression means between two groups of cells.
However, I am not sure how to perform this benchmarking in practice using the package. I checked the vignette and could not find relevant information about this particular application.
If you could kindly elaborate on how I can perform this comparison using scDesign2, I would be really grateful!
Thank you very much for your time and help.
Best regards, Ismail