proper input for marker selection

pchiang5 commented 2 years ago

Hi Constantin,

When I would like to get the markers for each cluster in a scRNA-seq dataset, I tried 2 ways:

treating non-8 cluster as a group in glm_gp()

fit2 <- glm_gp(t(adata$X), design = ~ adata$obs$batch2 + (adata$obs$leiden == '8'))
de_res2 = test_de(fit2, contrast = `adata$obs$leiden == "8"TRUE`)

and the output was:

name pval adj_pval f_statistic df1 df2 lfc 1 A1BG 9.083840e-01 9.443240e-01 1.324322e-02 1 14007.96 -0.19294410 2 A1CF 4.922579e-01 6.659670e-01 4.716120e-01 1 14007.96 0.61182880 3 A2M 6.559798e-26 2.372595e-24 1.112389e+02 1 14007.96 3.38479526 4 A2ML1 9.528825e-01 9.735806e-01 3.491446e-03 1 14007.96 0.04592482 5 A3GALT2 6.607379e-01 7.841244e-01 1.926367e-01 1 14007.96 -32.55924521 6 A4GALT 2.402577e-04 1.314017e-03 1.349370e+01 1 14007.96 2.01775275 7 A4GNT 7.185802e-01 8.225581e-01 1.298624e-01 1 14007.96 0.75398461 8 AAAS 8.593856e-01 9.144625e-01 3.138554e-02 1 14007.96 -0.04804371 9 AACS 1.722583e-01 3.358798e-01 1.863370e+00 1 14007.96 -0.25751422 10 AADAC 5.411227e-01 7.032795e-01 3.734812e-01 1 14007.96 -37.17579965 11 AADACL2 4.919869e-01 6.658566e-01 4.722028e-01 1 14007.96 1.30339179 12 AADACL3 2.822849e-01 4.712254e-01 1.156146e+00 1 14007.96 -1.40612943 13 AADACL4 2.640774e-01 4.506751e-01 1.247355e+00 1 14007.96 1.58819251 14 AADAT 4.133496e-01 6.013788e-01 6.691857e-01 1 14007.96 -0.38910189 15 AAGAB 1.161384e-01 2.520346e-01 2.468946e+00 1 14007.96 0.22126741 16 AAK1 3.768931e-15 7.216806e-14 6.195588e+01 1 14007.96 0.84736921 17 AAMDC 3.412921e-01 5.328369e-01 9.056332e-01 1 14007.96 -0.36246608 18 AAMP 4.683749e-02 1.248763e-01 3.951915e+00 1 14007.96 -0.52769448 19 AANAT 5.366256e-01 6.999187e-01 3.818486e-01 1 14007.96 -0.52204354 20 AAR2 5.863448e-01 7.359875e-01 2.961025e-01 1 14007.96 0.14654954

the other was treating non-target(8) clusters individually in the glm_gp() step:

fit <- glm_gp(t(adata$X), design = ~ adata$obs$batch2 + adata$obs$leiden)
de_res = test_de(fit, contrast = `adata$obs$leiden8`)

the output was:

name pval adj_pval f_statistic df1 df2 lfc 1 A1BG 8.427675e-01 9.401642e-01 3.934607e-02 1 14007.25 -0.34293252 2 A1CF 2.504479e-02 8.752234e-02 5.021866e+00 1 14007.25 2.13935983 3 A2M 2.110166e-21 9.158640e-20 9.053395e+01 1 14007.25 4.27889600 4 A2ML1 5.859041e-01 7.751609e-01 2.968002e-01 1 14007.25 0.48178161 5 A3GALT2 9.997667e-01 1.000000e+00 8.552766e-08 1 14007.25 -0.17163815 6 A4GALT 1.263855e-06 1.337511e-05 2.349799e+01 1 14007.25 3.62761987 7 A4GNT 4.908065e-01 7.069898e-01 4.747828e-01 1 14007.25 1.84253091 8 AAAS 6.669497e-01 8.321722e-01 1.851987e-01 1 14007.25 0.13071683 9 AACS 5.923612e-01 7.796723e-01 2.866851e-01 1 14007.25 0.11325749 10 AADAC 9.999056e-01 1.000000e+00 1.400734e-08 1 14007.25 -1.49534952 11 AADACL2 6.956023e-01 8.524604e-01 1.530937e-01 1 14007.25 0.85935584 12 AADACL3 2.501211e-01 4.736049e-01 1.322737e+00 1 14007.25 -1.61088927 13 AADACL4 8.342623e-03 3.564575e-02 6.960385e+00 1 14007.25 Inf 14 AADAT 2.168300e-01 4.317719e-01 1.525366e+00 1 14007.25 -0.61582857 15 AAGAB 2.974100e-01 5.263028e-01 1.085852e+00 1 14007.25 0.15954452 16 AAK1 2.910548e-10 4.877179e-09 3.979156e+01 1 14007.25 0.74130901 17 AAMDC 3.376498e-01 5.692153e-01 9.193974e-01 1 14007.25 -0.39961874 18 AAMP 3.935166e-03 1.886862e-02 8.316256e+00 1 14007.25 -0.80653633 19 AANAT 1.930534e-01 3.978769e-01 1.694324e+00 1 14007.25 -1.14436345 20 AAR2 9.684653e-01 1.000000e+00 1.562931e-03 1 14007.25 -0.01149323

I found the output was not identical. Which input would be a more reasonable/proper way for the purpose?

Thank you.

const-ae commented 2 years ago

Hi pchiang5,

as you have correctly described there are two approaches when doing differential gene analysis:

Adapt the model formula and make sure that you have only the two groups that you care about in the fit
Fit a model with all groups and then look at the specific contrast you care about.

For the second approach, though, you have to be careful because the coefficients can have different meanings depending on the intercept. Specifically, in your second example, contrast = 'adata$obs$leiden8' measures the difference between group 8 and whatever group was chosen to represent the intercept. This explains why you found that the results between the two approaches don't match.

Which input would be a more reasonable/proper way for the purpose?

In summary, although I usually prefer the second approach because it is more flexible, the current contrast that you are using is wrong for identifying marker genes in cluster 8.

To learn more about correctly forming a model design and choosing the appropriate contrast, I recommend to take a look at:

CoefExplainer and ExploreModelMatrix, two R packages for learning about different model formulas
F1000 paper by the people who developed limma called a guide to creating design matrices for gene expression experiments
The DESeq2 vignette which contains tons of examples how to perform differential expression analysis
Or if you want to learn all the intricacies, the book Statistical Design and Analysis of Biological Experiments contains even more explanations around this whole question.

Best, Constantin

pchiang5 commented 2 years ago

Thank you for your answer and references, Constantin.

const-ae / glmGamPoi

proper input for marker selection #34