glm_gp removes cell type from dataset

pakiessling commented 6 months ago

Hi,

thanks for the tool, it looks super powerful!

I am trying it out on my scRNA dataset, with the goal of finding marker genes for cell types.

sce <- pseudobulk(sce, group_by = vars(patient, cell_subtype))
fit <- glm_gp(sce,
    design = ~ patient + cell_subtype,
    size_factors = "deconvolution", on_disk = FALSE
)

The data had 29126 rows and 75 columns.
A model with 19 coefficient was fitted.
The design formula is: Y~patient + cell_subtype

Beta:
                Min 1st Qu.    Median 3rd Qu.    Max
Intercept -474406.5  -0.632  1.59e+00   3.576     13
patientP2     -50.2  -0.343 -7.17e-10   0.326 336693
patientP3     -49.4  -0.626 -1.34e-02   0.348 336694
...

deviance:
      Min 1st Qu. Median 3rd Qu.  Max
 3.66e-19    23.4     36    46.2 1148

overdispersion:
 Min 1st Qu. Median 3rd Qu.  Max
   0  0.0221 0.0706   0.227 57.6

Shrunken quasi-likelihood overdispersion:
   Min 1st Qu. Median 3rd Qu.  Max
 0.397   0.884      1    1.19 45.3

size_factors:
   Min 1st Qu. Median 3rd Qu.  Max
 0.021   0.329   1.29    4.69 21.5

Mu:
 Min 1st Qu. Median 3rd Qu.     Max
   0   0.123   3.89      43 9066659

The weird thing is that one cell type is consistently removed from fit$Beta, even though the pseudobulks of this cell type have a good amount of cells and replicates.

Any idea what could cause this?

Thanks a lot!

const-ae commented 6 months ago

Hey, thanks for reaching out.

The weird thing is that one cell type is consistently removed from fit$Beta

I assume you mean that when you look at the columns names of Beta, you get something like "Intercept", "patient2", "patient3", "celltype2", "celltype3", "celltype4", "celltype5", ..., correct?

This behavior is common for any package that works with a design matrix. We can quickly check this behavior:

df <- data.frame(patient = sample(paste0("pat_", 1:3), size = 30, replace = TRUE),
                 celltype = sample(paste0("ct", 1:5), size = 30, replace = TRUE))
design_matrix <- model.matrix(~ patient + celltype, data = df)
colnames(design_matrix)
#> [1] "(Intercept)"  "patientpat_2" "patientpat_3" "celltypect2"  "celltypect3" 
#> [6] "celltypect4"  "celltypect5"

^{Created on 2024-04-15 with reprex v2.1.0}

There is a nice in-depth article on design matrices for gene expression studies on F1000.

The trick is that your "celltype1" becomes the default level and all other coefficients describe how much the fit deviates from the condition where patient = "pat_1" and celltype = "ct_1". As this can take some time to get used to, I build a small package a few years ago to illustrate this: https://github.com/const-ae/CoefExplainer.

I hope this helps.

Best, Constantin

pakiessling commented 6 months ago

Ah that makes sense. Thanks for the in-depth explanation. I was trying to do something like this:

for (subtype in  rowData(sce)$cell_subtype) {
    print(paste("Processing cell subtype:", subtype))
    de_res <- test_de(fit, contrast = subtype)
    file_name <- paste0("de_bulk_", subtype, ".csv")
    write.table(de_res, file_name, sep = ",")
}

to do a one vs rest comparison for every cell type and it always crashed when looping over a certain cell type. Should probably be something like:


contrast <- makeContrasts(cell_subtypeA - (cell_subtypeB + cell_subtypeC + cell_subtypeD))
de_res <- test_de(fit, contrast = contrast )

instead, if I understand your link right . And do that for every comparison? But I still don't get how to do that for cell_subtypeEthat has been chosen as default 😅

Ty!

const-ae commented 6 months ago

The second approach is already very close. Take a look at the cond function, it automatically creates the contrast vector for each condition. So you would need to write, something like

de_res <- test_de(fit, contrast = cond(cell_type = "A") - (cond(cell_type = "B") + cond(cell_type = "C) + cond(cell_type = "D") + cond(cell_type = "E")) / 5)
de_res <- test_de(fit, contrast = cond(cell_type = "B") - (cond(cell_type = "A") + cond(cell_type = "C) + cond(cell_type = "D") + cond(cell_type = "E")) / 5)
...
de_res <- test_de(fit, contrast = cond(cell_type = "E") - (cond(cell_type = "A") + cond(cell_type = "B") + cond(cell_type = "C) + cond(cell_type = "D")) / 5)

pakiessling commented 6 months ago

Thank you so much for educating me! Highly appreciate it!

const-ae / glmGamPoi

glm_gp removes cell type from dataset #61