Questions about pseudobulk_sce

f6v commented 1 year ago

Thanks for developing the package!

I've got a couple of questions regarding the pseudobulk_sce function. My dataset has 14904 genes and 453154 cells. I want to fit a model with the following design:

design = ~ cell_type_L1 + dataset + cell_type_L1:age_group - 1

So, I create pseudobulk profiles like so:

aggr_sce <- pseudobulk_sce(sce, group_by = vars(sample_name, cell_type_L1, age_group, dataset))
aggr_sce

To clarify, counts for each sample_name and cell_type_L1 combination should be aggregated, but I also want to keep age_group and dataset as covariates.

My questions are:

Does the variable order inside vars make any difference? Do I specify the group_by param correctly?
Is pseudobulk_sce supposed to take a long time? It takes > 2 hrs to create the pseudobulk profiles.

Thanks!

const-ae commented 1 year ago

Hi,

thanks for giving glmGamPoi a try.

Does the variable order inside vars make any difference? Do I specify the group_by param correctly?

No, it doesn't matter. It only affects the order of the columns in the resulting colData(aggr_sce).
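
As a rough illustration of why the order is irrelevant, here is a base-R sketch on toy data. It is not how glmGamPoi implements the aggregation; the helper names aggregate_counts and swap_parts and the toy matrix are made up for this example. Cells are grouped by the combination of their labels, so swapping the variables only changes how a group is named, not which cells are summed together.

```r
# Toy data: 3 genes x 6 cells, with two grouping variables per cell.
set.seed(1)
counts <- matrix(rpois(18, lambda = 5), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6)))
sample_name  <- c("s1", "s1", "s1", "s2", "s2", "s2")
cell_type_L1 <- c("T", "B", "T", "B", "T", "B")

# Hypothetical helper: sum the counts of all cells that share a label.
# rowsum() aggregates rows, so transpose to cells x genes and back.
aggregate_counts <- function(counts, labels) {
  t(rowsum(t(counts), group = labels))
}

# Build the group label with the variables in either order.
pb_ab <- aggregate_counts(counts, paste(sample_name, cell_type_L1, sep = "|"))
pb_ba <- aggregate_counts(counts, paste(cell_type_L1, sample_name, sep = "|"))

# Hypothetical helper: turn "T|s1" back into "s1|T" so names are comparable.
swap_parts <- function(x) {
  vapply(strsplit(x, "|", fixed = TRUE),
         function(p) paste(rev(p), collapse = "|"), character(1))
}
colnames(pb_ba) <- swap_parts(colnames(pb_ba))

# After sorting the columns, both orders give the same pseudobulk profiles.
same_profiles <- identical(pb_ab[, sort(colnames(pb_ab))],
                           pb_ba[, sort(colnames(pb_ba))])
```

Here same_profiles compares the two results after putting their columns in a common order.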

Is pseudobulk_sce supposed to take a long time? It takes > 2 hrs to create the pseudobulk profiles.

That seems surprisingly long. I wonder if you are somehow creating many more pseudobulk samples than intended. You can check how many pseudobulk samples are created by running:

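library(dplyr)  # provides %>%, as_tibble(), group_by(), summarize(), n()

# Count how many cells fall into each pseudobulk sample.
colData(sce) %>%
  as_tibble() %>%
  group_by(sample_name, dataset, cell_type_L1, age_group) %>%
  summarize(n_cells = n())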

Each row of the resulting tibble is a unique combination of the four covariates, and n_cells tells you how many cells are combined for that specific pseudobulk sample.

f6v commented 1 year ago

Thanks @const-ae!

I looked deeper into it, and it turned out the counts were stored as a dgTMatrix rather than a dgCMatrix, which made everything too slow.
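
For anyone hitting the same problem, here is a minimal sketch of the conversion using the Matrix package. The toy matrix below stands in for the real counts; only the class check and the as() call carry over.

```r
library(Matrix)

# Toy count matrix in triplet format (dgTMatrix): 3 genes x 4 cells.
counts_t <- sparseMatrix(i = c(1, 3, 2, 1), j = c(1, 1, 3, 4),
                         x = c(5, 2, 7, 1), dims = c(3, 4),
                         repr = "T")
class(counts_t)

# Convert to compressed sparse column format (dgCMatrix).
counts_c <- as(counts_t, "CsparseMatrix")
class(counts_c)
```

Assuming the counts live in the counts assay of the SingleCellExperiment, the same conversion would look like counts(sce) <- as(counts(sce), "CsparseMatrix"), run before creating the pseudobulk profiles.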

const-ae commented 1 year ago

Ah, great that you found a way to solve the problem :)

const-ae / glmGamPoi

Questions about pseudobulk_sce #46