const-ae / glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably

Use for pseudobulk differential expression - advantages & logFC values #22

Open Al-Murphy opened 3 years ago

Al-Murphy commented 3 years ago

Hi,

Thank you for your very useful package. I have two questions regarding its use for pseudobulk differential expression analysis.

Firstly, could you outline the reasons why you think your model is better for pseudobulk than alternatives like a manual pseudobulk step followed by edgeR/DESeq2, given that glmGamPoi's main use seems to be for non-pseudobulk analysis?

Secondly, I have noticed strange logFC values when performing pseudobulk differential expression analysis on an Alzheimer's disease dataset split into 6 cell types. The dataset has approximately 50 samples, resulting in >50k cells after quality control. The logFC values can be seen in this volcano plot:

[Image: volcano plot of the logFC values]

The logFC values for all cell types appear to be split into three groups and do not look as I would expect in a volcano plot. Have you seen logFC values like this before? I have attached the table of this data for just cell type "A" (to keep the size down): DE_analysis_odd_logFC_values.txt

const-ae commented 3 years ago

Hey Alan,

Thanks for your kind words, and please excuse the delay.

> Firstly, could you outline the reasons why you think your model is better for pseudobulk than alternatives like a manual pseudobulk step followed by edgeR/DESeq2, given that glmGamPoi's main use seems to be for non-pseudobulk analysis?

Conceptually none, really. It is more of a convenience that you can use the same interface for pseudobulk and non-pseudobulk questions.

> The logFC values for all cell types appear to be split into three groups and do not look as I would expect in a volcano plot. Have you seen logFC values like this before?

Yes, I am aware of this pattern in volcano plots. The easy fix is to set all LFCs whose absolute value is above, let's say, 15 to Inf (keeping the sign). You will then see the familiar volcano pattern, which is what the center column shows.
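
As a minimal sketch of that fix (written in plain Python for illustration; it is not glmGamPoi code, and `clamp_lfc` and the example values are hypothetical), the thresholding could look like this. The cutoff of 15 is just the value suggested above and may need adjusting for other data:

```python
import math


def clamp_lfc(lfc, threshold=15.0):
    """Map LFCs with |lfc| > threshold to +/-inf, leave others unchanged.

    The default threshold of 15 is an assumption taken from the
    suggestion in this thread, not a value enforced by any package.
    """
    if math.isnan(lfc):
        return lfc  # keep missing values as they are
    if abs(lfc) > threshold:
        return math.copysign(math.inf, lfc)  # preserve the sign
    return lfc


# Hypothetical LFC values: the two extremes mimic the left/right columns
# of the volcano plot, the rest mimic the center column.
lfcs = [-21.3, -1.2, 0.4, 2.8, 19.7]
clamped = [clamp_lfc(x) for x in lfcs]
print(clamped)
```

Most plotting libraries drop or clip infinite values, so after this step only the central group remains visible unless the infinite points are drawn separately.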

The underlying issue is that the LFCs in the left and right columns come from comparisons where all counts are 0 in one group. Technically, the LFC in this case is infinite. However, for convergence reasons, the algorithm returns a large finite LFC of around 20. I have considered introducing a threshold that automatically sets large LFC values to infinity, but have so far shied away from it because I worry that it would only create more confusion.
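
To illustrate why an all-zero group leads to such values, here is a small sketch in plain Python. It treats the LFC as the log2 ratio of two group means, which is a simplification and not the fitting procedure glmGamPoi uses. The function `log2_fold_change`, the counts, and the factor `2 ** -20` are made up for illustration:

```python
import math


def log2_fold_change(mean_ref, mean_test):
    """log2(mean_test / mean_ref) with explicit handling of zero means."""
    if mean_ref == 0 and mean_test == 0:
        return math.nan  # no information in either group
    if mean_ref == 0:
        return math.inf
    if mean_test == 0:
        return -math.inf
    return math.log2(mean_test / mean_ref)


# Hypothetical pseudobulk counts for one gene: all zero in the reference group.
ref_counts = [0, 0, 0]
test_counts = [3, 5, 4]
mean_ref = sum(ref_counts) / len(ref_counts)
mean_test = sum(test_counts) / len(test_counts)

# The exact ratio of means is unbounded.
exact = log2_fold_change(mean_ref, mean_test)

# Assumption: an iterative fit stops once the fitted mean of the zero
# group is tiny but still positive, here 2**-20 times the other mean.
fitted_ref = mean_test * 2 ** -20
approx = log2_fold_change(fitted_ref, mean_test)

print(exact, approx)
```

Under these assumptions the exact value is infinite, while the stopped fit reports a finite value of about 20, which matches the magnitude seen in the outer columns of the plot.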