const-ae / glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably

determining the less trustworthy log2fc values #27

Open ceesu opened 3 years ago

ceesu commented 3 years ago

Hello, thanks very much for your package. I just want to follow up on this point from the vignette:

The large lfc values come from groups where nearly all counts are 0

It seems that, depending on my design, the threshold that separates the "three groups" of log2fc values can be as small as 5. I also got the warnings “encountered non-positive size factor estimates” and “singular gradient” when I was running glm_gp for the fit; I don't know if they are related. I'm assuming these are still "large lfc values" even though they are < 20. Is there a better way you could recommend to separate out the genes with less trustworthy log2fc values than by looking visually?

const-ae commented 3 years ago

Hi Cathy,

that is a fair question. If you could provide a reproducible example, I am happy to discuss the specifics of the issues that you encountered. But I will try to give some pointers which are hopefully already useful:

I also got the warnings “encountered non-positive size factor estimates”

This is a warning generated by scran::computeSumFactors. It might suggest that you have a wide range for the number of reads assigned to each cell. Do you do some quality control to remove poor quality cells?

I also got the warnings [...] and “singular gradient” when I was running glm_gp for the fit

That warning is interesting, as I am not sure where it is coming from. Here, I would need a reproducible example to say more.

Is there a better way you could recommend to separate out the genes with less trustworthy log2fc values than by looking visually?

In my opinion, the p-value associated with a log2fc is still the best measure to understand how credible a certain change is. By default, the p-value is calculated with a likelihood ratio test. However, you might also be interested in this earlier discussion about using the standard error associated with each coefficient fit as an alternative. For more details see https://github.com/const-ae/glmGamPoi/issues/12.

Best, Constantin

ceesu commented 3 years ago

Sorry for this late response. I performed some filtering which may have dealt with the errors of “encountered non-positive size factor estimates” and “singular gradient” for now.

However, I am actually thinking about a case such as #22 because my plots show a similar distribution, and in that case the p-value is not always useful as a filter. In that issue it's suggested to do something such as set all LFC above 15 to Inf. However I've found sometimes the threshold as determined by eye is smaller than 15. Do you have any suggestions for how I can discard lfc values from the two extremes of this 'pattern' systematically without looking by eye?

Thanks!

const-ae commented 3 years ago

Hi Cathy,

thanks for reaching out again and for your feedback :)

However, I am actually thinking about a case such as #22 because my plots show a similar distribution, and in that case the p-value is not always useful as a filter. In that issue it's suggested to do something such as set all LFC above 15 to Inf

Can you explain a bit more why the p-values are not a good filter? Note that the recommendation to change LFC > 15 to Inf is just for plotting. It uses the trick that ggplot automatically plots values with infinity on the boundary of the plot, which makes the plot look nicer.
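The replacement step behind that plotting trick can be sketched in a few lines. This is only an illustration in Python: the function name `clamp_lfc_for_plotting` and the default limit of 15 are assumptions taken from the discussion, not part of glmGamPoi, and the result is meant for display only, not for filtering or testing.

```python
import math


def clamp_lfc_for_plotting(lfc_values, limit=15.0):
    """Replace log fold changes beyond +/- limit with signed infinity.

    Hypothetical helper for display purposes only. Values whose absolute
    value is at most `limit` are returned unchanged.
    """
    clamped = []
    for value in lfc_values:
        if abs(value) > limit:
            # Keep the sign so up- and down-regulated genes stay separated.
            clamped.append(math.copysign(math.inf, value))
        else:
            clamped.append(value)
    return clamped


example = clamp_lfc_for_plotting([0.5, -20.0, 16.0, 15.0])
```

The threshold is a free parameter here, which is exactly the weakness raised in this thread: a fixed cutoff such as 15 may not match the gap that is visible in a particular data set.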

However I've found sometimes the threshold as determined by eye is smaller than 15. Do you have any suggestions for how I can discard lfc values from the two extremes of this 'pattern' systematically without looking by eye?

Good question. Unfortunately, not really right now. The cause of the extreme LFC is that the parameter estimation algorithm converges to an extreme value if one of the groups consists of only zeros and the other group has non-zero counts. One option would be to specifically filter for such cases, but that can get quite complicated for more complex models.
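For the simplest case of a single grouping factor, such a filter could look roughly like the sketch below. It is written in plain Python for illustration; the function name `flag_all_zero_groups` and the input layout (a dict of per-gene count lists plus a list of group labels) are assumptions, not anything provided by glmGamPoi. It does not cover more complex designs with several covariates, which is where the filtering gets complicated.

```python
from collections import defaultdict


def flag_all_zero_groups(counts, groups):
    """Find genes where some, but not all, groups contain only zeros.

    counts: dict mapping gene name -> list of counts, one per sample
    groups: list of group labels in the same sample order
    Returns a dict mapping gene name -> set of all-zero group labels
    (empty set if the gene is not affected).
    """
    flagged = {}
    for gene, row in counts.items():
        totals = defaultdict(int)
        for label, value in zip(groups, row):
            totals[label] += value
        zero_groups = {label for label, total in totals.items() if total == 0}
        # Only the mixed case (zeros in one group, counts in another)
        # produces the extreme LFC estimates described above.
        if 0 < len(zero_groups) < len(totals):
            flagged[gene] = zero_groups
        else:
            flagged[gene] = set()
    return flagged


groups = ["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"]
counts = {
    "geneA": [0, 0, 0, 5, 7, 3],  # all zeros in ctrl only
    "geneB": [1, 2, 0, 4, 0, 6],  # counts in both groups
    "geneC": [0, 0, 0, 0, 0, 0],  # all zeros everywhere
}
flags = flag_all_zero_groups(counts, groups)
```

Genes with a non-empty flag set could then be reported separately instead of being ranked by their LFC. A looser variant might flag groups that are "nearly all zero" rather than exactly zero, but the cutoff for that would again be a judgment call.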

Best, Constantin

ceesu commented 3 years ago

Thanks for your reply!

Can you explain a bit more why the p-values are not a good filter? Note that the recommendation to change LFC > 15 to Inf is just for plotting. It uses the trick that ggplot automatically plots values with infinity on the boundary of the plot, which makes the plot look nicer.

My thinking is that for some of these genes, because the counts are much smaller in one group, the lfc might not be trustworthy even if the p-value is very small (which I am seeing sometimes). I guess this should be partly dealt with by filtering, but as you mention, it's complicated to perform this filtering in a way that accounts for multiple types of groups.