a little question of G stat

huangli2924 commented 4 years ago

hi, I am so sorry for trouble you ! In script QTLseqr/R/G_functions.R when stat G obs <- c(LowRef, HighRef, LowAlt, HighAlt) G <-2 (rowSums(obs log(matrix(obs, ncol = 4) / matrix(exp, ncol = 4) )

so my question is "if the element of obs be equal to zero, how to stat log(0)?"

bmansfeld commented 4 years ago

Hi, This is an issue with the G statistic Im not sure was ever resolved. The equation is taken directly from Magwene et al., 2011. So it's not really my code error or something. However, I think that in general unless you have amazingly perfect bulking, this rarely happens in real data. And in any case you are smoothing over many SNPs and so they all shouldn't be perfectly zero. I've thought of adding 0.5 or something to the obs if they are zero but I didn't want to change the method. Let me know your thoughts. Ben

huangli2924 commented 4 years ago

hi,Ben Thanks for your so fast reply, And thank you for your code. Many of my friends are using it. It's very practical! I just want to discuss and learning with you. If I understand correctly, "LowRef" represents the "REF_depth" of a SNP site in the low pool. If any element in obs is equal to zer0 ,such as LowRef=0 , log (0) will occur; This is my unserstand of the formula ,I am not sure, Am I right? Thanks

bmansfeld commented 4 years ago

Thanks for the comments :-) I hope the package is useful to you all! This is a good discussion and I've been meaning to raise it with the authors of the original paper (Magwene et al., 2011) Here's two screenshots from the paper 1) with the equation for the G statisitc: 2) With the contingency table for n_i 1 -> 4

And my code for G stat: https://github.com/bmansfeld/QTLseqr/blob/5e761379a805b65038c415c8d3ce7aa61abe89dc/R/G_functions.R#L29-L44

The function takes in the allele depth values for the above table (ie n_i) and first calculates a vector of the expected values for the denominator (roughly half of the read depth). It then takes the observed values for each of n_i and puts it in the numerator. Thus if any n_i is exactly zero. log0 will occur as you say.

Again as far as I can see this is an imperfection with how Magwene et al implement the G statistic. Unless my code or interpretation of their formula is terribly wrong (in which case please let me know!)

I do however stand by my previous comments that due to Poisson noise in sequencing and imperfect bulking this is unlikely to happen in large chunks of SNPs. So that these NAN values eventually get ignored during the smoothing process of G.

There are options for zero-substitution procedures but after playing with some of them I found them to affect G in ways that depended on read depth etc. and so I opted to not implement them.

-Ben

huangli2924 commented 4 years ago

OK, you are such a responsible guy, ha ha I understand. Thank you very much.

bmansfeld commented 4 years ago

Great! no problem. I will leave the issue open in case anyone wants to contribute.

bmansfeld / QTLseqr

a little question of G stat #29