'beta' & 'SE' works but 'OR' & 'p-value' does not, how to preprocesss summary statistics (Neff)

char4816 commented 5 years ago

Thank you for making this software. I have been organizing summary stats to run through popcorn and have noticed that when I use 'beta' & 'SE', the fit returns some results (yay!). However, when I try to use the 'OR' & 'p-value' options as described in your documentation (which are the options I would prefer to use, given my summary stats files), the h2 outputs are always zero. For now I can use the 'beta' & 'SE' option, but I hope I can help you improve upon this issue so the software is even more adaptable to various sum stat formats.

A make-shift solution I found is using the equations from your odds_to_beta() function in sumstats.py to preprocess my files in R suppose you have a sumstat file with a header that is limited with the following following columns rs ref alt chr pos or pval sumstats <- read.table("orig_sumstats.txt", header=T, stringsAsFactor=F) sumstats$N <- 4/(1/nr.cases+1/nr.controls) sumstats$beta <- log(sumstats$or) sumstats$Z <- sign(sumstats$beta)*qnorm(1 - sumstats$pval/2) sumstats$SE <- sumstats$beta/sumstats$Z sumstats <- sumstats[,c("rs", "ref", "alt", "N", "beta", "SE")] colnames(asthmaAdult) <- c("SNP", "a1", "a2", "N", "beta", "SE") write.table(sumstats,"sumstats.txt", sep = "\t", col.names=T, row.names=F, quote=F)

Since many summary stats files do not have an "N" column, it can be difficult for users to distinguish the correct N to use. It doesn't always make sense to just sum nr.cases+nr.controls. Many people use the above equation for N when controls are in imbalance, for example:

The math behind the Neff calculation stems from the variance inflation factor (https://journals.sagepub.com/doi/pdf/10.1177/0163278703255230) between balanced and unbalanced studies. If you agree with me, you may want to advise users in your README to use this formula for N when it isn't provided by the summary stats and nr.cases != nr.controls

I hope this helps!

Thanks, Chris

char4816 commented 5 years ago

I thought my above solution would work, however I am still getting 0 results for the output after converting pvalue & or to beta & SE... I think this suggests that there may be an error in the odds_to_beta() function.

brielin commented 5 years ago

Thanks Chris, I'll look into this today and let you know what I come up with.

brielin commented 5 years ago

Quick comment: N_cases + N_controls is the value you want for N, not Neff. As noted in the link to the LDSC issue you posted, the conversion to the liability scale handles the imbalance between cases and controls.

char4816 commented 5 years ago

Thank you for your help and taking the time to collaborate on this. In the ldsc issue they describe how it is roughly equivalent to 1) use N=Cases+Controls and the appropriate --samp-prev or to 2) use Neff = 4/(1/nr.cases+1/nr.controls) with --samp-prev 0.5

I am interpreting that nr.cases+nr.controls is correct as long as the user specifies prevalence values for the disease so that conversion to liability scale is accurate. Am I correct in this interpretation of the fit variables? --K1 = "population prevelance of binary phenotype 1" (prevalence of phen1 in literature) --P1 = "study prevelance of binary phenotype 1" (nr.cases/(nr.cases+nr.controls) for sfile1) --K2 = "population prevelance of binary phenotype 2" (prevalence of phen2 in literature) --P2 = "study prevelance of binary phenotype 2" (nr.cases/(nr.cases+nr.controls) for sfile2)

brielin commented 5 years ago

Yes, exactly. I'm updating the readme in the next push to clarify. Still looking into the OR/beta issue.

char4816 commented 5 years ago

Thank you! I am realizing I should have said --P1 = "study prevelance of binary phenotype 1" (nr.cases/(nr.controls + nr.controls)for sfile1) --P2 = "study prevelance of binary phenotype 1" (nr.cases/(nr.controls + nr.controls)for sfile2) that way --P1 = 0.5 when nr.cases and nr.controls are balanced I'll edit the above comment to correct this

brielin commented 5 years ago

I'm struggling to find any issue in the odds ratio conversion. When you say you get the beta and SE version to work, is this the same trait? Always getting 0 tells me it could be a few things:

Something is going wrong in the conversion step and all of the Z-scores/betas are getting set to 0 or something weird. When you run the R code you have, is this happening? It doesn't seem to happen for me on the python side.
The heritability is actually very low and the MLE is 0. I just pushed an update which changes the default behavior to use regression instead, but on the older version you can use "--use_regression" to use a simpler model instead of MLE.

Otherwise, if you don't mind sharing your sumstats file (you can send me an email if you like) I can try to debug more directly.

char4816 commented 5 years ago

I have a good example of summary stats files and will follow up via email in a few minutes

brielin commented 5 years ago

Fixed in the latest update. Conversion was resulting in NaNs/infs that needed to be filtered.

brielin / Popcorn

'beta' & 'SE' works but 'OR' & 'p-value' does not, how to preprocesss summary statistics (Neff) #7