bulik / ldsc

LD Score Regression (LDSC)
GNU General Public License v3.0

Error converting summary statistics UKBiobank p-values #144

Open mattnev17 opened 5 years ago

mattnev17 commented 5 years ago

Hi, I've run into an issue converting UK Biobank summary stats, caused by extremely small p-values. I have a crude fix, but I was wondering whether you had already solved this in your own analyses.

Specifically, when using munge_sumstats.py, traits with p-values below the np.float64 minimum of ~1.0E-308 were either producing warnings of the form WARNING: 43 SNPs had P outside of (0,1]. The P column may be mislabeled. or failing completely with the error:

Reading list of SNPs for allele merge from ../header_w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from UKBB_409K/blood_PLATELET_COUNT.error.sumstats.cut into memory 5000000 SNPs at a time.
../testMunge.py:746: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  munge_sumstats(parser.parse_args(), p=True)
.
ERROR converting summary statistics:

Traceback (most recent call last):
  File "../testMunge.py", line 686, in munge_sumstats
    dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
  File "../testMunge.py", line 249, in parse_dat
    raise ValueError('Columns {} are expected to be numeric'.format(wrong_types))
ValueError: Columns ['P'] are expected to be numeric
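The failure comes down to float64 underflow: p-values below the representable range silently parse to 0.0 (or trip up the column parser). A quick stdlib check illustrates the limits (added for illustration; not part of the original report):

```python
import sys

# float64's smallest positive *normal* value is ~2.2e-308;
# subnormals extend the range down to ~5e-324, below which
# any parsed value silently underflows to 0.0.
print(sys.float_info.min)     # 2.2250738585072014e-308
print(float("5e-324"))        # smallest subnormal, still nonzero
print(float("1.0E-3739"))     # 0.0 -- underflows without any error
```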

I haven't had any luck trying to edit the python code but I was able to get around this by manually capping the p values at 1.0E-300.

    sed -i 's/[0-9]\.[0-9]E-[1-9][0-9][0-9][0-9]/1.0E-300/' UKBB_409K/${pheno}.sumstats
    sed -i 's/[0-9]\.[0-9]E-[3-9][0-9][0-9]/1.0E-300/' UKBB_409K/${pheno}.sumstats

I was wondering if you guys had already found a cleaner fix for this in your UKBiobank analyses?
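For anyone who prefers to stay in Python, the same cap can be applied while parsing each value; this is only a sketch (the function name and floor value are assumptions matching the sed commands above, not ldsc code):

```python
P_FLOOR = 1.0e-300  # same floor as the sed commands above

def cap_p(p_str: str) -> float:
    """Parse a p-value string, clamping anything below P_FLOOR.

    Values too small for float64 (below ~5e-324) parse to 0.0,
    so they get lifted back up to the floor as well.
    """
    p = float(p_str)  # "1.0E-3739" silently underflows to 0.0
    return max(p, P_FLOOR)

print(cap_p("1.0E-3739"))  # 1e-300
print(cap_p("0.05"))       # 0.05
```

Note that this also lifts exact zeros up to the floor, which may or may not be what you want, since munge_sumstats.py normally excludes SNPs with p = 0.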

Best, Matt

carbocation commented 4 years ago

Just ran into the same. Going to be hacking these down by an order of magnitude or so, I guess.

$ cut -f 10 bolt.filtered.tsv | sort -g | uniq -c | head
      1 P
      1 1.0E-3739
      1 8.9E-3696
      1 2.1E-2522
      1 9.0E-2499
      1 1.8E-2497
      1 4.8E-2468
      1 1.9E-2435
      1 2.9E-1902
      1 1.3E-1541

yk-tanigawa commented 4 years ago

We also ran into the same issue.

It seems the code uses the parsed p-value to compute the standardized beta via SciPy's inverse survival function for the chi-squared distribution, and I don't think scipy.stats.chi2.isf supports ultra-small p-values.
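For 1 degree of freedom, chi2.isf(p, 1) is equivalent to squaring the normal quantile of p/2, so the float64 limit can be illustrated with the standard library alone (a sketch for illustration, not the actual munge_sumstats.py code):

```python
from statistics import NormalDist

def chi2_isf_1df(p: float) -> float:
    """Chi-square statistic (1 df) for a two-sided p-value,
    using chi2 = z**2 with z = Phi^{-1}(p / 2)."""
    return NormalDist().inv_cdf(p / 2.0) ** 2

print(chi2_isf_1df(1e-300))  # roughly 1373: still representable
# Below ~1e-323, p / 2 underflows to exactly 0.0 and inv_cdf
# raises an error, analogous to the failures reported here.
```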

For now, I will also use Matt's fix, but I would love to hear if anyone finds a way to handle this properly.

Thank you very much!

kkellysci commented 2 years ago

I've created a pull request https://github.com/bulik/ldsc/pull/319 with some code to fix the issue. The affected SNPs will be excluded, but munge_sumstats.py will exclude them without crashing. I think excluding them makes sense, since SNPs with 0 p-values are already excluded, and these SNPs have p-values that are so small they are indistinguishable from 0 when using the float64 data type.

carbocation commented 2 years ago

> I've created a pull request #319 with some code to fix the issue. The affected SNPs will be excluded, but munge_sumstats.py will exclude them without crashing. I think excluding them makes sense, since SNPs with 0 p-values are already excluded, and these SNPs have p-values that are so small they are indistinguishable from 0 when using the float64 data type.

Wouldn't this lead to the removal of the most significant SNPs? In terms of trade-offs, my perspective is that it's better to under-estimate the effect (by faking a more modest P value) than to exclude these strongly associated SNPs.

I guess this means I'd prefer that this continue to crash, but perhaps the error message could mention that a p-value indistinguishable from 0 in float64 may be the reason for the crash.

kkellysci commented 2 years ago

@carbocation the way LD score regression works is to look at how a SNP's LD score (based on how many other SNPs that SNP tags) predicts its chi-square test statistic. Non-causal SNPs in LD with large numbers of other SNPs are more likely to happen to tag a causal SNP and have an inflated test statistic as a result. So it's the large numbers of SNPs with mildly inflated test statistics that make LD score regression work, not the small number that are extremely significant.

Extremely significant SNPs are usually excluded by LD Score regression because they have the potential to be highly-influential outliers (see chisq_max in ldscore/sumstats.py ). So it doesn't matter if munge_sumstats.py excludes them, because the regression would have excluded them anyway.

Even if you were to override that behavior and force LD Score regression to include highly-significant SNPs, you still wouldn't want to include SNPs with edited p-values. Picture a regression line on an X vs. Y scatter plot, where each dot is a SNP, X is the SNP's LD score, and Y is its chi-square test statistic from GWAS (remember, the lower the p-value the higher the chi-square). If the p-values of extremely significant SNPs were edited to make them less significant, that gives them a lower chi-square value than they're supposed to have. It would be like dragging some of the dots on the scatter plot downward, and could lead to inaccurate estimation of the slope and intercept of the line.
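The scatter-plot intuition above can be made concrete with a toy simulation (illustrative made-up numbers only, not real GWAS data): capping the largest chi-square values drags the fitted slope down.

```python
import random

random.seed(0)

# Toy LDSC-style model: E[chi2] = 1 + slope * (LD score).
true_slope, n = 0.05, 2000
ld = [random.uniform(1, 200) for _ in range(n)]
chi2 = [1 + true_slope * x + random.gauss(0, 0.5) for x in ld]

def ols_slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

# "Dragging dots downward": cap the most significant SNPs'
# statistics, as editing their p-values effectively does.
capped = [min(c, 5.0) for c in chi2]

print(ols_slope(ld, chi2))    # close to the true 0.05
print(ols_slope(ld, capped))  # noticeably attenuated
```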

carbocation commented 2 years ago

@kkellysci Thanks, that's well stated. Basically, if we were to modify the software to handle arbitrarily extreme P values, those SNPs would be rejected by the chisq_max step anyway.

DC-Jade commented 11 months ago

> I've created a pull request #319 with some code to fix the issue. The affected SNPs will be excluded, but munge_sumstats.py will exclude them without crashing. I think excluding them makes sense, since SNPs with 0 p-values are already excluded, and these SNPs have p-values that are so small they are indistinguishable from 0 when using the float64 data type.

Thanks a lot. I modified munge_sumstats.py manually following the pull request and then it worked, but the author hasn't merged it.

dqq0404 commented 5 months ago

@kkellysci Hi, I'm sorry to bother you, but the problem persists after applying your method. The error is:

ERROR converting summary statistics:

Traceback (most recent call last):
  File ".../munge_sumstats.py", line 694, in munge_sumstats
    dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
  File ".../munge_sumstats.py", line 257, in parse_dat
    raise ValueError('Columns {} are expected to be numeric'.format(wrong_types))
ValueError: Columns ['P'] are expected to be numeric

Conversion finished at Fri Mar 15 09:36:05 2024
Total time elapsed: 1.0m:11.91s
Traceback (most recent call last):
  File ".../munge_sumstats.py", line 754, in <module>
    munge_sumstats(parser.parse_args(), p=True)
  File ".../munge_sumstats.py", line 694, in munge_sumstats
    dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
  File ".../munge_sumstats.py", line 257, in parse_dat
    raise ValueError('Columns {} are expected to be numeric'.format(wrong_types))
ValueError: Columns ['P'] are expected to be numeric

I used R to extract the minimum p-value; the smallest is only about 1e-12.
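Since the smallest p-value here is around 1e-12, float64 underflow cannot be the cause; a stray non-numeric token in the P column (for example "NA", or a repeated header row) is a more likely culprit. A hypothetical diagnostic helper (bad_p_values is not part of ldsc) could locate the offending rows:

```python
def bad_p_values(path, p_col="P", sep="\t", limit=10):
    """Report (line number, value) for entries in the p-value
    column that cannot be parsed as floats."""
    bad = []
    with open(path) as fh:
        header = fh.readline().rstrip("\n").split(sep)
        idx = header.index(p_col)
        for line_no, line in enumerate(fh, start=2):
            fields = line.rstrip("\n").split(sep)
            try:
                float(fields[idx])
            except (ValueError, IndexError):
                bad.append((line_no, fields[idx] if idx < len(fields) else ""))
            if len(bad) >= limit:
                break
    return bad
```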