bulik / ldsc

LD Score Regression (LDSC)
GNU General Public License v3.0
628 stars 340 forks source link

ERROR computing rg for phenotype 2/2, FloatingPointError: invalid value encountered in sqrt #54

Closed EmmaO94 closed 7 years ago

EmmaO94 commented 8 years ago

I'm trying to run LD regression and have been following the tutorial on here https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation

I have downloaded python and required packages through Anaconda and have also downloaded ldsc, I have converted my summary statistic .txt files to the ldsc .sumstats.gz format and proceeded to follow the command: ./ldsc.py \ --ref-ld-chr eur_w_ld_chr/ \ --out LDregression3 \ --rg FemaleStats.sumstats.gz,MaleStats.sumstats.gz \ --w-ld-chr eur_w_ld_chr/

However I receive this error message in my log:

Beginning analysis at Thu Jun 16 17:54:28 2016 Reading summary statistics from FemaleStats.sumstats.gz ... Read summary statistics for 507689 SNPs. Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... Read reference panel LD Scores for 1293150 SNPs. Removing partitioned LD Scores with zero variance. Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... Read regression weight LD Scores for 1293150 SNPs. After merging with reference panel LD, 507161 SNPs remain. After merging with regression SNP LD, 507161 SNPs remain. Computing rg for phenotype 2/2 Reading summary statistics from MaleStats.sumstats.gz ... Read summary statistics for 1217311 SNPs. After merging with summary statistics, 507161 SNPs remain. 507062 SNPs with valid alleles. ERROR computing rg for phenotype 2/2, from file MaleStats.sumstats.gz. Traceback (most recent call last): File "C:\Users\Emma\Anaconda2\ldscore\sumstats.py", line 343, in estimate_rg rghat = _rg(loop, args, log, M_annot, ref_ld_cnames, w_ld_cname, i) File "C:\Users\Emma\Anaconda2\ldscore\sumstats.py", line 470, in _rg intercept_gencov=intercepts[2], n_blocks=n_blocks, twostep=args.two_step) File "C:\Users\Emma\Anaconda2\ldscore\regressions.py", line 699, in init np.multiply(hsq1.tot_delete_values, hsq2.tot_delete_values)) FloatingPointError: invalid value encountered in sqrt

Summary of Genetic Correlation Results p1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se FemaleStats.sumstats.gz MaleStats.sumstats.gz NA NA NA NA NA NA NA NA NA NA

Analysis finished at Thu Jun 16 17:55:14 2016 Total time elapsed: 45.4s

I have asked for help from several forums and my supervisors on my placement however none of them have been able to, I have checked my sumstats files and there are no NA, missing or negative values, all values are in decimal form e.g. 1.0 instead of 1. All the program files are in the same directory as python and ldsc.

Can anyone help me in understanding what this error means and what I can do?

Many Thanks, Emma

rkwalters commented 8 years ago

Hi Emma, Having no negative values in the sumstats file is highly unusual since we expect Z scores to be normally distributed on either side of 0. Are you using munge_sumstats.py with the w_hm3.snplist file to generate these sumstats files? Logs from munge_sumstats.py may be informative, if available.

More directly to your error message, as mentioned on the ldsc_users group that error indicates that there are negative estimates of heritability observed for one of the traits when computing the jackknife standard errors, which normally this indicates that at least one of the input results has very low heritability. Have you had a chance to look into the univariate ldsc rusults for each file to follow up on this possibility?

Cheers, Raymond

On Jun 21, 2016, at 1:21 AM, EmmaO94 notifications@github.com wrote:

I'm trying to run LD regression and have been following the tutorial on here https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation

I have downloaded python and required packages through Anaconda and have also downloaded ldsc, I have converted my summary statistic .txt files to the ldsc .sumstats.gz format and proceeded to follow the command: ./ldsc.py \ --ref-ld-chr eur_w_ld_chr/ \ --out LDregression3 \ --rg FemaleStats.sumstats.gz,MaleStats.sumstats.gz \ --w-ld-chr eur_w_ld_chr/

However I receive this error message in my log:

Beginning analysis at Thu Jun 16 17:54:28 2016 Reading summary statistics from FemaleStats.sumstats.gz ... Read summary statistics for 507689 SNPs. Reading reference panel LD Score from eur_w_ld_chr/[1-22] ... Read reference panel LD Scores for 1293150 SNPs. Removing partitioned LD Scores with zero variance. Reading regression weight LD Score from eur_w_ld_chr/[1-22] ... Read regression weight LD Scores for 1293150 SNPs. After merging with reference panel LD, 507161 SNPs remain. After merging with regression SNP LD, 507161 SNPs remain. Computing rg for phenotype 2/2 Reading summary statistics from MaleStats.sumstats.gz ... Read summary statistics for 1217311 SNPs. After merging with summary statistics, 507161 SNPs remain. 507062 SNPs with valid alleles. ERROR computing rg for phenotype 2/2, from file MaleStats.sumstats.gz. Traceback (most recent call last): File "C:\Users\Emma\Anaconda2\ldscore\sumstats.py", line 343, in estimate_rg rghat = rg(loop, args, log, M_annot, ref_ld_cnames, w_ld_cname, i) File "C:\Users\Emma\Anaconda2\ldscore\sumstats.py", line 470, in rg intercept_gencov=intercepts[2], n_blocks=n_blocks, twostep=args.two_step) File "C:\Users\Emma\Anaconda2\ldscore\regressions.py", line 699, in __init np.multiply(hsq1.tot_delete_values, hsq2.tot_delete_values)) FloatingPointError: invalid value encountered in sqrt

Summary of Genetic Correlation Results p1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se FemaleStats.sumstats.gz MaleStats.sumstats.gz NA NA NA NA NA NA NA NA NA NA

Analysis finished at Thu Jun 16 17:55:14 2016 Total time elapsed: 45.4s

I have asked for help from several forums and my supervisors on my placement however none of them have been able to, I have checked my sumstats files and there are no NA, missing or negative values, all values are in decimal form e.g. 1.0 instead of 1. All the program files are in the same directory as python and ldsc.

Can anyone help me in understanding what this error means and what I can do?

Many Thanks, Emma

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

EmmaO94 commented 8 years ago

I'm not using z scores I'm using OR values, the tutorial states that summary stats should contain- A unique identifier (e.g., the rs number) Allele 1 (effect allele) Allele 2 (non-effect allele) Sample size (which often varies from SNP to SNP) A P-value A signed summary statistic (beta, OR, log odds, Z-score, etc)

I used the following command to convert my summary stats to the ldsc format- ./munge_sumstats.py \ --out MaleStats \ --merge-alleles w_hm3.snplist \ --sumstats NewMaleSumStat.txt

Below is the log of one of the conversions of my summary stats files

FemaleStats.txt

Below is the results of one of the heritability tests

FemHeritability.txt

Many Thanks, Emma

rkwalters commented 8 years ago

Hi Emma, Thanks for the log files, they’re very helpful. Are a couple things here that are noteworthy:

1) From munge_sumstats.py it looks like you have loci with some very substantial effects (max chi2 > 400). Loci like this are a strong departure from the purely polygenic model used by ldsc, and thus can hurt the stability of LD score regression results. There’s some effort to automatically compensate for this in ldsc, but it may be worth manually removing your highly significant loci and re-running the analysis to see if that gives you more stable results (i.e. smaller SEs).

2) Your dataset is very much pushing the limits of sample size and heritability for valid ldsc analysis of genetic correlation. N=4400 is on the low end of what works for ldsc, especially for dichotomous phenotypes, and the univariate heritability of .07 is similarly borderline. For comparison, LDHub restrict to looking at genetic correlation in GWAS with heritability Z-score (h2g / SE) above 2, and the original Nature Genetics paper restricted to Z-score > 4, compared to .0746/.0656 = 1.14 for your results.

If point #1 above doesn’t help stabilize your univariate results it’s possible that there simply isn’t enough signal in your data for a reliable genetic correlation analysis with ldsc (which would be consistent with the error message you are seeing). Given that you have highly significant loci however, a comparison of effect sizes for those loci (or a broader comparison with polygenic risk scores) may be a satisfactory alternative to ldsc with your data.

Cheers, Raymond

On Jun 21, 2016, at 7:24 PM, EmmaO94 notifications@github.com wrote:

I'm not using z scores I'm using OR values, the tutorial states that summary stats should contain- A unique identifier (e.g., the rs number) Allele 1 (effect allele) Allele 2 (non-effect allele) Sample size (which often varies from SNP to SNP) A P-value A signed summary statistic (beta, OR, log odds, Z-score, etc)

I used the following command to convert my summary stats to the ldsc format- ./munge_sumstats.py \ --out MaleStats \ --merge-alleles w_hm3.snplist \ --sumstats NewMaleSumStat.txt

Below is the log of one of the conversions of my summary stats files

FemaleStats.txt

Below is the results of one of the heritability tests

FemHeritability.txt

Many Thanks, Emma

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

EmmaO94 commented 8 years ago

Raymond,

Removing my top SNPs worked wonderfully, thankyou so much!

Many Thanks, Emma

guruprem commented 7 years ago

Hi, Thanks for the nice explanation. I am facing a similar situation. Please find the log file attached. From your comments to Emma, It looks like I am not facing the first situation and also not limited by the N. But the mean chi2 is just above (1.03) the requested limit of 1.02. Could this be the cause of the error that I am seeing, that there are not enough polygenic signal.

Note: I am not removing any GW hits.

Interpreting column names as follows: snpid: Variant ID (e.g., rs number) N: Sample size a1: Allele 1, interpreted as ref allele for signed sumstat. P: p-Value a2: Allele 2, interpreted as non-ref allele for signed sumstat. Z: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)

Reading list of SNPs for allele merge from w_hm3.snplist Read 1217311 SNPs for allele merge. Reading sumstats from inputsformunge_nogcP/charge_as.ldsc.input into memory 5000000.0 SNPs at a time. Read 7807078 SNPs from --sumstats file. Removed 6642315 SNPs not in --merge-alleles. Removed 0 SNPs with missing values. Removed 0 SNPs with INFO <= 0.9. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with out-of-bounds p-values. Removed 0 variants that were not SNPs or were strand-ambiguous. 1164763 SNPs remain. Removed 0 SNPs with duplicated rs numbers (1164763 SNPs remain). Removed 64625 SNPs with N < 56640.6666667 (1100138 SNPs remain). Median value of Z was 0.00941176, which seems sensible. Removed 0 SNPs whose alleles did not match --merge-alleles (1100138 SNPs remain). Writing summary statistics for 1217311 SNPs (1100138 with nonmissing beta) to sumstats/CHARGE-AS_FINAL.ldsc.sumstats.gz.

Metadata: Mean chi^2 = 1.039 Lambda GC = 1.042 Max chi^2 = 22.168 0 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Fri Jan 27 17:13:42 2017 Total time elapsed: 57.22s

rkwalters commented 7 years ago

Hi,

Lack of polygenic signal definitely seems like a possibility here. Are you just seeing errors in --rg, or do you get errors from univariate estimation of heritability (--h2) as well?

Cheers, Raymond

On Jan 27, 2017, at 11:32 AM, guruprem notifications@github.com wrote:

Hi, Thanks for the nice explanation. I am facing a similar situation. Please find the log file attached. From your comments to Emma, It looks like I am not facing the first situation and also not limited by the N. But the mean chi2 is just above (1.03) the requested limit of 1.02. Could this be the cause of the error that I am seeing, that there are not enough polygenic signal.

Note: I am not removing any GW hits.

Interpreting column names as follows: snpid: Variant ID (e.g., rs number) N: Sample size a1: Allele 1, interpreted as ref allele for signed sumstat. P: p-Value a2: Allele 2, interpreted as non-ref allele for signed sumstat. Z: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)

Reading list of SNPs for allele merge from w_hm3.snplist Read 1217311 SNPs for allele merge. Reading sumstats from inputsformunge_nogcP/charge_as.ldsc.input into memory 5000000.0 SNPs at a time. Read 7807078 SNPs from --sumstats file. Removed 6642315 SNPs not in --merge-alleles. Removed 0 SNPs with missing values. Removed 0 SNPs with INFO <= 0.9. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with out-of-bounds p-values. Removed 0 variants that were not SNPs or were strand-ambiguous. 1164763 SNPs remain. Removed 0 SNPs with duplicated rs numbers (1164763 SNPs remain). Removed 64625 SNPs with N < 56640.6666667 (1100138 SNPs remain). Median value of Z was 0.00941176, which seems sensible. Removed 0 SNPs whose alleles did not match --merge-alleles (1100138 SNPs remain). Writing summary statistics for 1217311 SNPs (1100138 with nonmissing beta) to sumstats/CHARGE-AS_FINAL.ldsc.sumstats.gz.

Metadata: Mean chi^2 = 1.039 Lambda GC = 1.042 Max chi^2 = 22.168 0 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Fri Jan 27 17:13:42 2017 Total time elapsed: 57.22s

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bulik/ldsc/issues/54#issuecomment-275708858, or mute the thread https://github.com/notifications/unsubscribe-auth/AILEvdDQuaCA0e1XV73UUX3Jm7sbnD85ks5rWhwDgaJpZM4I50AX.

guruprem commented 7 years ago

Thanks for the reply raymond, Yes, the error is throwing up only for the --rg, univariate estimation seems to be working fine. Surprisingly, the error is seen for all those trait combinations with NA, except for one (T1 vs T2). Note: the log file mentioned above is from T1

image

rkwalters commented 7 years ago

Hi, In this case lack of polygenic signal is the clear culprit. Trait 1’s h2=.003 is very small (I’m guessing it’s not significantly different from zero?), and not likely to be a good fit for LD score analyses.

Realistically, with h2 that low I’m actually less surprised by the errors for T1 vs T4-T7 than surprised that you managed to get the T1 vs. T2 estimate without the error.

Cheers, Raymond

On Jan 30, 2017, at 11:41 AM, guruprem notifications@github.com wrote:

Thanks for the reply raymond, Yes, the error is throwing up only for the --rg, univariate estimation seems to be working fine. Surprisingly, the error is seen for all those trait combinations with NA, except for one (T1 vs T2). Note: the log file mentioned above is from T1

https://cloud.githubusercontent.com/assets/23656302/22431840/3ba494e4-e713-11e6-9914-331b1af729d6.png — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bulik/ldsc/issues/54#issuecomment-276114462, or mute the thread https://github.com/notifications/unsubscribe-auth/AILEvf1CbZ1MwdbBJUB84RvpaNLlxXcgks5rXhLUgaJpZM4I50AX.

rkwalters commented 7 years ago

Hi, I'm closing this issue thread as resolved since it's been more than 2 months since the last comment, but if you still have questions about this or any other issues feel free to follow up here or via the google group. Cheers, Raymond