bulik / ldsc

LD Score Regression (LDSC)

fatal error while running munge_sumstats.py #43

Open tushardave26 opened 8 years ago

tushardave26 commented 8 years ago

Hello,

After successfully installing ldsc on my system, I tried to run the munge_sumstats.py script to convert my input files to the ldsc format. I thought this would be a smooth process, but it wasn't: while running munge_sumstats.py, I hit a fatal error about the median value of beta. For reference, I have pasted the script and the fatal error below:

SCRIPT:

python munge_sumstats.py \
--sumstats inFile.txt \
--N 18759 \
--out out \
--merge-alleles hapmap.txt

ERROR: ERROR converting summary statistics:

Traceback (most recent call last):
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 654, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 366, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is -3.6 (should be close to 0). This column may be mislabeled.

I then suspected that the negative beta values in my input file were causing the error, so I converted all negative values to positive and ran the same script again. Unfortunately, the same error occurred (with the sign flipped). For reference, I have pasted the error below:

ERROR converting summary statistics:

Traceback (most recent call last):
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 654, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 366, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is 3.6 (should be close to 0). This column may be mislabeled.

I hope I have explained my issue clearly enough.

Can someone explain why I am getting this error?

Thanks.

rkwalters commented 8 years ago

Hi Tushar,

You are getting an error because your median beta is far from 0. Having both positive and negative values is fine (and expected!), but we normally expect the distribution of betas to be centered at the null value of 0.

The fix for this error will depend on your data, but possible things to check include (see also the quick check sketched below):

- Whether the file contains results for the complete GWAS, or only a subset of SNPs (e.g., only top hits).
- Whether the values in the beta column are really regression coefficients; odds ratios, for instance, are centered at a null value of 1 rather than 0 and would need to be log-transformed.
- If the signed column can't be fixed, whether to run munge_sumstats.py with --a1-inc, which ignores the signed statistic and assumes A1 is the trait-increasing allele.
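
As a quick sanity check before re-running munge_sumstats.py, you can compute the median yourself. A minimal sketch (assuming a whitespace-delimited file and a column literally named "beta"; adjust to your data):

import pandas as pd

# hypothetical file name; whitespace-delimited summary statistics assumed
dat = pd.read_csv('inFile.txt', delim_whitespace=True)

# munge_sumstats.py expects the median of a beta/log-odds column to be
# near 0; a median near 1 would suggest untransformed odds ratios
print(dat['beta'].median())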

As an aside, there's an ldsc users group (https://groups.google.com/forum/#!forum/ldsc_users) that reaches a larger audience and is great for troubleshooting these kinds of questions. It also helps us separate initial troubleshooting / usage questions / discussion from confirmed bugs etc. on the GitHub issue tracker for future ldsc versions. Not a big deal, and I'm still happy to help here, but something to know for future reference.

Cheers, Raymond


tushardave26 commented 8 years ago

Thanks, Raymond, for your prompt response. I apologize if I am not asking my questions in the right place; in the future, I will post them on the Google group you mentioned.

Thanks for the possible fixes you provided. Below are my answers:

- Yes, I am using complete GWAS data with 123,026 SNPs.
- I don't have any information about the beta values, as I got the data from my collaborator, so I will skip this option for now.
- I used the --a1-inc option, since all beta values are negative, and the error was resolved.

But now I am getting another error with my other GWAS data file. Below are the script and the error:

SCRIPT:

python munge_sumstats.py \
--sumstats inFile.txt \
--N 49324 \
--snp SNP \
--a1 CODED_ALLELE \
--a2 NONCODED_ALLELE \
--p P_FIXED \
--frq CODED_ALLELE_FREQ \
--out out \
--merge-alleles hapmap.txt

ERROR: ERROR converting summary statistics:

Traceback (most recent call last):
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 639, in munge_sumstats
    dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 281, in parse_dat
    dat.A1 = dat.A1.str.upper()
  File "/local/projects-t2/SIGN/packages/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 2360, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'str'

Any possible solution for this error?

Thanks for your help.

rkwalters commented 8 years ago

Hi Tushar,

Prior to the error you should have a block of output starting with "Interpreting column names as follows:" and then listing the column headers in use. Is your column for allele A1 correctly listed there? (Posting the full log file might also be helpful here.)

If that's OK, the next thing I'd check is whether you have any SNPs where the entry for A1 is empty/missing.
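
For instance, a minimal sketch of that check (assuming a whitespace-delimited file; the CODED_ALLELE column name comes from your command, so adjust as needed):

import pandas as pd

dat = pd.read_csv('inFile.txt', delim_whitespace=True)

# count A1 entries that are missing or empty; any nonzero count is suspect
print(dat['CODED_ALLELE'].isnull().sum())
print((dat['CODED_ALLELE'] == '').sum())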

Cheers, Raymond


tushardave26 commented 8 years ago

Hello Raymond,

Below is the log file content:


Interpreting column names as follows:
P_FIXED: p-Value
CODED_ALLELE_FREQ: Allele frequency
A1: Allele 1, interpreted as ref allele for signed sumstat.
beta: [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
A2: Allele 2, interpreted as non-ref allele for signed sumstat.
SNP: Variant ID (e.g., rs number)
NONCODED_ALLELE: Allele 2, interpreted as non-ref allele for signed sumstat.
CODED_ALLELE: Allele 1, interpreted as ref allele for signed sumstat.

Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /local/projects-t2/SIGN/analysis/depression/LDSC/stroke.meta_analysis.allstroke.EUR.all.out into memory 5000000.0 SNPs at a time.

ERROR converting summary statistics:

Traceback (most recent call last):
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 639, in munge_sumstats
    dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
  File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 281, in parse_dat
    dat.A1 = dat.A1.str.upper()
  File "/local/projects-t2/SIGN/packages/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 2360, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'str'

bulik commented 8 years ago

This looks like it could be an issue with pandas. Could you try running the following tiny Python script:

import pandas as pd
print pd.__version__

If you're running an old version of pandas (below 0.16), try updating pandas (e.g., with pip or conda).

tushardave26 commented 8 years ago

Hello Brendan,

I don't think it's a package version issue, as the installed package versions are newer than the required ones.

tushardave26 commented 8 years ago

Hello Raymond and Brendan,

I found the problem: it was a column name issue. I hadn't noticed that my file already has columns labeled A1 and A2, and on top of that I explicitly specified CODED_ALLELE and NONCODED_ALLELE. The script identified all four columns as allele columns, so dat.A1 selected two columns (a DataFrame rather than a Series), which has no .str accessor, hence the error above. Once I removed the line in my script that defined the column names explicitly, the script ran fine.

I have another question for you both. With the --merge-alleles option, I am using the HapMap SNP list file, as suggested in the wiki. Is it OK to use the HapMap SNP list for data that were imputed using 1000 Genomes reference panels? I am asking because I am losing a lot of SNPs that are not present in the HapMap file. The log lines below illustrate my point:

Read 10727831 SNPs from --sumstats file.
Removed 9528559 SNPs not in --merge-alleles.

What are your thoughts on this?

Thanks for your help.

tushardave26 commented 8 years ago

Is there any way to print the removed SNPs to a separate file?

rkwalters commented 8 years ago

Hi Tushar,

There's not currently an option to print the removed SNPs, but all of the SNPs that are kept can be found in the output .sumstats.gz and compared to your input file to determine what was excluded.
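
For example, a minimal sketch of that comparison (file names here are placeholders; the .sumstats output normally has SNP/A1/A2/N/Z columns, with Z missing for SNPs that were filtered out):

import pandas as pd

# munged output: one row per --merge-alleles SNP, stats missing if filtered
kept = pd.read_csv('out.sumstats.gz', sep='\t')
kept = kept[kept['Z'].notna()]

# original input summary statistics (whitespace-delimited assumed)
raw = pd.read_csv('inFile.txt', delim_whitespace=True)

# input SNPs absent from the kept set are the ones that were removed
removed = raw[~raw['SNP'].isin(kept['SNP'])]
removed.to_csv('removed_snps.txt', sep='\t', index=False)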

To your earlier question about --merge-alleles, the large number of removed SNPs is expected here. The w_hm3.snplist file only contains ~1.2 million SNPs, so it looks like you have nearly all of them, and the remaining 9 million non-HapMap SNPs are excluded. Even without the SNP list, many of these would probably be excluded by the MAF and INFO score filters anyway.

The HapMap3 SNP list is applicable to 1000 Genomes imputed data. The list provides a robust set of common SNPs that is reliably imputed across studies, which makes it a good baseline standard, especially for analyzing genetic correlation.

Cheers, Raymond


tushardave26 commented 8 years ago

Thanks Raymond for your explanation.

My curiosity about ldsc keeps growing, so I have another question for you. Below is the log file for my second summary statistics file:


LD Score Regression (LDSC)
Version 1.0.0
(C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
Broad Institute of MIT and Harvard / MIT Department of Mathematics
GNU General Public License v3
*****
Call:
./munge_sumstats.py \
--out /local/projects-t2/SIGN/analysis/depression/LDSC/sign \
--merge-alleles /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist \
--frq CODED_ALLELE_FREQ \
--N 49324.0 \
--snp SNP \
--sumstats /local/projects-t2/SIGN/analysis/depression/LDSC/stroke.meta_analysis.allstroke.EUR.all.out \
--p P_FIXED

Interpreting column names as follows:
P_FIXED: p-Value
CODED_ALLELE_FREQ: Allele frequency
A1: Allele 1, interpreted as ref allele for signed sumstat.
beta: [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
A2: Allele 2, interpreted as non-ref allele for signed sumstat.
SNP: Variant ID (e.g., rs number)

Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /local/projects-t2/SIGN/analysis/depression/LDSC/stroke.meta_analysis.allstroke.EUR.all.out into memory 5000000.0 SNPs at a time.
Read 10727831 SNPs from --sumstats file.
Removed 9528559 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 9528694 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 2 variants that were not SNPs or were strand-ambiguous.
1199135 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (1199135 SNPs remain).
Using N = 49324.0
Median value of beta was -0.0002, which seems sensible.
Removed 9 SNPs whose alleles did not match --merge-alleles (1199126 SNPs remain).
Writing summary statistics for 1217311 SNPs (1199126 with nonmissing beta) to /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz.

Metadata:
Mean chi^2 = 1.04
Lambda GC = 1.039
Max chi^2 = 41.004
9 Genome-wide significant SNPs (some may have been removed by filtering).

The log line just above the Metadata section says "Writing summary statistics for 1217311 SNPs (1199126 with nonmissing beta) to /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz". How can the two numbers in that line differ from each other? Also, when I count the lines of the result files I get 1217311; in fact, both of my datasets produced the same number of lines (1217311) in their respective result files. Why is that? And why are there blank column values in the result files? Is it because those SNPs were either removed or not found in the HapMap3 file?

As the next step in my analysis, I am running LD Score regression, using the same script explained in the wiki (but with my input files). My question for this analysis: will the algorithm run the LD Score regression only on the common SNPs, i.e., the SNPs present in both data files?

I hope my questions are clear enough.

Thanks again for your help.

rkwalters commented 8 years ago

Hi Tushar,

For ease of comparison across datasets, --merge-alleles prints output to the .sumstats.gz file for all SNPs in the --merge-alleles file. SNPs that aren't present in your input data are output with missing values. So using w_hm3.snplist you will always have 1217311 SNPs in the output, and in this case your input data has results for 1199126 of those SNPs (after filtering). You can verify this directly from the output, as sketched below.
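
A minimal sketch of that verification (assuming the standard SNP/A1/A2/N/Z output columns, with Z missing for padded SNPs):

import pandas as pd

out = pd.read_csv('sign.sumstats.gz', sep='\t')
print(len(out))                # 1217311: every SNP in w_hm3.snplist
print(out['Z'].notna().sum())  # 1199126: SNPs with nonmissing statistics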

The ldsc log should indicate the number of SNPs remaining in the regression (i.e., after merging with the SNPs present in the file of reference panel LD scores).

Cheers, Raymond


tushardave26 commented 8 years ago

Hello, Raymond

If I understand you correctly, no matter how many SNPs are left after the filtering process, I will always get the same number of SNPs (i.e., 1217311) in the output file.

Can I use 1000 Genomes SNPs instead of HapMap3? If so, can you point me to the website where I can download them?

How reliable are the LD Score regression results if the number of SNPs is < 200K? Unfortunately, I have ~114K SNPs, and the script prints a WARNING message.

Thanks.

tushardave26 commented 8 years ago

Hello Raymond,

When I tried to convert the observed-scale heritability to the liability scale, I got the following error:


LD Score Regression (LDSC)
Version 1.0.0
(C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
Broad Institute of MIT and Harvard / MIT Department of Mathematics
GNU General Public License v3
*****
Call:
./ldsc.py \
--ref-ld-chr /local/projects-t2/SIGN/analysis/depression/LDSC/eur_w_ld_chr/ \
--out /local/projects-t2/SIGN/analysis/depression/LDSC/depression_sign_test \
--rg /local/projects-t2/SIGN/analysis/depression/LDSC/depression.sumstats.gz,/local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz \
--pop-prev 7.6,2.6 \
--samp-prev 0.49,0.34 \
--w-ld-chr /local/projects-t2/SIGN/analysis/depression/LDSC/eur_w_ld_chr/

Beginning analysis at Mon Feb 29 16:51:27 2016
Reading summary statistics from /local/projects-t2/SIGN/analysis/depression/LDSC/depression.sumstats.gz ...
Read summary statistics for 114486 SNPs.
Reading reference panel LD Score from /local/projects-t2/SIGN/analysis/depression/LDSC/eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1293150 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from /local/projects-t2/SIGN/analysis/depression/LDSC/eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1293150 SNPs.
After merging with reference panel LD, 114456 SNPs remain.
After merging with regression SNP LD, 114456 SNPs remain.
Computing rg for phenotype 2/2
Reading summary statistics from /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz ...
Read summary statistics for 1217311 SNPs.
After merging with summary statistics, 114456 SNPs remain.
114445 SNPs with valid alleles.
WARNING: number of SNPs less than 200k; this is almost always bad.

Heritability of phenotype 1

ERROR computing rg for phenotype 2/2, from file /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz.
Traceback (most recent call last):
  File "/local/projects-t2/SIGN/packages/ldsc/ldscore/sumstats.py", line 345, in estimate_rg
    _print_gencor(args, log, rghat, ref_ld_cnames, i, rg_paths, i == 0)
  File "/local/projects-t2/SIGN/packages/ldsc/ldscore/sumstats.py", line 417, in _print_gencor
    log.log(rghat.hsq1.summary(ref_ld_cnames, P=P[0], K=K[0]))
  File "/local/projects-t2/SIGN/packages/ldsc/ldscore/regressions.py", line 442, in summary
    c = h2_obs_to_liab(1, P, K)
  File "/local/projects-t2/SIGN/packages/ldsc/ldscore/regressions.py", line 130, in h2_obs_to_liab
    raise ValueError('K must be in the range (0,1)')
ValueError: K must be in the range (0,1)

Can you please help me solve this?

Thanks for your help.

rkwalters commented 8 years ago

Hi Tushar,

For the observed/liability conversion, your error comes from having population prevalences > 1. Perhaps you want --pop-prev .076,.026 rather than 7.6 and 2.6?
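
For reference, a minimal sketch of what h2_obs_to_liab in ldscore/regressions.py does (this is the standard observed-to-liability conversion of Lee et al. 2011, reconstructed here rather than copied from ldsc):

from scipy.stats import norm

def h2_obs_to_liab(h2_obs, P, K):
    # P: sample prevalence of cases; K: population prevalence.
    # Both are proportions in (0, 1), which is why passing
    # --pop-prev 7.6 (instead of 0.076) raises the ValueError above.
    if not 0 < K < 1:
        raise ValueError('K must be in the range (0,1)')
    z = norm.pdf(norm.isf(K))  # normal density at the liability threshold
    return h2_obs * K ** 2 * (1 - K) ** 2 / (P * (1 - P) * z ** 2)

# e.g., h2_obs_to_liab(0.2, P=0.49, K=0.076)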

Having only ~114K SNPs is unexpected and potentially problematic. Have you checked that everything worked properly with munge_sumstats.py for that dataset?

Using 1000 Genomes SNPs is fine; 1000 Genomes data is available at http://www.1000genomes.org/data. I'd recommend consulting the Methods sections of the first 3 LD Score papers listed under citations in the README for a number of additional comments about choosing SNPs to include in the regression.

Cheers, Raymond


tushardave26 commented 8 years ago

Hello Raymond,

I don't see any issue with the munge_sumstats.py runs, but I have pasted the contents of the log files below anyway. Can you please take a quick look and let me know if you see any potential issue? Thanks for your help.

====> Log file for first dataset ==========


LD Score Regression (LDSC)
Version 1.0.0
(C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
Broad Institute of MIT and Harvard / MIT Department of Mathematics
GNU General Public License v3
*****
Call:
./munge_sumstats.py \
--out /local/projects-t2/SIGN/analysis/depression/LDSC/depression \
--merge-alleles /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist \
--a1-inc \
--N 18759.0 \
--sumstats /local/projects-t2/SIGN/analysis/depression/LDSC/depression_summary_statistics.txt

Interpreting column names as follows:
a1: Allele 1, interpreted as ref allele for signed sumstat.
pval: p-Value
snpid: Variant ID (e.g., rs number)
info: INFO score (imputation quality; higher --> better imputation)
a2: Allele 2, interpreted as non-ref allele for signed sumstat.

Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /local/projects-t2/SIGN/analysis/depression/LDSC/depression_summary_statistics.txt into memory 5000000.0 SNPs at a time.
Read 123025 SNPs from --sumstats file.
Removed 8534 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 0 variants that were not SNPs or were strand-ambiguous.
114491 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (114491 SNPs remain).
Using N = 18759.0
Removed 5 SNPs whose alleles did not match --merge-alleles (114486 SNPs remain).
Writing summary statistics for 1217311 SNPs (114486 with nonmissing beta) to /local/projects-t2/SIGN/analysis/depression/LDSC/depression.sumstats.gz.

Metadata:
Mean chi^2 = 1.595
Lambda GC = 2.088
Max chi^2 = 24.755
0 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Fri Feb 26 15:53:56 2016

Total time elapsed: 8.95s

====> Log file for second dataset ====


LD Score Regression (LDSC)
Version 1.0.0
(C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
Broad Institute of MIT and Harvard / MIT Department of Mathematics
GNU General Public License v3
*****
Call:
./munge_sumstats.py \
--out /local/projects-t2/SIGN/analysis/depression/LDSC/sign \
--merge-alleles /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist \
--frq CODED_ALLELE_FREQ \
--N 49324.0 \
--snp SNP \
--sumstats /local/projects-t2/SIGN/analysis/depression/LDSC/stroke.meta_analysis.allstroke.EUR.all.out \
--p P_FIXED

Interpreting column names as follows:
P_FIXED: p-Value
CODED_ALLELE_FREQ: Allele frequency
A1: Allele 1, interpreted as ref allele for signed sumstat.
beta: [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
A2: Allele 2, interpreted as non-ref allele for signed sumstat.
SNP: Variant ID (e.g., rs number)

Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /local/projects-t2/SIGN/analysis/depression/LDSC/stroke.meta_analysis.allstroke.EUR.all.out into memory 5000000.0 SNPs at a time.
Read 10727831 SNPs from --sumstats file.
Removed 9528559 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 9528694 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 2 variants that were not SNPs or were strand-ambiguous.
1199135 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (1199135 SNPs remain).
Using N = 49324.0
Median value of beta was -0.0002, which seems sensible.
Removed 9 SNPs whose alleles did not match --merge-alleles (1199126 SNPs remain).
Writing summary statistics for 1217311 SNPs (1199126 with nonmissing beta) to /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz.

Metadata:
Mean chi^2 = 1.04
Lambda GC = 1.039
Max chi^2 = 41.004
9 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Sat Feb 27 16:11:20 2016

Total time elapsed: 1.0m:27.45s

I have one question about the second dataset's log file. As you can see, the script did not find 9528559 SNPs in the HapMap3 SNP list file, so it removed them from the analysis. The script also reports removing 9528694 SNPs with MAF <= 0.01. But the script did not actually remove 9528559 + 9528694 SNPs in total. So why does the log report those two numbers? Can you please elaborate?

Thanks for all your help.

rkwalters commented 8 years ago

Hi Tushar,

I agree that the munge_sumstats.py log looks like it ran correctly. I wasn't expecting that your input GWAS for the first dataset had only ~120k SNPs; that seems unusually small. This may or may not be an issue depending on how representative these SNPs are, since with the default setup ldsc will be making some strong assumptions to extrapolate from these SNPs to estimate h2.

The log incorrectly reports SNPs removed by --merge-alleles as also removed for MAF, so the actual number removed for MAF should be 9528694 - 9528559 = 135.

Cheers, Raymond


tushardave26 commented 8 years ago

Thanks, Raymond, for your efforts. I also suspected that only 135 SNPs were actually removed by the MAF <= 0.01 filter.

I would like to ask you a question about the 1000 Genomes data. What should I do to get a list of 1000 Genomes SNPs? Should I download the VCF file for each chromosome, extract three columns (i.e., ID, REF, ALT), concatenate the per-chromosome files into one, and use that combined file as input to the --merge-alleles option? Or is there a file available online that already covers all chromosomes?

Thanks.

rkwalters commented 8 years ago

I'm not aware of a full list of variants provided for 1KG, so extracting from the VCFs is probably the necessary solution. You'll also want to filter that list of variants for e.g. MAF, indels, strand-ambiguous SNPs, etc., and you'll want to make sure your GWAS data are being filtered for imputation quality. Again, the 3 papers have more detailed discussion of this. A rough sketch of the extraction is below.
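
Purely as an illustrative sketch (the per-chromosome file names are hypothetical; point them at the real 1KG VCFs, and the MAF/INFO filtering mentioned above still has to happen separately):

import gzip

AMBIGUOUS = {('A', 'T'), ('T', 'A'), ('C', 'G'), ('G', 'C')}

with open('1kg_snplist.txt', 'w') as out:
    out.write('SNP\tA1\tA2\n')
    for chrom in range(1, 23):
        # hypothetical file name; adjust to the actual 1KG release paths
        with gzip.open('chr{}.vcf.gz'.format(chrom), 'rb') as vcf:
            for raw_line in vcf:
                line = raw_line.decode('utf-8', 'replace')
                if line.startswith('#'):
                    continue
                fields = line.split('\t')
                snp_id, ref, alt = fields[2], fields[3], fields[4]
                # keep biallelic SNPs; drop indels and strand-ambiguous pairs
                if len(ref) == 1 and len(alt) == 1 and (ref, alt) not in AMBIGUOUS:
                    out.write('{}\t{}\t{}\n'.format(snp_id, ref, alt))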

Also, I should point out that for genetic correlation you're ultimately going to be restricted to the SNPs overlapping between your two datasets regardless, so I doubt this will have a large impact, since the vast majority of SNPs in your smaller dataset already pass the HapMap3 list.

Cheers, Raymond


tushardave26 commented 8 years ago

I agree with you on the overlapping SNPs point. No matter what reference dataset I use to compare the alleles, LD Score regression will be performed only on the overlapping SNPs.

Thanks for all your help.

Best, Tushar

tushardave26 commented 8 years ago

Hello Raymond,

Just curious, can I use meta-analysis results as summary statistics? Or is ldsc restricted to GWAS results?

rkwalters commented 8 years ago

Yep, genome-wide meta-analysis results are perfectly suitable for ldsc.

Cheers, Raymond


tushardave26 commented 8 years ago

Thanks for your prompt response.

hilaryfinucane commented 8 years ago

Hi Tushar,

The 1000G variants we used to compute LD scores are in the 1000G plink bim files, which can be downloaded with the other 1000G plink files from the directory with all our publicly available data: https://data.broadinstitute.org/alkesgroup/LDSCORE/.

Best,

Hilary

On Tue, Mar 1, 2016 at 1:30 PM, Tushar Harishbhai Dave < notifications@github.com> wrote:

Thanks for your prompt response.

— Reply to this email directly or view it on GitHub https://github.com/bulik/ldsc/issues/43#issuecomment-190843549.