JonJala / mtag

Python command line tool for Multi-Trait Analysis of GWAS (MTAG)
GNU General Public License v3.0
169 stars 54 forks source link

GWAS mean chi2 and MetaData mean chi2 #141

Closed mojolin2 closed 2 years ago

mojolin2 commented 3 years ago

Dear mtaggers,

First of all thank you for putting out such a cool and efficient tool. I was running the latest code on some lipids and blood pressure GWAS data from UK Biobank, GWAS from BOLT-LMM. While the job finished, I noticed 'Sigma matrix used is still non-positive-definite.' in the logs. I read in the issues history that this could be related to too high of correlation between the traits, say LDL and Total Chol, r > .95, so I took out the Total Chol but still got the error, it seems like it was because of GWAS mean chi2 for LDL was 0.99, though in the beginning of the log, the MetaData had the chi2 value for LDL (trait 1) to be 1.73. The sample size is about 300K. Any recommendations?

Best- jake

Beginning MTAG analysis... MTAG will use the Z column for analyses. Read in Trait 1 summary statistics (18116736 SNPs) from LDL2_mtag_in.txt ... <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging Trait 1 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><>< <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Interpreting column names as follows: snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. Read 18116736 SNPs from --sumstats file. Removed 1041 SNPs with missing values. Removed 0 SNPs with INFO <= None. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with SE <0 or NaN values. Removed 0 SNPs with out-of-bounds p-values. Removed 1685198 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped. 16430497 SNPs remain. Removed 18699 SNPs with duplicated rs numbers (16411798 SNPs remain). Removed 0 SNPs with N < 0.0 (16411798 SNPs remain). Median value of SIGNED_SUMSTAT was -0.00642259715909, which seems sensible. Dropping snps with null values

Metadata: Mean chi^2 = 1.728 Lambda GC = 1.254 Max chi^2 = 1408.856 62701 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Thu Sep 9 10:22:05 2021 Total time elapsed: 4.0m:45.41s <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging of Trait 1 complete. SNPs remaining: 16430497 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 1: Dropped 18699 SNPs for duplicate values in the "snp_name" column Read in Trait 2 summary statistics (18116736 SNPs) from HDL_mtag_in.txt ... <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging Trait 2 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><>< <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Interpreting column names as follows: snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. Read 18116736 SNPs from --sumstats file. Removed 1199 SNPs with missing values. Removed 0 SNPs with INFO <= None. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with SE <0 or NaN values. Removed 0 SNPs with out-of-bounds p-values. Removed 1685184 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped. 16430353 SNPs remain. Removed 18699 SNPs with duplicated rs numbers (16411654 SNPs remain). Removed 0 SNPs with N < 0.0 (16411654 SNPs remain). Median value of SIGNED_SUMSTAT was -0.00403410078196, which seems sensible. Dropping snps with null values

Metadata: Mean chi^2 = 1.786 Lambda GC = 1.311 Max chi^2 = 1408.856 58307 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Thu Sep 9 10:28:44 2021 Total time elapsed: 4.0m:43.8s <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging of Trait 2 complete. SNPs remaining: 16430353 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 2: Dropped 18699 SNPs for duplicate values in the "snp_name" column Read in Trait 3 summary statistics (18116736 SNPs) from TRIG_mtag_in.txt ... <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging Trait 3 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><>< <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Interpreting column names as follows: snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. Read 18116736 SNPs from --sumstats file. Removed 1041 SNPs with missing values. ... Sigma hat: [[ 1.769 -0.134 0.288 0.131] [-0.134 1.41 -0.469 -0.043] [ 0.288 -0.469 1.087 0.123] [ 0.131 -0.043 0.123 1.098]] Mean chi^2 of SNPs used to estimate Omega is low for some SNPsMTAG may not perform well in this situation. Beginning estimation of Omega ... Using GMM estimator of Omega .. Checking for positive definiteness .. matrix is not positive definite, performing adjustment.. Warning: max number of iterations reached in adjustment procedure. Sigma matrix used is still non-positive-definite. Completed estimation of Omega ... Beginning MTAG calculations... ... Completed MTAG calculations. Writing Phenotype 1 to file ... Writing Phenotype 2 to file ... Writing Phenotype 3 to file ... Writing Phenotype 4 to file ...

Summary of MTAG results:

Trait # SNPs used N (max) N (mean) GWAS mean chi^2 MTAG mean chi^2 GWAS equiv. (max) N 1 LDL2_mtag_in.txt 13874381 340951 340951 0.999 0.943 14945144 2 HDL_mtag_in.txt 13874381 290000 290000 1.285 1.193 196097 3 TRIG_mtag_in.txt 13874381 290000 290000 1.545 1.362 192295 4 SBP_mtag_in.txt 13874381 290000 290000 1.485 1.489 292886

Estimated Omega: [[-6.758e-09 2.021e-11 4.492e-11 -5.942e-13] [ 2.021e-11 1.387e-06 -4.635e-11 -1.009e-11] [ 4.492e-11 -4.635e-11 2.044e-06 6.511e-12] [-5.942e-13 -1.009e-11 6.511e-12 1.835e-06]]

(Correlation): [[ nan nan nan nan] [ nan 1.000e+00 -2.753e-05 -6.326e-06] [ nan -2.753e-05 1.000e+00 3.361e-06] [ nan -6.326e-06 3.361e-06 1.000e+00]]

Estimated Sigma: [[ 1.769 -0.134 0.288 0.131] [-0.134 1.41 -0.469 -0.043] [ 0.288 -0.469 1.087 0.123] [ 0.131 -0.043 0.123 1.098]]

(Correlation): [[ 1. -0.085 0.208 0.094] [-0.085 1. -0.379 -0.035] [ 0.208 -0.379 1. 0.113] [ 0.094 -0.035 0.113 1. ]]

paturley commented 3 years ago

What's actually killing you in this case is that the LD score intercept is greater than the mean chi2 stat. That is a bit surprising to me for a trait like LDL. When you run LD score regression on those summary statistics, do you get a very large LD score intercept?

On Thu, Sep 9, 2021 at 7:28 AM mojolin2 @.***> wrote:

Dear mtaggers,

First of all thank you for putting out such a cool and efficient tool. I was running the latest code on some lipids and blood pressure GWAS data from UK Biobank, GWAS from BOLT-LMM. While the job finished, I noticed 'Sigma matrix used is still non-positive-definite.' in the logs. I read in the issues history that this could be related to too high of correlation between the traits, say LDL and Total Chol, r > .95, so I took out the Total Chol but still got the error, it seems like it was because of GWAS mean chi2 for LDL was 0.99, though in the beginning of the log, the MetaData had the chi2 value for LDL (trait 1) to be 1.73. The sample size is about 300K. Any recommendations?

Best- jake

Beginning MTAG analysis... MTAG will use the Z column for analyses. Read in Trait 1 summary statistics (18116736 SNPs) from LDL2_mtag_in.txt ...

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging Trait 1 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Interpreting column names as follows: snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. Read 18116736 SNPs from --sumstats file. Removed 1041 SNPs with missing values. Removed 0 SNPs with INFO <= None. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with SE <0 or NaN values. Removed 0 SNPs with out-of-bounds p-values. Removed 1685198 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped. 16430497 SNPs remain. Removed 18699 SNPs with duplicated rs numbers (16411798 SNPs remain). Removed 0 SNPs with N < 0.0 (16411798 SNPs remain). Median value of SIGNED_SUMSTAT was -0.00642259715909, which seems sensible. Dropping snps with null values

Metadata: Mean chi^2 = 1.728 Lambda GC = 1.254 Max chi^2 = 1408.856 62701 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Thu Sep 9 10:22:05 2021 Total time elapsed: 4.0m:45.41s

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging of Trait 1 complete. SNPs remaining: 16430497

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 1: Dropped 18699 SNPs for duplicate values in the "snp_name" column Read in Trait 2 summary statistics (18116736 SNPs) from HDL_mtag_in.txt ...

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging Trait 2 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Interpreting column names as follows: snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. Read 18116736 SNPs from --sumstats file. Removed 1199 SNPs with missing values. Removed 0 SNPs with INFO <= None. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with SE <0 or NaN values. Removed 0 SNPs with out-of-bounds p-values. Removed 1685184 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped. 16430353 SNPs remain. Removed 18699 SNPs with duplicated rs numbers (16411654 SNPs remain). Removed 0 SNPs with N < 0.0 (16411654 SNPs remain). Median value of SIGNED_SUMSTAT was -0.00403410078196, which seems sensible. Dropping snps with null values

Metadata: Mean chi^2 = 1.786 Lambda GC = 1.311 Max chi^2 = 1408.856 58307 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Thu Sep 9 10:28:44 2021 Total time elapsed: 4.0m:43.8s

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging of Trait 2 complete. SNPs remaining: 16430353

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 2: Dropped 18699 SNPs for duplicate values in the "snp_name" column Read in Trait 3 summary statistics (18116736 SNPs) from TRIG_mtag_in.txt ...

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Munging Trait 3 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Interpreting column names as follows: snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. Read 18116736 SNPs from --sumstats file. Removed 1041 SNPs with missing values. ... Sigma hat: [[ 1.769 -0.134 0.288 0.131] [-0.134 1.41 -0.469 -0.043] [ 0.288 -0.469 1.087 0.123] [ 0.131 -0.043 0.123 1.098]] Mean chi^2 of SNPs used to estimate Omega is low for some SNPsMTAG may not perform well in this situation. Beginning estimation of Omega ... Using GMM estimator of Omega .. Checking for positive definiteness .. matrix is not positive definite, performing adjustment.. Warning: max number of iterations reached in adjustment procedure. Sigma matrix used is still non-positive-definite. Completed estimation of Omega ... Beginning MTAG calculations... ... Completed MTAG calculations. Writing Phenotype 1 to file ... Writing Phenotype 2 to file ... Writing Phenotype 3 to file ... Writing Phenotype 4 to file ... Summary of MTAG results:

Trait # SNPs used N (max) N (mean) GWAS mean chi^2 MTAG mean chi^2 GWAS equiv. (max) N 1 LDL2_mtag_in.txt 13874381 340951 340951 0.999 0.943 14945144 2 HDL_mtag_in.txt 13874381 290000 290000 1.285 1.193 196097 3 TRIG_mtag_in.txt 13874381 290000 290000 1.545 1.362 192295 4 SBP_mtag_in.txt 13874381 290000 290000 1.485 1.489 292886

Estimated Omega: [[-6.758e-09 2.021e-11 4.492e-11 -5.942e-13] [ 2.021e-11 1.387e-06 -4.635e-11 -1.009e-11] [ 4.492e-11 -4.635e-11 2.044e-06 6.511e-12] [-5.942e-13 -1.009e-11 6.511e-12 1.835e-06]]

(Correlation): [[ nan nan nan nan] [ nan 1.000e+00 -2.753e-05 -6.326e-06] [ nan -2.753e-05 1.000e+00 3.361e-06] [ nan -6.326e-06 3.361e-06 1.000e+00]]

Estimated Sigma: [[ 1.769 -0.134 0.288 0.131] [-0.134 1.41 -0.469 -0.043] [ 0.288 -0.469 1.087 0.123] [ 0.131 -0.043 0.123 1.098]]

(Correlation): [[ 1. -0.085 0.208 0.094] [-0.085 1. -0.379 -0.035] [ 0.208 -0.379 1. 0.113] [ 0.094 -0.035 0.113 1. ]]

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/JonJala/mtag/issues/141, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5LP67RKSPB345MKZADUBCK5ZANCNFSM5DW5GONA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

mojolin2 commented 3 years ago

Hi, thanks for the fast response and LDSC tips: The LDSC Intercept for LDL was:
Intercept: 1.1487 (0.0201), smaller than its Mean Chi^2: 2.4876 For comparison, the HDL intercept was similar: Intercept: 1.1667 (0.0197), Mean Chi^2: 2.7523 Within the MTAG results, which meta data value represents the the intercept value?

LDSC LDL outputs: Total Observed scale h2: 0.2703 (0.0643)

Lambda GC: 1.4729

Mean Chi^2: 2.4876

Intercept: 1.1487 (0.0201)

Ratio: 0.0999 (0.0135)

Analysis finished at Fri Sep 10 09:19:29 2021

Total time elapsed: 1.0m:12.13s

HDL: Total Observed scale h2: 0.3104 (0.0593)

Lambda GC: 1.7589

Mean Chi^2: 2.7523

Intercept: 1.1667 (0.0197)

Ratio: 0.0951 (0.0113)

Analysis finished at Fri Sep 10 09:27:20 2021

paturley commented 3 years ago

Hmm. Very puzzling. What is the command you are using to call MTAG?

On Fri, Sep 10, 2021 at 4:34 AM mojolin2 @.***> wrote:

Hi, thanks for the fast response and LDSC tips: The LDSC Intercept for LDL was: Intercept: 1.1487 (0.0201), smaller than its Mean Chi^2: 2.4876 For comparison, the HDL intercept was similar: Intercept: 1.1667 (0.0197), Mean Chi^2: 2.7523 Within the MTAG results, which meta data value represents the the intercept value?

LDSC LDL outputs: Total Observed scale h2: 0.2703 (0.0643)

Lambda GC: 1.4729

Mean Chi^2: 2.4876

Intercept: 1.1487 (0.0201)

Ratio: 0.0999 (0.0135)

Analysis finished at Fri Sep 10 09:19:29 2021

Total time elapsed: 1.0m:12.13s

HDL: Total Observed scale h2: 0.3104 (0.0593)

Lambda GC: 1.7589

Mean Chi^2: 2.7523

Intercept: 1.1667 (0.0197)

Ratio: 0.0951 (0.0113)

Analysis finished at Fri Sep 10 09:27:20 2021

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JonJala/mtag/issues/141#issuecomment-916732768, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5MVLX5UMA4GOFFB2O3UBG7KPANCNFSM5DW5GONA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

mojolin2 commented 3 years ago

nohup python ./mtag.py --sumstats HDL_mtag_in.txt,LDL_mtag_in.txt,TRIG_mtag_in.txt,CHOL_mtag_in.txt,SBP_mtag_in.txt --out ./HDL_LDL_TRIG_CHOL_SBP --n_min 0.0 --stream_stdout &

It is the latest version of MTAG

paturley commented 2 years ago

Hmm. This is still puzzling to me. You seem to have a ton of SNPs in your summary statistics, which makes me wonder if this is being caused by rare variants. What happens if you first filter out SNPs that have a MAF<.01 or .05?

On Sun, Sep 12, 2021 at 2:16 PM mojolin2 @.***> wrote:

nohup python ./mtag.py --sumstats HDL_mtag_in.txt,LDL_mtag_in.txt,TRIG_mtag_in.txt,CHOL_mtag_in.txt,SBP_mtag_in.txt --out ./HDL_LDL_TRIG_CHOL_SBP --n_min 0.0 --stream_stdout &

It is the latest version of MTAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JonJala/mtag/issues/141#issuecomment-917684430, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5L73MUQY34R3Q4BF7DUBTU6PANCNFSM5DW5GONA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

mojolin2 commented 2 years ago

Hi, sorry it took so long to test this. running the LDL GWAS with MAF >= .01 fixed the issue where the LD score intercept is greater than the mean chi2 stat. The warning message 'Sigma matrix used is still non-positive-definite.' is no longer produced. Not sure how removal of all rare variants does to PRS, but it is good to know and big thanks for the support.

paturley commented 2 years ago

No worries. In general, I worry that MTAG's assumptions may be violated if you include both common and rare SNPs. If you really would like to include the rare SNPs in your PRS, you could always merge the MTAG summary statistics with the GWAS summary statistics for SNPs that were omitted, but I suspect that it wouldn't lead to large improvements in power.

On Fri, Oct 8, 2021 at 5:44 AM mojolin2 @.***> wrote:

Hi, sorry it took so long to test this. running the LDL GWAS with MAF >= .01 fixed the issue where the LD score intercept is greater than the mean chi2 stat. The warning message 'Sigma matrix used is still non-positive-definite.' is no longer produced. Not sure how removal of all rare variants does to PRS, but it is good to know and big thanks for the support.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JonJala/mtag/issues/141#issuecomment-938503003, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5PNI3QIZBDMZWAHM2DUF24Q7ANCNFSM5DW5GONA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

mojolin2 commented 2 years ago

Good to know and thanks again.