JonJala / mtag

Python command line tool for Multi-Trait Analysis of GWAS (MTAG)
GNU General Public License v3.0
171 stars 54 forks source link

p-value out of bound and SVD did not converge #85

Open Qiaolan opened 4 years ago

Qiaolan commented 4 years ago

Hi,

Thanks in advance!

I encountered two problems:

  1. WARNING: 159 SNPs had P outside of (0,1]. The P column may be mislabeled.

    • I do check my data in R and make sure that all values in the "pval" column are between(0,1].
  2. Intel MKL ERROR: Parameter 4 was incorrect on entry to DGELSD. SVD did not converge in Linear Least Squares

Any help would be appreciated!

paturley commented 4 years ago

Can you post the log file from your analysis? It may help diagnose what's going on.

Also, have you verified that all the p-values are numeric?

On Wed, Jan 22, 2020, 3:09 PM Qiaolan notifications@github.com wrote:

Hi,

Thanks in advance!

I encountered two problems:

  1. WARNING: 159 SNPs had P outside of (0,1]. The P column may be mislabeled.

    • I do check my data in R and make sure that all values in the "pval" column are between(0,1].
  2. Intel MKL ERROR: Parameter 4 was incorrect on entry to DGELSD. SVD did not converge in Linear Least Squares

    • I have no idea about this error.

Any help would be appreciated!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/omeed-maghzian/mtag/issues/85?email_source=notifications&email_token=AFBUB5ILC7IPUERT5P45443Q7CRWPA5CNFSM4KKLX6BKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IIBZSMA, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5LG3H3HKNAKEH5SYU3Q7CRWPANCNFSM4KKLX6BA .

Qiaolan commented 4 years ago

2020/01/22/02:55:53 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> <> <> MTAG: Multi-trait Analysis of GWAS <> Version: 1.0.8 <> (C) 2017 Omeed Maghzian, Raymond Walters, and Patrick Turley <> Harvard University Department of Economics / Broad Institute of MIT and Harvard <> GNU General Public License v3 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> <> Note: It is recommended to run your own QC on the input before using this program. <> Software-related correspondence: maghzian@nber.org <> All other correspondence: paturley@broadinstitute.org <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Calling ./mtag.py \ --stream-stdout \ --n-min 0.0 \ --sumstats mtag_amd.txt,mtag_thick.txt \ --out ./amd_result

2020/01/22/02:55:53 PM Beginning MTAG analysis... 2020/01/22/02:55:53 PM MTAG will use the Z column for analyses. 2020/01/22/02:56:06 PM Read in Trait 1 summary statistics (8388784 SNPs) from mtag_amd.txt ... 2020/01/22/02:56:06 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 2020/01/22/02:56:06 PM Munging Trait 1 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><>< 2020/01/22/02:56:06 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 2020/01/22/02:56:06 PM Interpreting column names as follows: 2020/01/22/02:56:06 PM snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

2020/01/22/02:56:06 PM Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. 2020/01/22/02:56:10 PM WARNING: 159 SNPs had P outside of (0,1]. The P column may be mislabeled. 2020/01/22/02:56:14 PM Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. 2020/01/22/02:56:14 PM NumExpr defaulting to 8 threads. 2020/01/22/02:56:19 PM Read 8388784 SNPs from --sumstats file. Removed 0 SNPs with missing values. Removed 0 SNPs with INFO <= None. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with SE <0 or NaN values. Removed 159 SNPs with out-of-bounds p-values. Removed 176851 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped. 8211774 SNPs remain. 2020/01/22/02:56:24 PM Removed 0 SNPs with duplicated rs numbers (8211774 SNPs remain). 2020/01/22/02:56:25 PM Removed 0 SNPs with N < 0.0 (8211774 SNPs remain). 2020/01/22/02:57:46 PM Median value of SIGNED_SUMSTAT was -0.00250663089957, which seems sensible. 2020/01/22/02:57:47 PM Dropping snps with null values 2020/01/22/02:57:47 PM Metadata: 2020/01/22/02:57:48 PM Mean chi^2 = 1.141 2020/01/22/02:57:48 PM Lambda GC = 1.0 2020/01/22/02:57:48 PM Max chi^2 = 1473.007 2020/01/22/02:57:48 PM 4842 Genome-wide significant SNPs (some may have been removed by filtering). 2020/01/22/02:57:48 PM Conversion finished at Wed Jan 22 14:57:48 2020 2020/01/22/02:57:48 PM Total time elapsed: 1.0m:42.48s 2020/01/22/02:58:01 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 2020/01/22/02:58:01 PM Munging of Trait 1 complete. SNPs remaining: 8211774 2020/01/22/02:58:01 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

2020/01/22/02:58:28 PM Read in Trait 2 summary statistics (8388784 SNPs) from mtag_thick.txt ... 2020/01/22/02:58:28 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 2020/01/22/02:58:28 PM Munging Trait 2 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><>< 2020/01/22/02:58:28 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 2020/01/22/02:58:28 PM Interpreting column names as follows: 2020/01/22/02:58:28 PM snpid: Variant ID (e.g., rs number) n: Sample size a1: a1, interpreted as ref allele for signed sumstat. pval: p-Value a2: a2, interpreted as non-ref allele for signed sumstat. z: Directional summary statistic as specified by --signed-sumstats.

2020/01/22/02:58:28 PM Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time. 2020/01/22/02:58:41 PM Read 8388784 SNPs from --sumstats file. Removed 0 SNPs with missing values. Removed 0 SNPs with INFO <= None. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with SE <0 or NaN values. Removed 0 SNPs with out-of-bounds p-values. Removed 176855 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped. 8211929 SNPs remain. 2020/01/22/02:58:46 PM Removed 0 SNPs with duplicated rs numbers (8211929 SNPs remain). 2020/01/22/02:58:47 PM Removed 0 SNPs with N < 0.0 (8211929 SNPs remain). 2020/01/22/03:01:19 PM Median value of SIGNED_SUMSTAT was -0.000274727506475, which seems sensible. 2020/01/22/03:01:19 PM Dropping snps with null values 2020/01/22/03:01:20 PM Metadata: 2020/01/22/03:01:20 PM Mean chi^2 = 2.288 2020/01/22/03:01:20 PM Lambda GC = 3.21 2020/01/22/03:01:20 PM Max chi^2 = 522.536 2020/01/22/03:01:21 PM 13302 Genome-wide significant SNPs (some may have been removed by filtering). 2020/01/22/03:01:21 PM Conversion finished at Wed Jan 22 15:01:21 2020 2020/01/22/03:01:21 PM Total time elapsed: 2.0m:53.08s 2020/01/22/03:01:33 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 2020/01/22/03:01:33 PM Munging of Trait 2 complete. SNPs remaining: 8211929 2020/01/22/03:01:33 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

2020/01/22/03:01:52 PM Dropped 1261178 SNPs due to strand ambiguity, 6950596 SNPs remain in intersection after merging trait1 2020/01/22/03:02:08 PM Dropped 3926 SNPs due to inconsistent allele pairs from phenotype 2. 6946665 SNPs remain. 2020/01/22/03:02:13 PM Flipped the signs of of 68 SNPs to make them consistent with the effect allele orderings of the first trait. 2020/01/22/03:02:17 PM Dropped 0 SNPs due to strand ambiguity, 6946665 SNPs remain in intersection after merging trait2 2020/01/22/03:02:17 PM ... Merge of GWAS summary statistics complete. Number of SNPs: 6946665 2020/01/22/03:02:35 PM Using 6946665 SNPs to estimate Omega (0 SNPs excluded due to strand ambiguity) 2020/01/22/03:02:35 PM Estimating sigma.. 2020/01/22/03:03:11 PM SVD did not converge in Linear Least Squares Traceback (most recent call last): File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/mtag.py", line 1567, in mtag(args) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/mtag.py", line 1351, in mtag args.sigma_hat = estimate_sigma(DATA[not_SA], args) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/mtag.py", line 468, in estimate_sigma rg_results = sumstats_sig.estimate_rg(args_ldsc_rg, Logger_to_Logging()) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/sumstats.py", line 445, in estimate_rg rghat = _rg(loop, args, log, M_annot, ref_ld_cnames, w_ld_cname, k, i) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/sumstats.py", line 593, in _rg intercept_gencov=intercepts[2], n_blocks=n_blocks, twostep=args.two_step) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/regressions.py", line 689, in init slow=slow, twostep=twostep) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/regressions.py", line 347, in init slow=slow, step1_ii=step1_ii, old_weights=old_weights) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/regressions.py", line 214, in init x, yp, update_func, n_blocks, slow=slow, w=initial_w) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/irwls.py", line 67, in init x, y, update_func, n_blocks, w, slow=slow, separators=separators) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/irwls.py", line 114, in irwls new_w = np.sqrt(update_func(cls.wls(x, y, w))) File "/fs/project/PAS1501/gaolab/software/MTAG/mtag/ldsc_mod/ldscore/irwls.py", line 162, in wls coef = np.linalg.lstsq(x, y) File "/users/PAS1149/osu10039/.conda/envs/env_mtag/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 2236, in lstsq x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj) File "/users/PAS1149/osu10039/.conda/envs/env_mtag/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 109, in _raise_linalgerror_lstsq raise LinAlgError("SVD did not converge in Linear Least Squares") LinAlgError: SVD did not converge in Linear Least Squares 2020/01/22/03:03:11 PM Analysis terminated from error at Wed Jan 22 15:03:11 2020 2020/01/22/03:03:11 PM Total time elapsed: 7.0m:18.09s

Qiaolan commented 4 years ago

Can you post the log file from your analysis? It may help diagnose what's going on. Also, have you verified that all the p-values are numeric?

It is posted. Yes, I checked them in R by class() function and showed "numeric".

paturley commented 4 years ago

It looks like the program is failing during the LDSC step. Does LD Score Regression run on your data if you run it directly?

On Wed, Jan 22, 2020, 3:18 PM Qiaolan notifications@github.com wrote:

Can you post the log file from your analysis? It may help diagnose what's going on. Also, have you verified that all the p-values are numeric? … <#m2924768818417446856>

It is posted. Yes, I checked them in R by class() function and showed "numeric".

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/omeed-maghzian/mtag/issues/85?email_source=notifications&email_token=AFBUB5PXOYNJ3NEA3V7TDJ3Q7CSXVA5CNFSM4KKLX6BKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJU6XBY#issuecomment-577366919, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5JIKXDN2ENHRES5BCDQ7CSXVANCNFSM4KKLX6BA .

Qiaolan commented 4 years ago

It looks like the program is failing during the LDSC step. Does LD Score Regression run on your data if you run it directly?

I don't know. Sorry, I am new to this field and I have little knowledge about ldsc.

Qiaolan commented 4 years ago

It looks like the program is failing during the LDSC step. Does LD Score Regression run on your data if you run it directly?

I searched for LinAlgError: SVD did not converge in Linear Least Squares in Python community. It seems that inf or NaN in the data can lead to this error. Could you please give me any advice? I mean any possible problem in my data can lead to inf or NaN. Thanks!

Qiaolan commented 4 years ago

It looks like the program is failing during the LDSC step. Does LD Score Regression run on your data if you run it directly?

Excuse me, one more information. Originally, I had over 1 million SNPs. Just now, I tested a small sample that has only 1000 SNPs and MTAG ran successfully. Thanks!

paturley commented 4 years ago

This makes me suspect again that there may be inf or NaN values in your p-value column. Even if the column is typed as numeric, some of the values still may be inf or NaN. Can you verify that no SNP has a value of inf or NaN?

On Wed, Jan 22, 2020 at 3:57 PM Qiaolan notifications@github.com wrote:

It looks like the program is failing during the LDSC step. Does LD Score Regression run on your data if you run it directly? … <#m7959150426487303773>

Excuse, one more information. Originally, I had over 1 million SNPs. Just now, I tested a small sample that has only 1000 SNPs and MTAG ran successfully. Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/omeed-maghzian/mtag/issues/85?email_source=notifications&email_token=AFBUB5LD4XA7SVYIZOTFFYLQ7CXKBA5CNFSM4KKLX6BKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJVCLDA#issuecomment-577381772, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5PJPNYUDV2US5UHJI3Q7CXKBANCNFSM4KKLX6BA .

Qiaolan commented 4 years ago

This makes me suspect again that there may be inf or NaN values in your p-value column. Even if the column is typed as numeric, some of the values still may be inf or NaN. Can you verify that no SNP has a value of inf or NaN?

Thanks for your reply. I checked my data in R and it looks like "pval" column is good.

class(dta$pval) [1] "numeric" summary(dta$pval) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0000 0.2480 0.5000 0.4991 0.7510 1.0000

sum(is.finite(dta$pval)==0) [1] 0 sum(is.na(dta$pval)==1) [1] 0

Qiaolan commented 4 years ago

This makes me suspect again that there may be inf or NaN values in your p-value column. Even if the column is typed as numeric, some of the values still may be inf or NaN. Can you verify that no SNP has a value of inf or NaN?

Just now, I tested a sample of 10000 SNPs. There is no warning of out-of-bound p-values, but still not converge...

paturley commented 4 years ago

What happens if you drop the SNPs with p-values exactly equal to 0?

On Thu, Jan 23, 2020 at 11:02 AM Qiaolan notifications@github.com wrote:

This makes me suspect again that there may be inf or NaN values in your p-value column. Even if the column is typed as numeric, some of the values still may be inf or NaN. Can you verify that no SNP has a value of inf or NaN? … <#m-1956844415793548177>

Just now, I tested a sample of 10000 SNPs. There is no warning of out-of-bound p-values, but still not converge...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/omeed-maghzian/mtag/issues/85?email_source=notifications&email_token=AFBUB5PYPO654POMX3KJVW3Q7G5QTA5CNFSM4KKLX6BKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJX3Q2Y#issuecomment-577747051, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5PC4KMUIMGG5IFZA6DQ7G5QTANCNFSM4KKLX6BA .

Qiaolan commented 4 years ago

What happens if you drop the SNPs with p-values exactly equal to 0?

Problem solved. Z values have some inf/-inf, but somehow R did not recognize them (or I forgot to chekck). Thanks for your patience!

paturley commented 4 years ago

No problem. Glad we sorted it out.

On Thu, Jan 23, 2020 at 1:54 PM Qiaolan notifications@github.com wrote:

What happens if you drop the SNPs with p-values exactly equal to 0? … <#m-1288321431407798939>

Problem solved. Z values have some inf/-inf, but somehow R did not recognize them (or I forgot to chekck). Thanks for your patience!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/omeed-maghzian/mtag/issues/85?email_source=notifications&email_token=AFBUB5N7NE3Y3OS2CCH6EDTQ7HRVJA5CNFSM4KKLX6BKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJYN64Y#issuecomment-577822579, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5KRR66H7ZPSTXRHXR3Q7HRVJANCNFSM4KKLX6BA .