JonJala / mtag

Python command line tool for Multi-Trait Analysis of GWAS (MTAG)
GNU General Public License v3.0

Error in log file and abnormal results #79

Open Lassana1 opened 4 years ago

Lassana1 commented 4 years ago

Hi all, I ran MTAG for 6 traits. The summary statistics were from GWAS using mixed models. Trait 5 was categorical, and all others continuous. To run MTAG, I converted the z score of trait 5 to beta and SE values.

  1. I am attaching the log of the run. It says "analysis terminated from error". I am not sure what the error is.
  2. However, all 6 results files were in the output. I checked the results and there are about 1000 - 70000 snps of interest (pval <5e-08) for each trait.

These seem like very abnormal results to me. Any help is appreciated :) mtag.alltrait.log https://github.com/omeed-maghzian/mtag/files/3703145/mtag.alltrait.log

paturley commented 4 years ago

I think your sample size may be too small for MTAG to behave well. It looks like the mean chi2 statistic for trait 5 is less than 1, which causes subsequent steps of MTAG to produce irrational results. Have you tried omitting summary statistics for traits with a mean chi2 less than one?
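
A quick way to check this before rerunning is to compute the mean chi2 for each trait directly from its summary statistics, as the mean of (beta / SE)^2 over SNPs. The sketch below is only illustrative. It assumes tab-delimited input, borrows the column names `Est` and `SE` from the `--beta_name` and `--se_name` flags posted in this thread, and uses a made-up helper `mean_chi2` with toy rows. It is not MTAG's internal code.

```python
import csv
import io
import math


def mean_chi2(rows, beta_col="Est", se_col="SE"):
    """Mean of (beta / se)^2 over rows with usable numeric values."""
    total, count = 0.0, 0
    for row in rows:
        try:
            beta, se = float(row[beta_col]), float(row[se_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric fields
        if math.isnan(beta) or math.isnan(se) or se <= 0:
            continue  # skip NaN values and unusable standard errors
        total += (beta / se) ** 2
        count += 1
    if count == 0:
        raise ValueError("no usable rows")
    return total / count


# Toy summary statistics (hypothetical values, not from the attached log).
toy_file = io.StringIO(
    "rsID\tEst\tSE\n"
    "rs1\t0.10\t0.10\n"
    "rs2\t-0.20\t0.10\n"
    "rs3\tNA\t0.10\n"
)
toy_mean = mean_chi2(csv.DictReader(toy_file, delimiter="\t"))
print(toy_mean)
```

For a real file, replace `toy_file` with `open("p1.txt")` or similar. A trait whose value comes out below 1 is a candidate to drop.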

Lassana1 commented 4 years ago

Thanks! Yes, I tried running all the other traits. I get a warning message "Removed 2783019 SNPs with missing values". What is the meaning of this error?

Lassana1 commented 4 years ago

Hi, I am attaching my log here. I reran a new analysis with traits where chi2 > 1, and also changed the freq values (my old calculations were wrong).

  1. I am getting the same issue where snps of interest in the results vary between 1000-70000 for the traits.
  2. I am not sure what the missing values are in this message: "Removed 2783019 SNPs with missing values".
  3. There is an error at the end: "ValueError: cannot convert float NaN to integer".

This is my program: python /home/lsamarakoon/mtag/mtag.py --sumstat p1.txt,p2.txt,p3.txt,p4.txt,gscore.txt --snp_name rsID --chr_name chr --bpos_name position --a1_name alleleA --a2_name alleleB --eaf_name ea_freq --beta_name Est --se_name SE --n_name n --p_name Wald.pval --force --use_beta_se --out mtag_results/onlyconttraits_p1-4_gscore.txt &

Is there any way I could improve this analysis? Or do you think that all the issues are due to the sample size? allconttraits.log.txt

Thanks so much for all the help.

Lassana1 commented 4 years ago

Looking at the results from this analysis, some traits have a lot of missing mtag_pvalues. These are the numbers of missing p values:

    trait1: 147347
    trait2: 8851014
    trait3: 8883262
    trait4: 79833
    trait5: 0
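
Counts like these can be reproduced for any input or output table by tallying missing entries per column. The following is a rough sketch, not MTAG's own filter: the set of tokens treated as missing, the helper `count_missing`, and the column names `mtag_z` and `mtag_pval` in the toy table are all assumptions made for illustration.

```python
import csv
import io

# Tokens treated as missing here; this set is an assumption.
MISSING_TOKENS = {"", "NA", "NaN", "nan", "."}


def count_missing(rows, columns):
    """Count missing entries per column in an iterable of dict rows."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() in MISSING_TOKENS:
                counts[column] += 1
    return counts


# Toy results table (hypothetical values).
toy_results = io.StringIO(
    "SNP\tmtag_z\tmtag_pval\n"
    "rs1\t1.2\t0.23\n"
    "rs2\tNA\tNA\n"
    "rs3\t0.4\t\n"
)
missing_counts = count_missing(
    csv.DictReader(toy_results, delimiter="\t"), ["mtag_z", "mtag_pval"]
)
print(missing_counts)
```

Running the same count on the input files, over the columns named on the command line, may show which field is responsible for the removed SNPs.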

paturley commented 4 years ago

I think the problem is due to the small sample size. MTAG uses estimates of the genetic correlation and sample overlap using the summary statistics, but if the sample size is too small, sometimes these estimates are not sensible. (E.g., a sample overlap greater than 100% or negative variance.) I think that is what is happening here. From the log file, it looks as if traits 2 and 5 are very highly phenotypically correlated. Is that true? What happens if you drop trait 2?

Lassana1 commented 4 years ago

Hi Patrick, you are right! Actually trait 5 was highly correlated with all the other traits. So I dropped traits 2 and 5, and ran MTAG on 3 traits of interest. The results look normal now.

In my gwas summary statistics for the 3 traits:

    trait   snps of interest (p<5e-08)
    1       6
    2       3
    3       2

In my MTAG analysis log, trait 1 shows "6 genome-wide signif. snps", and trait 2 shows "3 genome-wide significant snps", BUT trait 3 shows "0 genome-wide significant snps". I checked the results for trait3, and the significant snps seem to have been dropped. In my summary stats for trait3, I realized the "rsID"s were missing for these 2 snps of interest. So, I ran another MTAG analysis using "snpName" as the snp name column. Both rsID and snpName seem to drop a significant number of snps/variants (even though my maf_min=0). Both of these analyses still show "0 genome-wide significant snps" for trait3.

Q1) Are you able to explain why only the trait3 snps of interest are not detected by MTAG?

Q2) In the analysis summary, the "GWAS equiv.(max)N" is negative only for trait2. What does it mean?

Thank you so much! rsid.mafmin.strandamb.txt snpname.mafmin.strandamb.txt

I am attaching both log files here

Lassana1 commented 4 years ago

Hi, I also noticed a warning in the log file: "sigma matrix used is still not positive definite". Does this invalidate the results output from the mtag analysis?

Lassana1 commented 4 years ago

Hi, so I repeated the above analysis, but included variants. I used the methods proposed in issue #59 to "change no_alleles=False to no_alleles=True in line [154] and [156] of mtag.py". In this log, for trait 2, the 2 genome wide significant snps are identified. yay! :) However, it crashes with an "IDID" error. Any suggestions to remedy this? Attached is my log. snpname.mafmin.strandamb.includevariants.txt

paturley commented 4 years ago

Hi,

A non-positive definite matrix is likely due to a very small sample size interacting with perfect sample overlap and high phenotypic correlation. MTAG will behave pretty erratically if Sigma is not positive definite. Is it the case that you have perfect sample overlap across your different sets of summary statistics? In that case, you could just pass in the correlation matrix of the phenotypes as your Sigma matrix. MTAG will not be doing a stratification correction in that case, but with a dataset as small as the one it looks like you are using, the correction won't be very reliable anyway.
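
For reference, whether a candidate Sigma is positive definite can be checked before handing it to MTAG. The sketch below attempts a Cholesky factorisation, which succeeds only for a symmetric positive definite matrix. The function name `is_positive_definite`, the tolerance, and both toy matrices are assumptions for illustration. This is not the check MTAG performs internally.

```python
import math


def is_positive_definite(matrix, tol=1e-12):
    """Return True if a symmetric matrix admits a Cholesky factorisation."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False  # non-positive pivot: not positive definite
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


# Toy matrices (hypothetical). The second implies a correlation above 1,
# the kind of estimate that can arise when the sample is too small.
well_behaved = [[1.0, 0.3], [0.3, 1.0]]
implausible = [[1.0, 1.2], [1.2, 1.0]]

print(is_positive_definite(well_behaved))
print(is_positive_definite(implausible))
```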

Re the "IDID" error, I'm not totally sure what is causing that. I suspect there are SNPs that have unusual values for allele 1 and 2 that would have been removed before you manually edited the code but which were not removed afterwards. You may be able to resolve this by doing some QC on your data and removing rows where the alleles look fishy.
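
One possible form of that QC is sketched below. What counts as a fishy allele is a judgment call, and the cause of the "IDID" error is not confirmed in this thread. The rule here, which is an assumption, keeps only rows where both alleles are single A/C/G/T bases and differ from each other. The column names `alleleA` and `alleleB` come from the command posted earlier in this thread. The helper name and the toy rows are made up.

```python
VALID_BASES = {"A", "C", "G", "T"}


def has_clean_alleles(row, a1_col="alleleA", a2_col="alleleB"):
    """True if both alleles are single, distinct A/C/G/T bases."""
    a1 = (row.get(a1_col) or "").strip().upper()
    a2 = (row.get(a2_col) or "").strip().upper()
    return a1 in VALID_BASES and a2 in VALID_BASES and a1 != a2


# Toy rows (hypothetical) covering a few ways alleles can look odd.
toy_rows = [
    {"rsID": "rs1", "alleleA": "A", "alleleB": "G"},
    {"rsID": "rs2", "alleleA": "AT", "alleleB": "A"},  # multi-base allele
    {"rsID": "rs3", "alleleA": "C", "alleleB": "C"},   # identical alleles
    {"rsID": "rs4", "alleleA": "t", "alleleB": "c"},   # lower case, accepted
    {"rsID": "rs5", "alleleA": "", "alleleB": "G"},    # missing allele
]
kept = [row["rsID"] for row in toy_rows if has_clean_alleles(row)]
print(kept)
```

Listing the rejected rows first, rather than dropping them silently, makes it easier to see whether the filter is removing SNPs of interest.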

Best, Patrick

Lassana1 commented 4 years ago

Hi Patrick, thanks so much! Sorry, I should have mentioned before that my samples have a perfect overlap, and they are summary stats from the HCHS SOL cohort. I will try using the phenotype correlation matrix. Thanks!

Lassana1 commented 4 years ago

Hi Patrick, below is the correlation matrix for my phenotypes (calculated using the Spearman method on the beta estimates for each trait), which I passed to the MTAG analysis using the --residcov_path option.

                 [,1]         [,2]         [,3]
    [1,]  1.000000000 -0.002131698  0.001886724
    [2,] -0.002131698  1.000000000 -0.002261145
    [3,]  0.001886724 -0.002261145  1.000000000

Attached is my log. usingphenotypematrix.txt

While there are no errors in the log, the summary statistics for the three traits (mtag_z) are identical, and so are the snps of interest. These results do not make any sense. Any help to resolve this is appreciated :)

paturley commented 4 years ago

Do you have access to the raw data? What I meant was to use the Pearson correlation of the phenotypic values. This is only appropriate, however, if there is perfect sample overlap between your sets of summary statistics.
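
To make the distinction concrete: the correlation is taken across individuals, on the phenotype values themselves, not across SNPs on the beta estimates. A minimal sketch follows, assuming the raw phenotypes are available as one list per trait with `None` marking a missing value. The helper names and toy values are hypothetical. Restricting to individuals with all traits observed is one choice among several.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def phenotype_correlation(columns):
    """Correlation matrix using only individuals with every trait observed."""
    complete = [row for row in zip(*columns) if all(v is not None for v in row)]
    traits = list(zip(*complete))
    size = len(traits)
    return [
        [1.0 if i == j else pearson(traits[i], traits[j]) for j in range(size)]
        for i in range(size)
    ]


# Toy phenotypes for five individuals (hypothetical values).
trait_a = [1.0, 2.0, 3.0, 4.0, None]
trait_b = [2.0, 4.1, 5.9, 8.2, 1.0]
trait_c = [4.0, 3.0, None, 1.0, 2.0]

corr_matrix = phenotype_correlation([trait_a, trait_b, trait_c])
for row in corr_matrix:
    print([round(value, 3) for value in row])
```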

Lassana1 commented 4 years ago

Hi Patrick, thank you for the clarification and all your help. I do have the raw data. A sample set of 7605 was used to carry out all GWAS analysis. However, due to missing data, the N used in the GWAS were: trait1- 7372, trait2-7562, trait3-7446. So I guess this means the sample overlap is not perfect?
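
The numbers above are enough to bound how many individuals any two GWAS must share, even without the per-individual missingness pattern. The sketch below assumes every per-trait sample is a subset of the same 7605 individuals. The helper `overlap_bounds` is made up for illustration.

```python
from itertools import combinations


def overlap_bounds(n_a, n_b, n_total):
    """Smallest and largest possible number of shared individuals."""
    lower = max(0, n_a + n_b - n_total)  # inclusion-exclusion lower bound
    upper = min(n_a, n_b)                # one sample nested in the other
    return lower, upper


n_total = 7605
n_per_trait = {"trait1": 7372, "trait2": 7562, "trait3": 7446}

bounds = {
    (a, b): overlap_bounds(n_per_trait[a], n_per_trait[b], n_total)
    for a, b in combinations(n_per_trait, 2)
}
for pair, (lower, upper) in bounds.items():
    print(pair, "shared individuals between", lower, "and", upper)
```

With the raw data at hand, the exact overlap can be counted directly instead of bounded.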

Lassana1 commented 4 years ago

Hi, so I calculated a phenotype matrix, assuming the sample overlap is perfect, which is pretty close to the sigma hat calculated in mtag.

phenotype matrix:

    [[ 1.     0.363  0.471]
     [ 0.363  1.     0.322]
     [ 0.471  0.322  1.   ]]

sigma hat calculated in mtag:

    [[ 1.013  0.299  0.436]
     [ 0.299  1.012  0.286]
     [ 0.436  0.286  0.999]]

Even though the results look good when using the phenotype matrix, I am not sure if I can use them because the sample overlap might not be perfect (details in previous comment). What do you think?
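
To put a number on "pretty close", the two matrices can be compared entry by entry. The values below are copied from this comment. The helper `max_abs_difference` is made up for illustration, and the comparison says nothing about whether the remaining gap matters for the analysis.

```python
phenotype_corr = [
    [1.0, 0.363, 0.471],
    [0.363, 1.0, 0.322],
    [0.471, 0.322, 1.0],
]
sigma_hat = [
    [1.013, 0.299, 0.436],
    [0.299, 1.012, 0.286],
    [0.436, 0.286, 0.999],
]


def max_abs_difference(a, b):
    """Largest absolute entry-wise difference between two matrices."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


largest_gap = max_abs_difference(phenotype_corr, sigma_hat)
print(round(largest_gap, 3))
```

The largest gap is in the trait 1 and trait 2 entry (0.363 against 0.299).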

Lassana1 commented 4 years ago

Can you explain the theory/assumptions behind passing the phenotype cov matrix as the sigma matrix when sample overlap is perfect? Sorry for the overload of questions! and thanks so much!!!

paturley commented 4 years ago

I'm in Houston for ASHG, so I have limited bandwidth to respond to this before next week. Sorry for the delay.

Lassana1 commented 4 years ago

That is no problem!!! Thanks for all the help!

Lassana1 commented 4 years ago

Hi :) Any help is appreciated for my questions. Thanks in advance :)

paturley commented 4 years ago

Hello,

Sorry for the delays here. It's been a very busy couple of weeks.

I think that the overlap is probably high enough for you to be OK, but you'll need to highlight in your paper that (1) your estimates are an approximation due to the imperfect overlap and (2) by using the correlation matrix, your estimates are not corrected for pop strat like standard MTAG estimates are.

I thought the theory was in the MTAG paper itself, but it looks like we didn't include it in the end. It's not a super hard proof though if you want to work through it. Just assume a null model for two standardized correlated traits, and show that their z stats for a simple regression of each phenotype on some standardized (null) SNP have the same correlation as the phenotypic correlation. You'll need to assume that you have a perfectly overlapping random sample, and you should probably use the asymptotic approximation that the standard error for each estimate is 1/sqrt(N).
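
The claim can also be checked numerically before working through the algebra. The simulation below is only a sketch under the stated assumptions: two standardized traits with correlation `rho`, the same individuals in both GWAS, null SNPs drawn independently of the traits, and z computed with the 1/sqrt(N) approximation for the standard error. All function names, parameter values, and the seed are made up for illustration.

```python
import math
import random


def standardize(values):
    """Centre to mean 0 and scale to variance 1 within the sample."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    xs, ys = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(xs, ys)) / len(xs)


def simulate_null_z(n_people=300, n_snps=2000, rho=0.5, maf=0.3, seed=1):
    """Return (phenotypic correlation, correlation of z stats across SNPs)."""
    rng = random.Random(seed)
    y1_raw, y2_raw = [], []
    for _ in range(n_people):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        y1_raw.append(e1)
        y2_raw.append(rho * e1 + math.sqrt(1 - rho ** 2) * e2)
    y1, y2 = standardize(y1_raw), standardize(y2_raw)

    z1, z2 = [], []
    for _ in range(n_snps):
        # Null SNP: allele counts unrelated to either trait.
        geno = [
            int(rng.random() < maf) + int(rng.random() < maf)
            for _ in range(n_people)
        ]
        if len(set(geno)) == 1:
            continue  # monomorphic in this sample, cannot be standardized
        g = standardize(geno)
        # beta_hat = mean(g * y) and SE ~ 1/sqrt(N), so z = sum(g * y) / sqrt(N).
        z1.append(sum(a * b for a, b in zip(g, y1)) / math.sqrt(n_people))
        z2.append(sum(a * b for a, b in zip(g, y2)) / math.sqrt(n_people))
    return pearson(y1, y2), pearson(z1, z2)


pheno_corr, z_corr = simulate_null_z()
print(round(pheno_corr, 3), round(z_corr, 3))
```

The two printed values should be close, with the gap shrinking as `n_snps` grows. Changing the setup so that the two traits are measured on only partly shared individuals is a natural extension for looking at imperfect overlap.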

Sorry I can't be more help.

Lassana1 commented 4 years ago

Hi Patrick, no worries! Thanks so much! I will work on the proof and look through my analysis again. This helps a lot, thanks!!!