Open julyankb opened 6 years ago
Hi @julyankb ,
Could you try lowering the precision of the p-values when you pre-process/QC the input data? The error happened when MTAG called an LDSC module to munge the datasets. The warnings in "allpvalues.log" below indicate that Python no longer recognizes those extreme values as positive: they underflow to zero when parsed as floats.
...
2018/03/19/01:44:52 PM WARNING: 61 SNPs had P outside of (0,1]. The P column may be mislabeled.
...
2018/03/19/01:59:31 PM WARNING: 65536 SNPs had P outside of (0,1]. The P column may be mislabeled.
Here is a quick test from IPython that hopefully illustrates the issue.
In [1]: test=2.2e-1316
In [2]: test>0
Out[2]: False
In [3]: test<0
Out[3]: False
In [4]: test==0
Out[4]: True
We will try to build more checks into MTAG to inform people about issues such as this in their data.
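A pre-flight check along these lines could flag the problem before munging. This is only a sketch, not MTAG or LDSC code: `find_underflowing_pvalues` is a hypothetical helper, and it assumes the P-values are still available as the raw strings from the input file. It uses the standard-library `decimal` module to pick out values that are positive as written but become 0.0 once parsed as a Python float.

```python
from decimal import Decimal


def find_underflowing_pvalues(p_strings):
    """Return indices of P-values that are > 0 as written but 0.0 as a float.

    Hypothetical helper for illustration; not part of MTAG or LDSC.
    """
    flagged = []
    for i, text in enumerate(p_strings):
        exact = Decimal(text)    # exact decimal value, no underflow
        as_float = float(text)   # IEEE 754 double, may underflow to 0.0
        if exact > 0 and as_float == 0.0:
            flagged.append(i)
    return flagged


p_column = ["0.5", "1.3E-582", "2.2e-1316", "1e-300", "0"]
print(find_underflowing_pvalues(p_column))  # [1, 2]
```

A literal "0" is not flagged here, since it is not an underflow; it would still fall outside (0,1] and trigger the warning shown in the log.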
Thanks, Hui
I'm mostly just surprised you are getting p-values that small. Either they must be for super rare alleles or they have implausibly large effect sizes. Can you verify that these SNPs don't have anything funny about them that means you should drop them?
Hi,
I work with @julyankb and can provide additional perspective. The SNPs in question are common variants of relatively large effect (0.1). On a relatively small cohort (N=8000) we obtain a P-value ~ 1e-30.
As you can imagine, the same SNPs in UK Biobank (UKB) will have very low P-values. UKB is the cohort in which we want to use MTAG on two correlated traits.
These are GWAS signals that have been replicated in multiple peer-reviewed publications. Picking a random SNP at one of these GWAS signals in UKB, I get EAF of 0.55, BETA -0.09 and P of 1.3E-582.
I encountered the same issue in R, and solved it by using the Rmpfr package. For Python there exists mpmath. Clearly, this may entail significant refactoring of munge_stats.py.
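As a lighter-weight illustration of the arbitrary-precision idea, the sketch below swaps in Python's standard-library `decimal` module rather than mpmath, and stores each P-value as -log10(P) so that it fits comfortably in an ordinary float. `neg_log10_p` is a hypothetical helper, not something that exists in MTAG or LDSC, and it assumes the P-value is read as text from the input file.

```python
from decimal import Decimal


def neg_log10_p(p_text):
    """Return -log10(P) as a float, computed from the P-value's text form."""
    p = Decimal(p_text)
    if not (0 < p <= 1):
        raise ValueError(f"P-value out of (0, 1]: {p_text}")
    # Decimal handles exponents far beyond the range of a double.
    return float(-p.log10())


for p_text in ["1.3E-582", "1.4e-1019", "2.2e-1316"]:
    print(p_text, round(neg_log10_p(p_text), 2))
```

Any downstream code would then need to work on the log scale, which is presumably where most of the refactoring effort would go.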
An alternative route may be to set the P-value for these SNPs to the lowest possible float in base Python and adjust the SE accordingly.
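One reading of this alternative is sketched below. It is an assumption about what "adjust the SE accordingly" would mean in practice: the P-value is clamped to the smallest positive normalized double, and the SE is rescaled so that |BETA|/SE equals the |Z| implied by that clamped two-sided P. `clamp_p_and_se` and `P_FLOOR` are hypothetical names, and negative or missing P-values are not handled.

```python
import sys
from statistics import NormalDist

P_FLOOR = sys.float_info.min  # smallest positive normalized double, ~2.2e-308


def clamp_p_and_se(beta, se, p_text, floor=P_FLOOR):
    """Clamp an underflowing two-sided P to `floor` and rescale SE to match.

    Returns (p, se). Rows whose P is already >= floor are returned unchanged.
    """
    p = float(p_text)
    if p >= floor:
        return p, se
    # |Z| whose two-sided normal P-value equals the floor (about 37.5).
    z_at_floor = -NormalDist().inv_cdf(floor / 2)
    return floor, abs(beta) / z_at_floor


# BETA and P are the example quoted above; the SE is made up for illustration.
p, se = clamp_p_and_se(beta=-0.09, se=0.0017, p_text="1.3E-582")
print(p, se)
```

Note that this deliberately understates the evidence for the clamped SNPs, since every one of them ends up with the same |Z|, so it trades accuracy for being able to run at all.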
Any assistance or guidance you can provide would be greatly appreciated.
Thanks.
Hi Vince + Julyan,
The MTAG framework assumes a homogeneous genetic variance-covariance matrix (Omega). Even if the SNP hits are highly significant and have been replicated in other studies, I would imagine it is unlikely that their effect sizes are consistent with this assumption. (See the MTAG SNP filters section of the Online Methods in our paper for how we dealt with the inversion region that posed a similar problem in our neuroticism phenotype.) I would suggest restricting your SNPs to those with an approximately homogeneous Omega across the genome.
Omeed
Dear developers,
I have run into the same problem, and how to deal with it seems tricky. If we filter by P-value, which depends on sample size, then with UK Biobank-scale data it is now really easy to get P-values smaller than the Python float limit for a continuous trait.
- If the argument is the 'homogeneous genetic variance-covariance matrix', shouldn't we filter by effect size rather than by P-value? Otherwise we accept a SNP at a sample size of 1,000 but exclude it at a sample size of 1,000,000, even though the effect size is the same and only the standard error estimate of beta differs. That does not seem like a consistent solution.
- I am wondering whether MTAG can handle the 'homogeneous genetic variance-covariance matrix' assumption in a better way. Excluding the most significant large-effect SNPs removes what are probably the most interesting signals we want to pursue in the next step. As @vforget mentioned, these are true signals.
- Can the developers suggest how to exclude/include SNPs for an MTAG run, and then how to add them back if they are true large-effect signals?
Thank you very much for your help!
Best regards Wallace
Hi Wallace,
I'm sorry I haven't responded to your issue yet. It has been a crazy week. I'll get to it by the end of the week, if that's OK?
Patrick
Hello Wallace,
Sorry for the delay on this. Here are a few of my thoughts.
The violation of the homogeneous Omega assumption isn't about whether the SNP is highly significant or not, but outliers in effect sizes tend to be highly significant. We had to deal with this in the original MTAG paper since there is a locus for neuroticism whose effect is much larger than any other effect size in the genome. This could easily be seen by examining the Manhattan plot. I don't believe that the p-values for the locus were so small that they rounded to zero, but we omitted that region from the MTAG analysis anyway, since we worried that the relationship between the effect sizes for each of the traits may be different in that region than it was in other regions of the genome.
So in your case, it depends on whether the effect sizes for your variants with extremely low p-values are the exception or the rule (i.e., are they just in the tail of an otherwise smooth distribution or do they appear to be outliers). If they are outliers, you may want to drop the region containing that SNP (regardless of the p-value) so you don't violate the underlying assumptions of MTAG. If they are not outliers, it's not totally clear to me what the right answer is. I suspect that you'd run into fewer problems if you dropped regions containing SNPs with extreme p-values than if you just dropped the individual SNPs with extreme p-values. And this probably wouldn't be too costly for you from a discovery standpoint because you already have strong evidence of a signal in the locus.
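Dropping a whole region rather than individual SNPs could look roughly like the sketch below. Everything in it is an assumption made for illustration: `drop_regions` is a hypothetical helper, the rows are plain dicts with "SNP", "CHR" and "BP" keys, and the 500 kb window is an arbitrary default rather than a recommended value.

```python
def drop_regions(snps, flagged, window_bp=500_000):
    """Split SNPs into (kept, dropped) by distance to any flagged SNP.

    A SNP is dropped if it lies within `window_bp` base pairs of a flagged
    SNP on the same chromosome.
    """
    kept, dropped = [], []
    for snp in snps:
        near = any(
            snp["CHR"] == f["CHR"] and abs(snp["BP"] - f["BP"]) <= window_bp
            for f in flagged
        )
        (dropped if near else kept).append(snp)
    return kept, dropped


# Made-up SNP IDs and positions, for illustration only.
snps = [
    {"SNP": "rsA", "CHR": 1, "BP": 1_000_000},
    {"SNP": "rsB", "CHR": 1, "BP": 1_200_000},
    {"SNP": "rsC", "CHR": 1, "BP": 5_000_000},
    {"SNP": "rsD", "CHR": 2, "BP": 1_000_000},
]
kept, dropped = drop_regions(snps, flagged=[snps[0]])
print([s["SNP"] for s in kept], [s["SNP"] for s in dropped])
```

The nested loop is quadratic in the worst case, which is fine for a handful of flagged loci but would need an interval-based lookup for many.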
In either case, you could just add the GWAS summary stats that you dropped back in when you are done as long as you signaled to readers which results are GWAS results and which are MTAG results in the final table.
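Adding the held-out rows back with an explicit label might be sketched as follows. `merge_results` and the `SOURCE` column are hypothetical names chosen for this example, and the rows are assumed to be dicts that share "CHR" and "BP" keys.

```python
def merge_results(mtag_rows, dropped_gwas_rows):
    """Combine MTAG output with held-out GWAS rows, labeling each row's source."""
    merged = [dict(row, SOURCE="MTAG") for row in mtag_rows]
    merged += [dict(row, SOURCE="GWAS") for row in dropped_gwas_rows]
    # Sort by genomic position so both kinds of rows interleave naturally.
    return sorted(merged, key=lambda r: (r["CHR"], r["BP"]))


# Made-up rows, for illustration only.
mtag_rows = [{"SNP": "rsC", "CHR": 1, "BP": 5_000_000}]
gwas_rows = [{"SNP": "rsA", "CHR": 1, "BP": 1_000_000}]
for row in merge_results(mtag_rows, gwas_rows):
    print(row["SNP"], row["SOURCE"])
```

Keeping the label as a column means the final table makes clear which estimates are original GWAS results and which are MTAG results.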
But to be clear, this is just me taking educated guesses. I haven't tested this carefully.
Hope this is helpful.
Best, Patrick
Dear Patrick,
Thank you very much for your reply. We are able to make it run if we only use common variants; once we include more rare variants, we get errors like this. So what you said makes sense.
However, we are trying to check whether the multi-trait GWAS results from MTAG can help us improve the prediction power of PRS, for example with the LDpred algorithm. The results look good in your NG paper. From a PRS standpoint, can I still leave the outliers (or strong-effect loci) out and then merge them back in after the MTAG run?
Best regards Wallace
Error converting summary statistics
Input: Two GWASs with 18,593,659 and 18,593,553 SNPs.
P-value ranges = [1.4e-1019, 1.0], [2.2e-1316, 1.0]
Running MTAG with default settings fails. See log below.
allpvalues.log
I suspect that this is due to p-values being smaller than the smallest positive normalized float value in Python (p < 2.23e-308). See https://stackoverflow.com/questions/1835787/what-is-the-range-of-values-a-float-can-have-in-python/1839009
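The limit can be checked directly in Python. The short sketch below only uses the standard library; the variable names are chosen for illustration. Strictly speaking, subnormal values somewhat below 2.23e-308 (down to roughly 5e-324) are still positive, but the P-values in this issue are far smaller than that and parse as exactly zero.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normalized double
subnormal = float("1e-310")            # below it, but still representable
underflowed = float("2.2e-1316")       # smallest P-value in the second GWAS

print(smallest_normal)       # 2.2250738585072014e-308
print(subnormal > 0)         # True
print(underflowed == 0.0)    # True
```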
Solution: Omit all SNPs with p<2.23e-308 from input files. MTAG runs successfully. Omitted SNPs can be logged in separate file with unmodified summary statistics.
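The filtering step could be sketched as below. This is not the script actually used here: `split_by_pvalue` is a hypothetical helper, and it assumes a whitespace-delimited file with a header row and a P-value column named `P`. Omitted rows are returned as their original text so they can be logged with unmodified summary statistics.

```python
import sys


def split_by_pvalue(lines, p_col="P", floor=sys.float_info.min):
    """Split summary-statistics lines into (header, kept, omitted).

    A row is kept when floor <= P <= 1 after parsing P as a float.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    p_index = header.split().index(p_col)
    kept, omitted = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        p = float(line.split()[p_index])
        (kept if floor <= p <= 1.0 else omitted).append(line)
    return header, kept, omitted


# Made-up rows for illustration; rsB reuses the BETA and P quoted in this thread.
example = [
    "SNP BETA P\n",
    "rsA 0.01 0.5\n",
    "rsB -0.09 1.3E-582\n",
]
header, kept, omitted = split_by_pvalue(example)
print(kept, omitted)
```

In practice `lines` could be an open file handle, and the two lists would be written to the MTAG input file and the separate log file respectively.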
New P-value ranges = [1.8e-302, 1.0], [2.8e-305, 1.0]
New log file included below.
minPval_andAbove.log