Why do we need the column "n" in sumStats input when running MTAG

JonJala / mtag

Python command line tool for Multi-Trait Analysis of GWAS (MTAG)

GNU General Public License v3.0

Why do we need the column "n" in sumStats input when running MTAG #185

Open allenXyGao opened 1 year ago

allenXyGao commented 1 year ago

Hi,

A great project! I have some questions about the column "n" in the input sumStats files: (1). In the paper, you mention "we restrict variation in SNP sample sizes by calculating the 90th percentile of the SNP sample size distribution and removing SNPs with a sample size smaller than 75% of this value". Is this the only place where MTAG uses the per-SNP "n"? Does MTAG use these "n" values anywhere else?

(2). If the input sumStats files do not have an "n" column, how should we generate it ourselves? I noticed that there are different answers to this question in the Issues, so I am confused. Specifically, suppose I have three different studies: (a) a case-control study with a binary outcome; (b) a quantitative study with a continuous outcome; (c) a survival study with a (time, disease status) outcome. How should we generate the "n" column for each of these three studies?

Thanks, Allen
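
For readers following along, the sample-size filter quoted in (1) can be sketched as below. This is only an illustration of the quoted sentence, not MTAG's own code: the function names, the `snps` records, and the linear-interpolation percentile convention are all assumptions.

```python
def percentile(values, q):
    """Percentile with linear interpolation (q in [0, 100]).

    The interpolation convention is an assumption; the quoted sentence
    does not say which percentile definition is used.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def filter_by_sample_size(snps, pct=90.0, frac=0.75):
    """Keep SNPs whose n is at least frac times the pct-th percentile of n."""
    cutoff = frac * percentile([s["n"] for s in snps], pct)
    return [s for s in snps if s["n"] >= cutoff]


# Hypothetical records: four well-covered SNPs and one with half the sample.
snps = [
    {"snpid": "rs1", "n": 1000},
    {"snpid": "rs2", "n": 1000},
    {"snpid": "rs3", "n": 1000},
    {"snpid": "rs4", "n": 1000},
    {"snpid": "rs5", "n": 500},
]
kept = filter_by_sample_size(snps)
print([s["snpid"] for s in kept])
```

Here the 90th percentile of n is 1000, so the cutoff is 750 and only the SNP with n = 500 is dropped.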

paturley commented 1 year ago

Hi Allen,

MTAG also uses the N as a proxy for the SE^2 in the calculation. If you are pretty confident that the sample size is constant across SNPs, you can get a pretty good estimate of the appropriate value for N if you use

N = 1/(2p(1-p)*SE^2)

where p is the allele frequency of the SNP and SE is the standard error for the SNP. A heads up though that the betas and SEs that MTAG reports may be in different units than the units of the GWAS. One way to verify how the units differ is to pass each of the sets of sumstats through MTAG one at a time. The new betas should be a constant multiple of the old betas, which should inform how you can transform the MTAG betas into the original betas if you like. Even if you don't transform the betas and SE though, the Z-stat and p-values should be correct.
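
A minimal sketch of the two steps described above: computing N from the allele frequency and SE, and checking that single-trait MTAG betas are a constant multiple of the input betas. The function names and the tolerance are hypothetical, and the formula assumes a phenotypic variance of about one, a limitation raised later in this thread.

```python
def implied_n(freq, se):
    """N = 1 / (2 p (1 - p) SE^2), assuming unit phenotypic variance."""
    if not 0.0 < freq < 1.0:
        raise ValueError("allele frequency must be strictly between 0 and 1")
    if se <= 0.0:
        raise ValueError("standard error must be positive")
    return 1.0 / (2.0 * freq * (1.0 - freq) * se ** 2)


def beta_scale(old_betas, new_betas, rel_tol=1e-3):
    """Return (middle new/old ratio, whether all ratios agree with it).

    rel_tol is an arbitrary illustrative tolerance, not an MTAG setting.
    """
    ratios = sorted(
        new / old for old, new in zip(old_betas, new_betas) if old != 0.0
    )
    if not ratios:
        raise ValueError("need at least one non-zero original beta")
    mid = ratios[len(ratios) // 2]
    constant = all(abs(r - mid) <= rel_tol * abs(mid) for r in ratios)
    return mid, constant


# Hypothetical values: p = 0.5 and SE = 0.01 imply N of about 20000.
print(round(implied_n(0.5, 0.01)))

# Hypothetical betas before and after a single-trait MTAG run.
scale, is_constant = beta_scale([0.1, -0.2, 0.05], [0.2, -0.4, 0.1])
print(scale, is_constant)
```

If `is_constant` comes back false, the betas are not related by a single rescaling, which would be worth investigating before converting units.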

allenXyGao commented 1 year ago

Hi Patrick,

Thanks for sharing the information! Just want to confirm that we can simply use the same equation, i.e., N = 1/(2p(1-p)*SE^2), to obtain N for different kinds of studies, e.g., case/control study with binary outcome, or quantitative study with continuous outcome, or even survival study with (time, disease status) outcome, right?

ps: when using the formula, N = 1/(2p(1-p)*SE^2), should we assume that the phenotype is already standardized, in other words, var(y) is about 1? But what about the case/control study or survival study?

Thanks, Allen

paturley commented 1 year ago

You know, as I'm thinking about this harder, I think this is wrong, but I'm not sure what the right way forward is. Let me think about this a bit.

dqq0404 commented 6 months ago

@paturley Hi Patrick, I have a binary logistic regression GWAS Summary data with no columns for n, ncases and ncontrols. Can I use your formula: N = 1/(2p(1-p)*SE^2) to calculate the value of n for each SNP? Thanks, Qq

paturley commented 6 months ago

So the problem with using that formula is that it assumes that the phenotypic variance is one, which it won't be for a binary phenotype. For a binary phenotype, probably something like

N = (F(1- F))/(2p(1-p)*SE^2)

is better, but I'm hesitant to recommend it since I haven't tested it. If you do use it, I would do some stress testing to make sure it isn't producing results that don't make sense.
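
A sketch of the untested binary-phenotype formula above, with a small summary helper for the kind of stress testing suggested. F is the prevalence of the binary phenotype, as clarified later in this thread. The function names and input records are hypothetical, and the thread does not address whether an SE from a logistic regression is on the right scale for this formula.

```python
import statistics


def implied_n_binary(freq, se, prevalence):
    """N = F (1 - F) / (2 p (1 - p) SE^2); untested per the thread."""
    for name, value in (("freq", freq), ("prevalence", prevalence)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    if se <= 0.0:
        raise ValueError("standard error must be positive")
    return prevalence * (1.0 - prevalence) / (2.0 * freq * (1.0 - freq) * se ** 2)


def summarize_implied_n(rows, prevalence):
    """Return (min, median, max) of implied N to compare with the known study size."""
    ns = [implied_n_binary(r["freq"], r["se"], prevalence) for r in rows]
    return min(ns), statistics.median(ns), max(ns)


# Hypothetical SNPs whose SEs are consistent with one shared sample size.
rows = [
    {"freq": 0.5, "se": 0.01},
    {"freq": 0.2, "se": 0.0125},
]
low, mid, high = summarize_implied_n(rows, prevalence=0.5)
print(low, mid, high)
```

If the minimum, median, and maximum are far apart, or far from the sample size reported for the study, that is a sign the formula does not fit the data.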

dqq0404 commented 6 months ago

Oh! I would appreciate it if you could do the testing. I look forward to your reply! By the way, what does the F mean?

paturley commented 6 months ago

Sorry, the F is the prevalence of the binary phenotype.

Unfortunately, I don't have bandwidth to test this myself right now (and probably not in the near future).

dqq0404 commented 6 months ago

Oh! I see. Thank you for your reply. But there is no prevalence in my sumstats. I looked at other questions about calculating n, and I saw a way to solve this problem: https://github.com/JonJala/mtag/issues/60#issuecomment-705606262 But I only have the total number of cases and controls rather than the number of cases and controls for each SNP. So can I use the total number to calculate n for each SNP? That way the n of every SNP is equal; is that reasonable?

paturley commented 6 months ago

I think that using the total cases and controls in the data (rather than by SNP) is reasonable unless you have reason to believe that it would vary a lot by SNP.
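
A minimal sketch of assigning one study-wide n to every SNP from the total case and control counts. The thread does not say how to combine the two counts, and the linked comment in issue #60 is not reproduced here, so both options below are assumptions: a plain sum, and the effective sample size 4 / (1/N_cases + 1/N_controls) that is a common convention for case-control GWAS. All names and counts are hypothetical.

```python
def total_n(n_cases, n_controls):
    """Plain total sample size."""
    return n_cases + n_controls


def effective_n(n_cases, n_controls):
    """Effective sample size 4 / (1/N_cases + 1/N_controls).

    A common convention, shown as an assumption; not confirmed by this thread.
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("counts must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def add_constant_n(snps, n):
    """Return copies of the SNP records with the same n attached to each."""
    return [dict(s, n=n) for s in snps]


# Hypothetical study with 1000 cases and 3000 controls.
snps = [{"snpid": "rs1"}, {"snpid": "rs2"}]
with_n = add_constant_n(snps, total_n(1000, 3000))
print(with_n)
print(effective_n(1000, 3000))
```

Whichever definition is used, every SNP gets the same n, so MTAG's sample-size filter will not drop any SNP on that basis.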

dqq0404 commented 6 months ago

Thanks!!! You have given me a lot of help!
