Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 15 forks source link

Add parameter for assuming minor allele is the effect allele #193

Closed Al-Murphy closed 1 month ago

Al-Murphy commented 2 months ago

The effect allele in a sumstats is not always the minor allele in a reference genome. As such, MSS does not assume that the effect allele is the minor allele. However this means that if the authors of sumstats mix up the naming of the A1 and A2 allele, MSS will not catch this mistake and will assume all effect columns relate to the major allele and will thus flip them to get the effect values relating to the minor allele. However if the columns were misnamed, we would not want the effect columns flipped.

I think an example of this is PMID:30846698, gwas data website: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST007001-GCST008000/GCST007560. Here they use ALLELE0 and ALLELE1 to be the non-effect and effect alleles respectively (taken from the README). However, the non-effect allele column are all minor alleles so we believe they misnamed these columns. However, MSS will not catch this as it does not assume the effect allele is the minor allele. An example SNP is rs10910006 which in the raw data had the reference allele as a G and the effect as an A but MSS flipped which is correct. So this means the effect values were also flipped by MSS.

#raw
#SNP   CHR      BP ALLELE1 ALLELE0      FRQ     INFO         BETA         SE     P
#1: rs10910006     1 3580666       A       G 0.842519 0.988623 -0.000474345 0.00100011  0.64
#reformatted
#SNP   CHR      BP     A1     A2       FRQ     INFO         BETA          SE     P
#rs10910006     1 3580666      A      G 0.1574810 0.988623  4.74345e-04 0.001000110 0.640

To catch cases where the columns were mislabelled we would need to assume the effect column is the minor allele. I think it would be worthwhile to have as a parameter: effect_is_mionor_allele (default to FALSE as I'm not sure if this is a sensible assumption) for MSS.

It should be easy enough to implement by just checking which of the two effect columns match the reference genome more and assigning that an unambiguous name like non-effect column and the other as effect column (see my 'on reference genome' and called in format_sumstats() and 'infer reference genome build' and called in format_sumstats(), checks in the package, it would be nearly identical to these). I think this check should be added on line 567 of format sumstats here.

Al-Murphy commented 2 months ago

It would be good to get other's opinions on this - @mykmal @sjfandrews @jonathangriffiths @rmgpanw

jonathangriffiths commented 2 months ago

Hi there - this does sound like a sensible thing to add. However, the number of ways that people can mess up sumstats is uncountable, and many of these ways are tricky to diagnose (often requiring downstream analyses like LDSC regression, or by looking at patterns of LD and effect sizes to identify flips).

So, while I think this is a sensible addition, I'm not sure how easily or reasonably a user would be able to make the assumption you describe about all minor alleles being the effect allele.

Keep up the good work on this package! Jonny

Al-Murphy commented 2 months ago

Cheers for the input @jonathangriffiths!

I guess the thing is that the assumption isn't All minor alleles being effect alleles only the majority. The check would just say if the designated A2 has more matches to the reference than the designated A1, swap the column naming. Then any cases that have the major allele as the effect allele would be flipped (along with their effect columns) to match the direction of the rest of the SNPs (as MSS does already). Again, this would be the default setting but I think it is a sensible choice for most cases.

rmgpanw commented 2 months ago

Hi thanks for tagging me - this sounds like an important issue and could be a really helpful new feature to implement. Could you please point me/us to where your 'on reference genome' and 'infer reference genome build' are located? Thanks

Al-Murphy commented 2 months ago

Sure @rmgpanw, updated in the original but also below. To note, I have added some checks already to help with ambiguous naming of the allele columns which are worth baring in mind, namely the infer_effect_column(), function here which should catch cases where the naming is ambiguous of the effect/non-effect allele such as A1/A2. Although these won't catch cases of mislabeling when not using naming like A1/A2.

'on reference genome' and called in format_sumstats()

'infer reference genome build' and called in format_sumstats(), checks in the package, it would be nearly identical to these.

Can discuss further if you like?

mykmal commented 2 months ago

Thanks for tagging me! I agree that this is an important issue, but I don't think that adding a new check/parameter is the best solution.

Last fall you added the infer_effect_column() function to MSS, which is enabled by setting the infer_eff_direction parameter. When this parameter is set to TRUE, the function will perform the following three steps:

  1. Check whether ambiguous naming conventions are used in the allele columns. We consider allele column names to be ambiguous if they contain the numbers 1 or 2. For example, allele columns named ALLELE1 and ALLELE2 would be considered ambiguous.
  2. If both allele columns are ambiguous, check whether the meaning of 1 and 2 can be inferred from other columns. For example, the presence of a column named ALLELE2_EFFECT_SIZE indicates that the ALLELE2 column contains effect alleles, and MSS will correctly infer this.
  3. If the reference and effect alleles cannot be inferred from other column names, then MSS will match both allele columns to the reference genome. The allele column with more matches will be interpreted to be the reference allele column.

From what I understand, the solution you are proposing in this issue is identical to the third check done by infer_effect_column() when infer_eff_direction = TRUE. Hence, I do not think that a new check or parameter is needed.

The reason that MSS is failing to correctly infer reference and effect alleles in your example GWAS is because of how we defined what counts as an ambiguous column name. In particular, MSS only considers column names with the numbers 1 and 2 to be ambiguous. Moreover, MSS only attempts to infer reference and effect alleles from other column names or from the reference genome if both of the original allele column names are ambiguous. In your example, however, the allele column names contain the numbers 0 and 1 instead. Only one of them is treated as ambiguous, so MSS does not try to match them to the reference genome. Rather than adding an entirely new check that would duplicate existing functionality, I suggest modifying the definition of ambiguous column names in infer_effect_column() to include the number 0 as well.

P.S. Those just joining this conversation might want to refer to the discussion in https://github.com/Al-Murphy/MungeSumstats/pull/168 where @Al-Murphy and I came up with the rationale for the three-step checks described above.

Al-Murphy commented 2 months ago

Hey @mykmal,

Thanks for this - I didn't even consider the third check of infer_effect_column()! I agree with modifying the definition of ambiguous column names in infer_effect_column() to include the number 0 as well.

This check is run by default - infer_eff_direction = TRUE which I agree with. However, I think it could be useful to have an option to run check 3 even if the column naming doesn't contain 0/1/2. This would cover cases where the columns have just been completely mislabeled e.g. effect allele column name given to other allele. By default though I think this should be set to False. Do we all agree on that? - @rmgpanw @jonathangriffiths

rmgpanw commented 2 months ago

Yes that sounds great to me!

mykmal commented 2 months ago

Hi @Al-Murphy,

If a user suspects that the columns in their GWAS summary statistics might be mislabeled despite having unambiguous names (e.g., a column named EFFECT_ALLELE is actually the non-effect allele), then they should carefully investigate that and rename the columns accordingly before running MSS. I don't think it's our job to try to catch such cases. Having said that, if you think that it would be useful to add a new option to manually enable the third check in infer_effect_column() regardless of what the column names are, I am fine with that as long as it is off by default. But the higher priority fix, in my opinion, is to make sure that column names containing the number 0 get treated as ambiguous.

Thanks again for bringing this up and for all your awesome work on MungeSumstats!

Al-Murphy commented 2 months ago

Yep, definitely off by default! I'll work on these additions today. Thanks for your opinions everyone!

Al-Murphy commented 2 months ago

Okay I believe I have captured all this in v1.13.7:

Bug fix

New features

To give an example using the same sumstats from above:

ss <- data.table::fread("~/Downloads/Dashti_30846698_long_sleep_duration_sumstats.txt.zip")
colnames(ss)[colnames(ss) == "P_LONGSLEEP"] <- "P"
colnames(ss)[colnames(ss) == "BETA_LONGSLEEP"] <- "BETA"
colnames(ss)[colnames(ss) == "SE_LONGSLEEP"] <- "SE"
#take small subset for speed
ss_small <- ss[sample(.N,100)]

Now run as normal with default settings:

format_ss <- MungeSumstats::format_sumstats(path=ss_small,
                                            ref_genome="GRCh37",dbSNP=144,
                                            infer_eff_direction=TRUE #the default
                                            )

This will infer the effect direction using the A1FREQ column (check 2). See the messages outputted by MSS:

#Infer Effect Column
#First line of summary statistics file: 
#  SNP  CHR BP  ALLELE1 ALLELE0 A1FREQ  INFO    BETA    SE  P   
#Allele columns are ambiguous, attempting to infer direction
#Found direction from effect/frq column naming
#
#Checking for correct direction of A1 (reference) and A2 (alternative allele).
#There are 99 SNPs where A1 doesn't match the reference genome.
#These will be flipped with their effect columns.
#

And the output sumstats:

#SNP   CHR        BP     A1     A2       FRQ     INFO         BETA          SE     P
#<char> <int>     <int> <char> <char>     <num>    <num>        <num>       <num> <num>
#  1:   rs3909666     1  72402494      T      C 0.1821710 0.992098 -0.000209885 0.000941797 0.840
#. 2:  rs10874455     1  85141895      A      T 0.9501607 0.994893  0.003479560 0.001670910 0.037
#. 3: rs192069370     1 156969759      C      T 0.0207730 0.994288 -0.000514802 0.002549940 0.850
#. 4:  rs13376258     1 166604396      A      C 0.0001590 0.964035 -0.035820700 0.031546000 0.270

So here, A1FREQ tells MSS this is the expected direction (A1 when A0/A1 are the alleles) so there is no need to rename the columns. So then the effect alleles get flipped since the newly named A1 (A0 to begin with) doesn't match the reference genome.

Now if we rerun and remove A1FREQ so MSS can't use this for inference, this should invoke check 3 where the reference genome is used to infer effect direction:

#remove inference from FREQ
ss_small2 <- copy(ss_small)
setnames(ss_small2,"A1FREQ","FREQ")
format_ss <- MungeSumstats::format_sumstats(path=ss_small2,
                                            ref_genome="GRCh37",dbSNP=144,
                                            infer_eff_direction=TRUE #the default
)

#output message:
#Effect/frq column(s) relate to A1 in the inputted sumstats
#Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
#
#Checking for correct direction of A1 (reference) and A2 (alternative allele).
#

#Output:
#SNP   CHR        BP     A1     A2       FRQ     INFO         BETA          SE     P
#<char> <int>     <int> <char> <char>     <num>    <num>        <num>       <num> <num>
#  1:   rs3909666     1  72402494      T      C 0.8178290 0.992098  0.000209885 0.000941797 0.840
#  2:  rs10874455     1  85141895      A      T 0.0498393 0.994893 -0.003479560 0.001670910 0.037
#  3: rs192069370     1 156969759      C      T 0.9792270 0.994288  0.000514802 0.002549940 0.850
#  4:  rs13376258     1 166604396      A      C 0.9998410 0.964035  0.035820700 0.031546000 0.270

Now we can see the reference genome check forced the renaming of A0/A1 and thereafter no flipping was needed. The effect is the effect alleles aren't flipped whereas for the first case they were - MSS inferred that the columns were mislabeled.

Finally, what happens using the new eff_on_minor_alleles parameter (which is off by default). So now check 3 will be invoked and the previous checks are ignored so it doesn't matter that A1FREQ is present:

#try with eff_on_minor_alleles
ss_small3 <- copy(ss_small)
format_ss <- MungeSumstats::format_sumstats(path=ss_small3,
                                            ref_genome="GRCh37",dbSNP=144,
                                            infer_eff_direction=TRUE, #the default
                                            eff_on_minor_alleles=TRUE # not the default
)
#Inferring direction from genome (check 3) - skips first 2 even though conditions are met like A1FREQ

#output message:
#Effect/frq column(s) relate to A1 in the inputted sumstats
#Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
#
#Checking for correct direction of A1 (reference) and A2 (alternative allele).
#

#Output:
#SNP   CHR        BP     A1     A2       FRQ     INFO         BETA          SE     P
#<char> <int>     <int> <char> <char>     <num>    <num>        <num>       <num> <num>
#  1:   rs3909666     1  72402494      T      C 0.8178290 0.992098  0.000209885 0.000941797 0.840
#  2:  rs10874455     1  85141895      A      T 0.0498393 0.994893 -0.003479560 0.001670910 0.037
#  3: rs192069370     1 156969759      C      T 0.9792270 0.994288  0.000514802 0.002549940 0.850
#  4:  rs13376258     1 166604396      A      C 0.9998410 0.964035  0.035820700 0.031546000 0.270

This gives the same result as the second example.

So overall, we think this sumstats was mislabelled which MSS can catch with the ambiguous allele check when check 2 doesn't catch it or when eff_on_minor_alleles is set to TRUE

Give it a try and let me know what you think!

bschilder commented 1 month ago

@Al-Murphy I agree with your approach above.

If a user suspects that the columns in their GWAS summary statistics might be mislabeled despite having unambiguous names (e.g., a column named EFFECT_ALLELE is actually the non-effect allele), then they should carefully investigate that and rename the columns accordingly before running MSS. I don't think it's our job to try to catch such cases.

@mykmal When working with one or a couple of GWAS I kind of agree with this. But for large-scale analyses (eg my project involves looking at thousands of GWAS datasets) manual inspection becomes impractical. Thus trying to automate as much as possible can be really helpful in these scenarios.

Al-Murphy commented 1 month ago

Thanks everyone for your input, closing this now but feel free to reopen if you want to discuss more.