Closed Al-Murphy closed 1 month ago
It would be good to get other's opinions on this - @mykmal @sjfandrews @jonathangriffiths @rmgpanw
Hi there - this does sound like a sensible thing to add. However, the number of ways that people can mess up sumstats is uncountable, and many of these ways are tricky to diagnose (often requiring downstream analyses like LDSC regression, or by looking at patterns of LD and effect sizes to identify flips).
So, while I think this is a sensible addition, I'm not sure how easily or reasonably a user would be able to make the assumption you describe about all minor alleles being the effect allele.
Keep up the good work on this package! Jonny
Cheers for the input @jonathangriffiths!
I guess the thing is that the assumption isn't All minor alleles being effect alleles only the majority. The check would just say if the designated A2 has more matches to the reference than the designated A1, swap the column naming. Then any cases that have the major allele as the effect allele would be flipped (along with their effect columns) to match the direction of the rest of the SNPs (as MSS does already). Again, this would be the default setting but I think it is a sensible choice for most cases.
Hi thanks for tagging me - this sounds like an important issue and could be a really helpful new feature to implement. Could you please point me/us to where your 'on reference genome' and 'infer reference genome build' are located? Thanks
Sure @rmgpanw, updated in the original but also below. To note, I have added some checks already to help with ambiguous naming of the allele columns which are worth baring in mind, namely the infer_effect_column()
, function here which should catch cases where the naming is ambiguous of the effect/non-effect allele such as A1/A2. Although these won't catch cases of mislabeling when not using naming like A1/A2.
'on reference genome' and called in format_sumstats()
'infer reference genome build' and called in format_sumstats()
, checks in the package, it would be nearly identical to these.
Can discuss further if you like?
Thanks for tagging me! I agree that this is an important issue, but I don't think that adding a new check/parameter is the best solution.
Last fall you added the infer_effect_column() function to MSS, which is enabled by setting the infer_eff_direction
parameter. When this parameter is set to TRUE
, the function will perform the following three steps:
From what I understand, the solution you are proposing in this issue is identical to the third check done by infer_effect_column()
when infer_eff_direction = TRUE
. Hence, I do not think that a new check or parameter is needed.
The reason that MSS is failing to correctly infer reference and effect alleles in your example GWAS is because of how we defined what counts as an ambiguous column name. In particular, MSS only considers column names with the numbers 1 and 2 to be ambiguous. Moreover, MSS only attempts to infer reference and effect alleles from other column names or from the reference genome if both of the original allele column names are ambiguous. In your example, however, the allele column names contain the numbers 0 and 1 instead. Only one of them is treated as ambiguous, so MSS does not try to match them to the reference genome. Rather than adding an entirely new check that would duplicate existing functionality, I suggest modifying the definition of ambiguous column names in infer_effect_column()
to include the number 0 as well.
P.S. Those just joining this conversation might want to refer to the discussion in https://github.com/Al-Murphy/MungeSumstats/pull/168 where @Al-Murphy and I came up with the rationale for the three-step checks described above.
Hey @mykmal,
Thanks for this - I didn't even consider the third check of infer_effect_column()
! I agree with modifying the definition of ambiguous column names in infer_effect_column() to include the number 0 as well.
This check is run by default - infer_eff_direction = TRUE
which I agree with. However, I think it could be useful to have an option to run check 3 even if the column naming doesn't contain 0/1/2. This would cover cases where the columns have just been completely mislabeled e.g. effect allele column name given to other allele. By default though I think this should be set to False. Do we all agree on that? - @rmgpanw @jonathangriffiths
Yes that sounds great to me!
Hi @Al-Murphy,
If a user suspects that the columns in their GWAS summary statistics might be mislabeled despite having unambiguous names (e.g., a column named EFFECT_ALLELE is actually the non-effect allele), then they should carefully investigate that and rename the columns accordingly before running MSS. I don't think it's our job to try to catch such cases. Having said that, if you think that it would be useful to add a new option to manually enable the third check in infer_effect_column()
regardless of what the column names are, I am fine with that as long as it is off by default. But the higher priority fix, in my opinion, is to make sure that column names containing the number 0 get treated as ambiguous.
Thanks again for bringing this up and for all your awesome work on MungeSumstats!
Yep, definitely off by default! I'll work on these additions today. Thanks for your opinions everyone!
Okay I believe I have captured all this in v1.13.7:
infer_eff_direction
now includes A0 as an ambiguous case as well as A1/A2.eff_on_minor_alleles
parameter added (off by default) - controls whether
MungeSumstats should assume that the effects are majoritively measured on the
minor alleles. Default is FALSE as this is an assumption that won't be
appropriate in all cases. However, the benefit is that if we know the majority
of SNPs have their effects based on the minor alleles, we can catch cases where
the allele columns have been mislabelled. To give an example using the same sumstats from above:
ss <- data.table::fread("~/Downloads/Dashti_30846698_long_sleep_duration_sumstats.txt.zip")
colnames(ss)[colnames(ss) == "P_LONGSLEEP"] <- "P"
colnames(ss)[colnames(ss) == "BETA_LONGSLEEP"] <- "BETA"
colnames(ss)[colnames(ss) == "SE_LONGSLEEP"] <- "SE"
#take small subset for speed
ss_small <- ss[sample(.N,100)]
Now run as normal with default settings:
format_ss <- MungeSumstats::format_sumstats(path=ss_small,
ref_genome="GRCh37",dbSNP=144,
infer_eff_direction=TRUE #the default
)
This will infer the effect direction using the A1FREQ column (check 2). See the messages outputted by MSS:
#Infer Effect Column
#First line of summary statistics file:
# SNP CHR BP ALLELE1 ALLELE0 A1FREQ INFO BETA SE P
#Allele columns are ambiguous, attempting to infer direction
#Found direction from effect/frq column naming
#
#Checking for correct direction of A1 (reference) and A2 (alternative allele).
#There are 99 SNPs where A1 doesn't match the reference genome.
#These will be flipped with their effect columns.
#
And the output sumstats:
#SNP CHR BP A1 A2 FRQ INFO BETA SE P
#<char> <int> <int> <char> <char> <num> <num> <num> <num> <num>
# 1: rs3909666 1 72402494 T C 0.1821710 0.992098 -0.000209885 0.000941797 0.840
#. 2: rs10874455 1 85141895 A T 0.9501607 0.994893 0.003479560 0.001670910 0.037
#. 3: rs192069370 1 156969759 C T 0.0207730 0.994288 -0.000514802 0.002549940 0.850
#. 4: rs13376258 1 166604396 A C 0.0001590 0.964035 -0.035820700 0.031546000 0.270
So here, A1FREQ tells MSS this is the expected direction (A1 when A0/A1 are the alleles) so there is no need to rename the columns. So then the effect alleles get flipped since the newly named A1 (A0 to begin with) doesn't match the reference genome.
Now if we rerun and remove A1FREQ so MSS can't use this for inference, this should invoke check 3 where the reference genome is used to infer effect direction:
#remove inference from FREQ
ss_small2 <- copy(ss_small)
setnames(ss_small2,"A1FREQ","FREQ")
format_ss <- MungeSumstats::format_sumstats(path=ss_small2,
ref_genome="GRCh37",dbSNP=144,
infer_eff_direction=TRUE #the default
)
#output message:
#Effect/frq column(s) relate to A1 in the inputted sumstats
#Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
#
#Checking for correct direction of A1 (reference) and A2 (alternative allele).
#
#Output:
#SNP CHR BP A1 A2 FRQ INFO BETA SE P
#<char> <int> <int> <char> <char> <num> <num> <num> <num> <num>
# 1: rs3909666 1 72402494 T C 0.8178290 0.992098 0.000209885 0.000941797 0.840
# 2: rs10874455 1 85141895 A T 0.0498393 0.994893 -0.003479560 0.001670910 0.037
# 3: rs192069370 1 156969759 C T 0.9792270 0.994288 0.000514802 0.002549940 0.850
# 4: rs13376258 1 166604396 A C 0.9998410 0.964035 0.035820700 0.031546000 0.270
Now we can see the reference genome check forced the renaming of A0/A1 and thereafter no flipping was needed. The effect is the effect alleles aren't flipped whereas for the first case they were - MSS inferred that the columns were mislabeled.
Finally, what happens using the new eff_on_minor_alleles
parameter (which is off by default). So now check 3 will be invoked and the previous checks are ignored so it doesn't matter that A1FREQ is present:
#try with eff_on_minor_alleles
ss_small3 <- copy(ss_small)
format_ss <- MungeSumstats::format_sumstats(path=ss_small3,
ref_genome="GRCh37",dbSNP=144,
infer_eff_direction=TRUE, #the default
eff_on_minor_alleles=TRUE # not the default
)
#Inferring direction from genome (check 3) - skips first 2 even though conditions are met like A1FREQ
#output message:
#Effect/frq column(s) relate to A1 in the inputted sumstats
#Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
#
#Checking for correct direction of A1 (reference) and A2 (alternative allele).
#
#Output:
#SNP CHR BP A1 A2 FRQ INFO BETA SE P
#<char> <int> <int> <char> <char> <num> <num> <num> <num> <num>
# 1: rs3909666 1 72402494 T C 0.8178290 0.992098 0.000209885 0.000941797 0.840
# 2: rs10874455 1 85141895 A T 0.0498393 0.994893 -0.003479560 0.001670910 0.037
# 3: rs192069370 1 156969759 C T 0.9792270 0.994288 0.000514802 0.002549940 0.850
# 4: rs13376258 1 166604396 A C 0.9998410 0.964035 0.035820700 0.031546000 0.270
This gives the same result as the second example.
So overall, we think this sumstats was mislabelled which MSS can catch with the ambiguous allele check when check 2 doesn't catch it or when eff_on_minor_alleles
is set to TRUE
Give it a try and let me know what you think!
@Al-Murphy I agree with your approach above.
If a user suspects that the columns in their GWAS summary statistics might be mislabeled despite having unambiguous names (e.g., a column named EFFECT_ALLELE is actually the non-effect allele), then they should carefully investigate that and rename the columns accordingly before running MSS. I don't think it's our job to try to catch such cases.
@mykmal When working with one or a couple of GWAS I kind of agree with this. But for large-scale analyses (eg my project involves looking at thousands of GWAS datasets) manual inspection becomes impractical. Thus trying to automate as much as possible can be really helpful in these scenarios.
Thanks everyone for your input, closing this now but feel free to reopen if you want to discuss more.
The effect allele in a sumstats is not always the minor allele in a reference genome. As such, MSS does not assume that the effect allele is the minor allele. However this means that if the authors of sumstats mix up the naming of the A1 and A2 allele, MSS will not catch this mistake and will assume all effect columns relate to the major allele and will thus flip them to get the effect values relating to the minor allele. However if the columns were misnamed, we would not want the effect columns flipped.
I think an example of this is PMID:30846698, gwas data website: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST007001-GCST008000/GCST007560. Here they use ALLELE0 and ALLELE1 to be the non-effect and effect alleles respectively (taken from the README). However, the non-effect allele column are all minor alleles so we believe they misnamed these columns. However, MSS will not catch this as it does not assume the effect allele is the minor allele. An example SNP is rs10910006 which in the raw data had the reference allele as a G and the effect as an A but MSS flipped which is correct. So this means the effect values were also flipped by MSS.
To catch cases where the columns were mislabelled we would need to assume the effect column is the minor allele. I think it would be worthwhile to have as a parameter: effect_is_mionor_allele (default to FALSE as I'm not sure if this is a sensible assumption) for MSS.
It should be easy enough to implement by just checking which of the two effect columns match the reference genome more and assigning that an unambiguous name like non-effect column and the other as effect column (see my 'on reference genome' and called in
format_sumstats()
and 'infer reference genome build' and called informat_sumstats()
, checks in the package, it would be nearly identical to these). I think this check should be added on line 567 of format sumstats here.