Closed jksull closed 1 year ago
For good measure, I set the allele_flip_check
flag to FALSE
:
Before munging
variant minor_allele minor_AF expected_case_minor_AC low_confidence_variant n_complete_samples AC ytx beta se tstat pval
1:15791:C:T T 5.42862e-09 3.91078e-05 true 361194 3.92157e-03 0.00000e+00 -3.56322e+00 2.53027e+01 -1.40823e-01 8.88009e-01
1:69487:G:A A 5.74891e-06 4.14151e-02 true 361194 4.15294e+00 0.00000e+00 -1.13132e-02 4.95620e-02 -2.28264e-01 8.19441e-01
1:69569:T:C C 1.87738e-04 1.35246e+00 true 361194 1.35620e+02 9.72549e-01 -2.70834e-03 8.88871e-03 -3.04695e-01 7.60599e-01
After munging
SNP CHR BP A1 A2 FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT N AC YTX BETA SE TSTAT P
rs547522712 1 15791 T C 0.00000000542862 0.0000391078 TRUE 361194 0.00392157 0 -3.56322 25.3027 -0.140823 0.888009
rs568226429 1 69487 A G 0.00000574891 0.0414151 TRUE 361194 4.15294 0 -0.0113132 0.049562 -0.228264 0.819441
rs2531267 1 69569 C T 0.000187738 1.35246 TRUE 361194 135.62 0.972549 -0.00270834 0.00888871 -0.304695 0.760599
You can see that the FRQ
and beta
are no longer flipped, but the A1
allele and A2
allele are incorrectly assigned (i.e. they have been reversed). It seems like with this dataset, MSS cannot properly recognise the columns?
Apologies for the added comments, but I also thought it's imperative to mention that this dataset is taken directly from the commonly used Neale Lab GWAS summary statistics, run on UK Biobank data. You can download the full file using this command:
wget "https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/N20.gwas.imputed_v3.both_sexes.tsv.bgz" -O N20.gwas.imputed_v3.both_sexes.tsv.bgz
Hey!
Great to hear you find the package useful. Thank you for the very detailed bug description, it made things quite easy to investigate.
So the behaviour you are noting isn't necessarily a bug per se. I set up MSS so that if a column with 1:69569:T:C was detected it would be taken as CHR:BP:A2:A1
. This is what this printed message meant:
Column ID has been separated into the columns CHR, BP, A2, A1
As you can imagine there is no way to know whether the inputted data is CHR:BP:A2:A1 or CHR:BP:A1:A2 so I added the message to at least explain the assumption.
Another solution which is already in MSS is to rename the column, giving the order. See below, I just copied the 9 rows in your example to a csv:
#if you rename variant it will work as expected
samp_ss <- data.table::fread("~/Downloads/samp_dat_ss.csv")
samp_ss[,MAF_ORG:=minor_AF]
data.table::setnames(samp_ss,"variant","CHR:BP:A1:A2")
a2 <- MungeSumstats::format_sumstats(samp_ss,
dbSNP = 144,
ref_genome = "GRCh37")
munged2 <- data.table::fread(a2)
munged2
SNP CHR BP A1 A2 FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT N AC YTX BETA
1: rs547522712 1 15791 C T 5.43e-09 3.91e-05 TRUE 361194 3.92e-03 0.000 -3.560000
2: rs568226429 1 69487 G A 5.75e-06 4.14e-02 TRUE 361194 4.15e+00 0.000 -0.011300
3: rs2531267 1 69569 T C 1.88e-04 1.35e+00 TRUE 361194 1.36e+02 0.973 -0.002710
4: rs533633326 1 139853 C T 5.66e-06 4.08e-02 TRUE 361194 4.09e+00 0.000 -0.011100
5: rs12238997 1 693731 A G 1.16e-01 8.34e+02 FALSE 361194 8.37e+04 805.000 -0.000390
6: rs371890604 1 707522 G C 9.73e-02 7.01e+02 FALSE 361194 7.03e+04 659.000 -0.000722
7: rs144155419 1 717587 G A 1.57e-02 1.13e+02 FALSE 361194 1.13e+04 116.000 0.000313
8: rs189787166 1 723329 A T 1.73e-03 1.25e+01 TRUE 361194 1.25e+03 8.510 -0.003470
SE TSTAT P MAF_ORG
1: 2.53e+01 -0.141 0.8880 5.43e-09
2: 4.96e-02 -0.228 0.8190 5.75e-06
3: 8.89e-03 -0.305 0.7610 1.88e-04
4: 4.96e-02 -0.225 0.8220 5.66e-06
5: 3.88e-04 -1.010 0.3140 1.16e-01
6: 4.36e-04 -1.660 0.0978 9.73e-02
7: 1.04e-03 0.301 0.7630 1.57e-02
8: 3.07e-03 -1.130 0.2580 1.73e-03
However, since you also have a separate minor allele (A2) column supplied, I think MSS could be a little smarter and use this to infer the order of CHR:BP:A2:A1 or CHR:BP:A1:A2. This has been implemented in the dev version of MSS (1.7.11) which you can use immediately by installing directly from Github. This update results in the expected behaviour:
samp_ss <- data.table::fread("~/Downloads/samp_dat_ss.csv")
samp_ss[,MAF_ORG:=minor_AF]
a <- MungeSumstats::format_sumstats(samp_ss,
dbSNP = 144,
ref_genome = "GRCh37")
munged <- data.table::fread(a)
munged
SNP CHR BP A1 A2 FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT N AC YTX BETA
1: rs547522712 1 15791 C T 5.43e-09 3.91e-05 TRUE 361194 3.92e-03 0.000 -3.560000
2: rs568226429 1 69487 G A 5.75e-06 4.14e-02 TRUE 361194 4.15e+00 0.000 -0.011300
3: rs2531267 1 69569 T C 1.88e-04 1.35e+00 TRUE 361194 1.36e+02 0.973 -0.002710
4: rs533633326 1 139853 C T 5.66e-06 4.08e-02 TRUE 361194 4.09e+00 0.000 -0.011100
5: rs12238997 1 693731 A G 1.16e-01 8.34e+02 FALSE 361194 8.37e+04 805.000 -0.000390
6: rs371890604 1 707522 G C 9.73e-02 7.01e+02 FALSE 361194 7.03e+04 659.000 -0.000722
7: rs144155419 1 717587 G A 1.57e-02 1.13e+02 FALSE 361194 1.13e+04 116.000 0.000313
8: rs189787166 1 723329 A T 1.73e-03 1.25e+01 TRUE 361194 1.25e+03 8.510 -0.003470
SE TSTAT P MAF_ORG
1: 2.53e+01 -0.141 0.8880 5.43e-09
2: 4.96e-02 -0.228 0.8190 5.75e-06
3: 8.89e-03 -0.305 0.7610 1.88e-04
4: 4.96e-02 -0.225 0.8220 5.66e-06
5: 3.88e-04 -1.010 0.3140 1.16e-01
6: 4.36e-04 -1.660 0.0978 9.73e-02
7: 1.04e-03 0.301 0.7630 1.57e-02
8: 3.07e-03 -1.130 0.2580 1.73e-03
Thanks for the quick response! I can see what you mean. I did think that if nothing else, we could expect the format CHR:BP:A1:A2
to be the standard, as it is in line with the SPDI Notation (mostly), but if I've learnt anything recently it's not to trust GWAS Summary statistics..ever!
Hey! Fair comment, I've updated so the default is CHR:BP:A1:A2
too (v 1.7.12
). Let me know if this isn't working as expected
Thanks a lot! I think that makes sense, and will let you know if I see any issues :)
Hey @Al-Murphy , just checking that the behaviour you describe in your first response as part of 1.7.11
:
"However, since you also have a separate minor allele (A2) column supplied, I think MSS could be a little smarter and use this to infer the order of CHR:BP:A2:A1 or CHR:BP:A1:A2."
Is still the expected behaviour in the 1.7.12
? And additionally, does it infer on a variant-by-variant (i.e. row-by-row) manner?
It appears that SNPs within the same summary stats file can be in either order:
e.g.
variant minor_allele minor_AF
1:69569:T:C C 0.0156927
1:753405:C:A C 0.12976
Which would make this imputation (if it is applied on the entire df) incorrect for some SNPs, with little way of being certain.
The problem I am envisioning if the order is inferred for the entire dataframe, is that the second SNP in the example above will flip the minor_allele
to become the A1
, despite it being explicit that the T
is the minor allele, if I understand correctly?
Hey! Just to clarify, the use of an additional column for A1/A2 that I described as being implemented in 1.7.11 is still there in 1.7.12. The check works by simply looking at the first row in the sumstats and checking the A2 column there to find a match. I suppose it might be better to sample more rows and do a majority vote (feel free to make a pull request with this if you think it's necessary for your use case). This check is however overwritten though if the column is named as something like CHR:BP:A1:A2.
The example you described would be very worrying if it is the case. If there were row differences where the A1 and A2 values order change but their MAF and Beta columns aren't flipped along with these, I don't see how it could ever be accounted for? Think about the example where we have:
SNP CHR BP A1 A2 BETA FRQ
rs0001 1 15791 C T 0.013 5.43e-09
rs0002 1 69487 G A -0.018 5e-06
If however, rs0001 had A1 = major allele and rs0002 had A1 = minor allele, we would need to assume that the BETA and FRQ values are not in the same direction i.e. rs0001 FRQ describes the minor allele frequency or MAF and for rs0002 it is the major allele frequency or vice versa. Then we can flip the alleles for one of these and also flip the FRQ and BETA values. This idea of consistent direction of these effect and frequency columns is paramount.
"This idea of consistent direction of these effect and frequency columns is paramount."
I completely agree, but I'm afraid this doesn't seem to be the case.
I have to admit, I'm a little confused. But allow me to provide some further context.
In the above example (using MSS 1.7.8
), the resulting munged stats looks something like this:
SNP CHR BP A1 A2 FREQ
rs25131267 1 69569 T C 0.0156927
rs3115860 1 753405 C A 0.1297
As you can see, the minor_allele
for the second SNP is inferred as the A1
allele (but notice how the FREQ hasn't been flipped).
So, based on your response above, 1.7.11
won't ignore the variant
column in favour of the minor_allele
, on a SNP-by-SNP basis, but instead only on the first row (and therefore if this inconsistent direction is present, the resulting munged stats won't be correct)?
I can think of two solutions to the problem
minor_allele
is actually A1
, then flip the beta and the FREQ accordingly (if these flags are set to TRUE).minor_allele
is actually the A1
allele (after finding it on the REF), and flip the other values accordingly (again, of course only if these are set to TRUE).As I understand, the latter part of option 2 (i.e. _perform the allele check to decide whether the minor_allele
is actually the A1
allele, and flip the other values accordingly._) is already part of MSS, but it would be necessary that it does not infer the order based on just the first row, but rather to perform the rsID imputation (based on CHR and BP) separately to deciding whether the minor_allele
matches the REF (i.e. is A1
).
There are of course underlying knowledge of the specific dataset that must be understood regarding whether we want to flip the BETA
as well (some datasets provide a minor_allele
, a minor_AF
, but state explicitly that the minor allele can be the REF allele, and yet they provide the BETA
relative to the effect allele - not relative to the minor_allele
).
And regarding this:
"If however, rs0001 had A1 = major allele and rs0002 had A1 = minor allele, we would need to assume that the BETA and FRQ values are not in the same direction i.e. rs0001 FRQ describes the minor allele frequency or MAF and for rs0002 it is the major allele frequency or vice versa"
This is exactly the issue, and a seemingly a common format of summary statistics. If you look at the following column description in the very commonly used Neale Lab data, you can see this is the case :(
and
So the resulting summary statistics from which this thread was started:
variant minor_allele minor_AF expected_case_minor_AC low_confidence_variant n_complete_samples AC ytx beta se tstat pval
1:15791:C:T T 5.42862e-09 3.91078e-05 true 361194 3.92157e-03 0.00000e+00 -3.56322e+00 2.53027e+01 -1.40823e-01 8.88009e-01
1:69487:G:A A 5.74891e-06 4.14151e-02 true 361194 4.15294e+00 0.00000e+00 -1.13132e-02 4.95620e-02 -2.28264e-01 8.19441e-01
1:69569:T:C C 1.87738e-04 1.35246e+00 true 361194 1.35620e+02 9.72549e-01 -2.70834e-03 8.88871e-03 -3.04695e-01 7.60599e-01
1:139853:C:T T 5.66205e-06 4.07894e-02 true 361194 4.09020e+00 0.00000e+00 -1.11344e-02 4.95635e-02 -2.24648e-01 8.22253e-01
1:692794:CA:C C 1.10640e-01 7.97048e+02 false 361194 7.99247e+04 7.74761e+02 -3.29738e-04 4.10525e-04 -8.03211e-01 4.21853e-01
1:693731:A:G G 1.15830e-01 8.34441e+02 false 361194 8.36744e+04 8.05114e+02 -3.90192e-04 3.87896e-04 -1.00592e+00 3.14455e-01
1:707522:G:C C 9.73034e-02 7.00973e+02 false 361194 7.02908e+04 6.59459e+02 -7.21990e-04 4.36074e-04 -1.65566e+00 9.77915e-02
1:717587:G:A A 1.56880e-02 1.13017e+02 false 361194 1.13329e+04 1.16090e+02 3.13409e-04 1.04059e-03 3.01185e-01 7.63274e-01
1:723329:A:T T 1.73363e-03 1.24891e+01 true 361194 1.25235e+03 8.50980e+00 -3.47324e-03 3.07012e-03 -1.13130e+00 2.57929e-01
...has both directions of alleles, and the minor_AF
is relative only to the minor_allele
, where the "alt allele is not always minor.
It's important to not that the BETA
is relative to the effect allele
, which is also not necessarily the minor_allele
... (my head hurts).
Regarding how the minor_AF
is calculated based on the AF
in the original variant annotation file )the
variants.tsv.bgz` (first photo), they explain it in FAQ:
And for the GWAS statistics themselves:
Back to MSS, assuming the variant
order of CHR:BP:A1:A2
, perhaps this could just be solved by checking whether the minor_allele
corresponds to the A1
(and if so make sure that the FREQ > 0.5), and likewise if the minor_allele
corresponds to the A2
, then the FREQ
must be < 0.5
. To do this there must be a distinction that minor_allele
does not necessarily map to A2
- as evident by the dataset I have linked.
Hey! A few things in this that are worth pointing out:
Check the order of the SNP column based on the minor allele column on a SNP-by-SNP basis. If the minor_allele is actually A1, then flip the beta and the FREQ accordingly (if these flags are set to TRUE).
I don't think this is the best idea because what we actually want is to identify the alternative allele not the minor allele in MSS as this is the allele that the effect columns relate to i.e. the allele that is being tested. This may not be the minor allele in all instances (as you show). I'll explain the best practice on this going forward later.
Have option to ignore the SNP column entirely, firstly allowing MSS to impute the rsID based on CHR and BP (I believe this should be possible as it is?), and also perform the allele check to decide whether the minor_allele is actually the A1 allele (after finding it on the REF), and flip the other values accordingly (again, of course only if these are set to TRUE). This is possible since an RS ID relates to the combination of BP and CHR. So I mean that if the allele is non-bi-allelic, both alternative alleles will have the same RS ID. However, this would cause the same issue as above - we want the alternative allele not the minor allele in MSS as this is the allele that the effect columns relate to. Getting A1/A2 from a reference dataset would not guarantee this (plus if you have any non-bi-allelic alleles, these would cause issues too).
Once we have the alternative allele, MSS will then flip the necessary SNPs (in a SNP by SNP basis) to ensure A1 is the reference or major allele and A2 is the minor. It will also flip the FRQ and effect columns. It is worth noting that, in this way, MSS will only flip the effects in the correct columns.
Looking at the screen grabs of the sumstats in question, I think there is a simple solution to ensure we identify the alternative allele. Essentially, we want MSS to pick up AF as the FRQ value since this relates to the FRQ of the alternative allele studied (whether that be the minor or not). The beta (i.e. the effect column) is already relating to the alternative allele. So as long as MSS picks up this, the SNP flipping will work correctly. MSS already has this as a value for FRQ (see data("sumstatsColHeaders")
for all the mappings). And if you pass MSS a sumstats with AF and minor_allele it will take AF as the FRQ value (i.e. the correct one). See below:
a <- MungeSumstats::formatted_example()
data.table::setnames(a,'FRQ','AF')
a[,minor_allele:=AF]
b <- MungeSumstats::format_sumstats(a)
data.table::fread(b)
> data.table::fread(b)
SNP CHR BP A1 A2 FRQ BETA SE P MINOR_ALLELE
1: rs34305371 1 72733610 G A 0.91231 -0.035 0.005 3.762e-14 0.08769
2: rs1008078 1 91189731 C T 0.62690 0.016 0.003 6.005e-10 0.37310
3: rs61787263 1 98618714 T C 0.76120 0.016 0.003 5.391e-08 0.76120
4: rs2992632 1 243503764 A T 0.66040 0.017 0.003 8.227e-09 0.66040
Note even if this wasn't the case though you can change the sumstatsColHeaders
reference as you need. Hope this makes sense? I agree the whole sumstats standardisation does start to melt your head (really this is what drove me to make MSS).
Also, I'm reopening this issue, not because I plan to work on any alternative functionality based on this but more so, so others are aware of it and we can come back to it at a the future stage!
I believe the subsequent discussions on related issues covered what's going on here so I'm going to close for now.
Cheers, Alan.
Firstly I would just like to thank you for you continued efforts in helping us all deal with the nightmare of non-standardised GWAS summary statistics. I've recently come across this tool and I believe it has the potential to make my (and many others) work enormously more efficient.
My tool versions are:
1.6.0
3.16.0
4.2.2
1. Bug description
Using the following raw summary statistics dataset, the MSS flips the allele frequency (and beta) when it doesn't seem like it should:
Before munging
After munging (Default parameters,
dbSNP=144
,ref_genome = "GRCh37"
):As you can see, the
FRQ
column has flipped, so has the sign of thebeta
. The reason that they are flipped (logically-speaking) is because MSS thinks thatThere are 8 SNPs where A1 doesn't match the reference genome
. I understand that when the SNPs don't match the reference, they may be flipped together with theFRQ
andbeta
, but in this example it is clear that the allele's themselves don't appear to have been flipped (at least according to the resulting data tables) - nor should they be!See the log here:
I have confirmed that the
minor_allele
in the original dataset is in fact the minor allele, orA2
as MSS would see it. This was confirmed through manual search in dbSNP as well as can be inferred from thevariant
column (i.e.1:15791:C:T
in the first row relates to theminor_allele
T
, with the major allele expected to be assigned asC
by MSS).I tried the following additional tests to see if I could enforce the expected behaviour
I will paste only the first three SNPs of each of the following tables (both before and after munging), to keep things as concise as possible
1. Standardised the header prior to munging using
standardise_header
Before munging
After munging
2. Manually changed the name of
minor_allele
column toA2
, prior to mungingBefore munging
After munging
3. Manually changed the name of
minor_allele
column toA2
AND manually added aA1
column with the corresponding correct reference alleleBefore munging
After munging
The log from this final test still explicitly states that
There are 8 SNPs where A1 doesn't match the reference genome.
. See full log below:I apologise if I'm doing something wrong, but I tried to be extensive with my testing and exhaustive in my research into the matter within your docs!
3. Session info