Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 15 forks source link

Allele flip seems not work #188

Open htqqdd opened 3 months ago

htqqdd commented 3 months ago

1. Bug description

The FRQ and BETA is flipped as expected, but A1 and A2 keeps original

Console output

**::NOTE::**

Formatted summary statistics will be saved to ==> /tmp/Rtmp9w0Cmk/file2426307a5ba18f.tsv.gz Infer Effect Column First line of summary statistics file: SNP CHR BP A1 A2 FRQ BETA SE P
Allele columns are ambiguous, attempting to infer direction Standardising column headers. First line of summary statistics file: SNP CHR BP A1 A2 FRQ BETA SE P
Loading SNPlocs data. Loading reference genome data. Preprocessing RSIDs. Validating RSIDs of 1 SNPs using BSgenome::snpsById... BSgenome::snpsById done in 10 seconds. Effect/frq column(s) relate to A1 in the inputted sumstats Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome Standardising column headers. First line of summary statistics file: SNP CHR BP A2 A1 FRQ BETA SE P
Summary statistics report:

Expected behaviour

I think when flip the FRQ and BETA, the A1 and A2 should be exchanged right?

2. Reproducible example

Here is my example gwas data: image

And here is the formmated data image

3. Session info

R version 4.2.2 Patched (2022-11-10 r83330) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.6 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] metafor_3.8-1 metadat_1.2-0 Matrix_1.6-1.1 TwoSampleMR_0.6.6 biomaRt_2.54.0
[6] MungeSumstats_1.6.0 dplyr_1.0.10 data.table_1.14.6

loaded via a namespace (and not attached): [1] nlme_3.1-160 bitops_1.0-7
[3] matrixStats_0.63.0 fs_1.5.2
[5] bit64_4.0.5 filelock_1.0.2
[7] progress_1.2.2 httr_1.4.4
[9] GenomeInfoDb_1.34.3 googleAuthR_2.0.0
[11] tools_4.2.2 utf8_1.2.2
[13] R6_2.5.1 DBI_1.1.3
[15] BiocGenerics_0.44.0 withr_2.5.0
[17] mnormt_2.1.1 tidyselect_1.2.0
[19] prettyunits_1.1.1 bit_4.0.5
[21] curl_4.3.3 compiler_4.2.2
[23] cli_3.6.3 Biobase_2.58.0
[25] xml2_1.3.3 DelayedArray_0.24.0
[27] rtracklayer_1.65.0 psych_2.2.9
[29] rappdirs_0.3.3 stringr_1.5.1
[31] digest_0.6.30 Rsamtools_2.14.0
[33] R.utils_2.12.2 XVector_0.38.0
[35] pkgconfig_2.0.3 MatrixGenerics_1.10.0
[37] dbplyr_2.2.1 fastmap_1.1.0
[39] BSgenome_1.66.3 rlang_1.1.3
[41] rstudioapi_0.14 RSQLite_2.2.18
[43] BiocIO_1.8.0 generics_0.1.3
[45] jsonlite_1.8.3 BiocParallel_1.32.1
[47] R.oo_1.25.0 VariantAnnotation_1.44.0
[49] RCurl_1.98-1.9 magrittr_2.0.3
[51] GenomeInfoDbData_1.2.9 Rcpp_1.0.11
[53] S4Vectors_0.36.2 fansi_1.0.3
[55] lifecycle_1.0.3 R.methodsS3_1.8.2
[57] stringi_1.7.8 yaml_2.3.6
[59] mathjaxr_1.6-0 SummarizedExperiment_1.28.0
[61] zlibbioc_1.44.0 plyr_1.8.8
[63] BiocFileCache_2.6.0 grid_4.2.2
[65] blob_1.2.3 crayon_1.5.2
[67] lattice_0.20-45 Biostrings_2.66.0
[69] GenomicFeatures_1.50.2 hms_1.1.2
[71] KEGGREST_1.38.0 pillar_1.8.1
[73] GenomicRanges_1.50.1 rjson_0.2.21
[75] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.23 [77] codetools_0.2-18 stats4_4.2.2
[79] XML_3.99-0.12 glue_1.6.2
[81] BiocManager_1.30.19 png_0.1-7
[83] vctrs_0.6.5 purrr_1.0.2
[85] tidyr_1.3.1 assertthat_0.2.1
[87] cachem_1.0.6 restfulr_0.0.15
[89] gargle_1.2.1 tibble_3.1.8
[91] GenomicAlignments_1.34.0 AnnotationDbi_1.60.0
[93] memoise_2.0.1 IRanges_2.32.0
[95] ellipsis_0.3.2

Al-Murphy commented 3 months ago

Hi, It looks like you're using a very old version of MSS (v1.6.0), could you try and update to the current release of v1.12.0 and let me know if this sorts your issue first?

Thanks, Alan.

htqqdd commented 3 months ago

Thank you for replying so quickly. I apologize that I seem to be providing the wrong Session information. I have tried different versions of MungeSumstats and I found that this issue happened in the latest version: v1.13.2 and when I reinstalled old version 1.6.0 the issue does not seem to exist. I've checkced the example SNP in dbSNP and it seems the ref allele should be G, so perhaps nothing should be flipped, I hope this helps you find the problem in v1.13.2. Thank you agiain for this helpful package.

Al-Murphy commented 3 months ago

Hey,

So I was able to replicate your issue and have found the error. See below for a detailed explanation of what was going on. This is the data before running MSS.

         SNP   CHR      BP     A1     A2      FRQ       BETA         SE          P
      <char> <num>   <num> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06

First thing to note is that A1 and A2 naming is ambiguous as we don't know which of A1 and A2 the effect values (like beta) relate to. This is not consistent across studies and varies massively. Because of this, MSS first checks to see if the effect columns relate to A1 or A2. Note MSS needs to know this info as it assumes all effect values relate to A2 column, making it the effect allele, so we need to correct for this. This is checked in infer_effect_column(). There are three checks made to infer which allele the effect/frequency information relates to:

  1. Check if ambiguous naming conventions are used (i.e. allele 1 and 2 or equivalent). If not exit, otherwise continue to next checks. This can be checked by using the mapping file and splitting A1/A2 mappings by those that contain 1 or 2 (ambiguous) or doesn't contain 1 or 2 e.g. effect, tested (unambiguous so fine for MSS to handle as is).
  2. Look for effect column/frequency column where the A1/A2 explicitly mentioned, if found then we know the direction and should update A1/A2 naming so A2 is the effect column. We can look for such columns by getting every combination of A1/A2 naming and effect/frq naming.
  3. If not found in 2, a final check should be against the reference genome, whichever of A1 and A2 has more of a match with the reference genome should be taken as not the effect allele. There is an assumption in this but is still better than guessing the ambiguous allele naming.

So for this data, MSS found ambiguous naming based on check 1:

Allele columns are ambiguous, attempting to infer direction

It then looked for allele specific effect info but couldn't find any based on check 2. So it then goes to check 3 which is the final, 'last chance' check which checks against the reference genome to see which of A1 or A2 has more matches to the reference genome. There is the assumption here that the minor allele will be the effect allele which may not necessarily be the case which is why this is a final check.

MSS prints a few messages about this step:

Effect/frq column(s) relate to A1 in the inputted sumstats
Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome

So it would appear it found that A2 i.e. A for rs1182172 was on the reference genome - meaning it is the major allele. So this means MSS infers that the effect columns relate to A1, the minor allele. However, MSS always has effect info relating to A2 rather than A1 so it swaps the A1 and A2 alleles. We can see this in the output of infer_effect_column() when I run the format_sumstats() code function by function:

         SNP   CHR      BP     A2     A1      FRQ       BETA         SE          P
      <char> <num>   <num> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06

So you can see that at this stage, A2 and A1 naming has been flipped but the effect columns stay the same. MSS relates these effect columns to A2 allele, G.

However finding the effect columns relating to A1 was a bug - it actually found it relates to A2. I have now implemented a fix and when we rerun the data with MSS we get the formatted output as:

         SNP   CHR      BP     A1     A2      FRQ       BETA         SE          P
      <char> <int>   <int> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06

This is available in MSS v1.13.3 and release v1.12.1. Thank you for bringing this to my attention. Please do try the new version out and make sure it works on your end too?

Code:

ss <- data.table("SNP"="rs1182172","CHR"=7,"BP"=2838469,"A1"="G", "A2"="A", 
                 "FRQ"=0.361569,"BETA"=-0.0453284, "SE"=0.00988289, 
                 "P"=4.5063e-06)

formatted_ss <- MungeSumstats::format_sumstats(ss,return_data = TRUE)