Open htqqdd opened 3 months ago
Hi, It looks like you're using a very old version of MSS (v1.6.0), could you try and update to the current release of v1.12.0 and let me know if this sorts your issue first?
Thanks, Alan.
Thank you for replying so quickly. I apologize that I seem to be providing the wrong Session information. I have tried different versions of MungeSumstats and I found that this issue happened in the latest version: v1.13.2 and when I reinstalled old version 1.6.0 the issue does not seem to exist. I've checkced the example SNP in dbSNP and it seems the ref allele should be G, so perhaps nothing should be flipped, I hope this helps you find the problem in v1.13.2. Thank you agiain for this helpful package.
Hey,
So I was able to replicate your issue and have found the error. See below for a detailed explanation of what was going on. This is the data before running MSS.
SNP CHR BP A1 A2 FRQ BETA SE P
<char> <num> <num> <char> <char> <num> <num> <num> <num>
1: rs1182172 7 2838469 G A 0.361569 -0.0453284 0.00988289 4.5063e-06
First thing to note is that A1 and A2 naming is ambiguous as we don't know which of A1 and A2 the effect values (like beta) relate to. This is not consistent across studies and varies massively. Because of this, MSS first checks to see if the effect columns relate to A1 or A2. Note MSS needs to know this info as it assumes all effect values relate to A2 column, making it the effect allele, so we need to correct for this. This is checked in infer_effect_column()
. There are three checks made to infer which allele the effect/frequency information relates to:
So for this data, MSS found ambiguous naming based on check 1:
Allele columns are ambiguous, attempting to infer direction
It then looked for allele specific effect info but couldn't find any based on check 2. So it then goes to check 3 which is the final, 'last chance' check which checks against the reference genome to see which of A1 or A2 has more matches to the reference genome. There is the assumption here that the minor allele will be the effect allele which may not necessarily be the case which is why this is a final check.
MSS prints a few messages about this step:
Effect/frq column(s) relate to A1 in the inputted sumstats
Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome
So it would appear it found that A2 i.e. A for rs1182172 was on the reference genome - meaning it is the major allele. So this means MSS infers that the effect columns relate to A1, the minor allele. However, MSS always has effect info relating to A2 rather than A1 so it swaps the A1 and A2 alleles. We can see this in the output of infer_effect_column()
when I run the format_sumstats()
code function by function:
SNP CHR BP A2 A1 FRQ BETA SE P
<char> <num> <num> <char> <char> <num> <num> <num> <num>
1: rs1182172 7 2838469 G A 0.361569 -0.0453284 0.00988289 4.5063e-06
So you can see that at this stage, A2 and A1 naming has been flipped but the effect columns stay the same. MSS relates these effect columns to A2 allele, G.
However finding the effect columns relating to A1 was a bug - it actually found it relates to A2. I have now implemented a fix and when we rerun the data with MSS we get the formatted output as:
SNP CHR BP A1 A2 FRQ BETA SE P
<char> <int> <int> <char> <char> <num> <num> <num> <num>
1: rs1182172 7 2838469 G A 0.361569 -0.0453284 0.00988289 4.5063e-06
This is available in MSS v1.13.3 and release v1.12.1. Thank you for bringing this to my attention. Please do try the new version out and make sure it works on your end too?
Code:
ss <- data.table("SNP"="rs1182172","CHR"=7,"BP"=2838469,"A1"="G", "A2"="A",
"FRQ"=0.361569,"BETA"=-0.0453284, "SE"=0.00988289,
"P"=4.5063e-06)
formatted_ss <- MungeSumstats::format_sumstats(ss,return_data = TRUE)
1. Bug description
The FRQ and BETA is flipped as expected, but A1 and A2 keeps original
Console output
**::NOTE::**
tempdir()
by default.save_path
( e.g.save_path=file.path('./formatted',basename(path))
), or make sure to copy files elsewhere after processing ( e.g.file.copy(save_path, './formatted/' )
.Formatted summary statistics will be saved to ==> /tmp/Rtmp9w0Cmk/file2426307a5ba18f.tsv.gz Infer Effect Column First line of summary statistics file: SNP CHR BP A1 A2 FRQ BETA SE P
Allele columns are ambiguous, attempting to infer direction Standardising column headers. First line of summary statistics file: SNP CHR BP A1 A2 FRQ BETA SE P
Loading SNPlocs data. Loading reference genome data. Preprocessing RSIDs. Validating RSIDs of 1 SNPs using BSgenome::snpsById... BSgenome::snpsById done in 10 seconds. Effect/frq column(s) relate to A1 in the inputted sumstats Found direction from matchine reference genome - NOTE this assumes non-effect allele will macth the reference genome Standardising column headers. First line of summary statistics file: SNP CHR BP A2 A1 FRQ BETA SE P
Summary statistics report:
Expected behaviour
I think when flip the FRQ and BETA, the A1 and A2 should be exchanged right?
2. Reproducible example
Here is my example gwas data:
And here is the formmated data
3. Session info
R version 4.2.2 Patched (2022-11-10 r83330) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.6 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats graphics grDevices utils datasets methods base
other attached packages: [1] metafor_3.8-1 metadat_1.2-0 Matrix_1.6-1.1 TwoSampleMR_0.6.6 biomaRt_2.54.0
[6] MungeSumstats_1.6.0 dplyr_1.0.10 data.table_1.14.6
loaded via a namespace (and not attached): [1] nlme_3.1-160 bitops_1.0-7
[3] matrixStats_0.63.0 fs_1.5.2
[5] bit64_4.0.5 filelock_1.0.2
[7] progress_1.2.2 httr_1.4.4
[9] GenomeInfoDb_1.34.3 googleAuthR_2.0.0
[11] tools_4.2.2 utf8_1.2.2
[13] R6_2.5.1 DBI_1.1.3
[15] BiocGenerics_0.44.0 withr_2.5.0
[17] mnormt_2.1.1 tidyselect_1.2.0
[19] prettyunits_1.1.1 bit_4.0.5
[21] curl_4.3.3 compiler_4.2.2
[23] cli_3.6.3 Biobase_2.58.0
[25] xml2_1.3.3 DelayedArray_0.24.0
[27] rtracklayer_1.65.0 psych_2.2.9
[29] rappdirs_0.3.3 stringr_1.5.1
[31] digest_0.6.30 Rsamtools_2.14.0
[33] R.utils_2.12.2 XVector_0.38.0
[35] pkgconfig_2.0.3 MatrixGenerics_1.10.0
[37] dbplyr_2.2.1 fastmap_1.1.0
[39] BSgenome_1.66.3 rlang_1.1.3
[41] rstudioapi_0.14 RSQLite_2.2.18
[43] BiocIO_1.8.0 generics_0.1.3
[45] jsonlite_1.8.3 BiocParallel_1.32.1
[47] R.oo_1.25.0 VariantAnnotation_1.44.0
[49] RCurl_1.98-1.9 magrittr_2.0.3
[51] GenomeInfoDbData_1.2.9 Rcpp_1.0.11
[53] S4Vectors_0.36.2 fansi_1.0.3
[55] lifecycle_1.0.3 R.methodsS3_1.8.2
[57] stringi_1.7.8 yaml_2.3.6
[59] mathjaxr_1.6-0 SummarizedExperiment_1.28.0
[61] zlibbioc_1.44.0 plyr_1.8.8
[63] BiocFileCache_2.6.0 grid_4.2.2
[65] blob_1.2.3 crayon_1.5.2
[67] lattice_0.20-45 Biostrings_2.66.0
[69] GenomicFeatures_1.50.2 hms_1.1.2
[71] KEGGREST_1.38.0 pillar_1.8.1
[73] GenomicRanges_1.50.1 rjson_0.2.21
[75] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.23 [77] codetools_0.2-18 stats4_4.2.2
[79] XML_3.99-0.12 glue_1.6.2
[81] BiocManager_1.30.19 png_0.1-7
[83] vctrs_0.6.5 purrr_1.0.2
[85] tidyr_1.3.1 assertthat_0.2.1
[87] cachem_1.0.6 restfulr_0.0.15
[89] gargle_1.2.1 tibble_3.1.8
[91] GenomicAlignments_1.34.0 AnnotationDbi_1.60.0
[93] memoise_2.0.1 IRanges_2.32.0
[95] ellipsis_0.3.2