Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

`format_sumstats()` duplicates base letter between `A1` and `A2` #186

Closed rmgpanw closed 5 months ago

rmgpanw commented 5 months ago

1. Bug description

format_sumstats() duplicates base letter for A1 and A2 with certain input column headers

Console output

Please see reprex

2. Reproducible example

Code

input <- tibble::tribble(
   ~RSID,     ~BETA,      ~SE,   ~CHISQ,       ~P, ~EA, ~OA,  ~LOG10P,
   "rs1", -0.013225, 0.106049, 0.044829, 0.767306, "A", "T", 0.115031,
   "rs2",  0.007146, 0.101943, 0.025325, 0.884257, "T", "A", 0.053421,
   "rs3",  0.000349, 0.069104, 0.020021, 0.584778, "G", "C", 0.233009,
   "rs4",  0.031215, 0.003513, 0.190675, 0.085615, "A", "T",  1.06745,
   "rs5", -0.009978, 0.111421, 0.148457, 0.852646, "C", "G", 0.069231,
   "rs6",   0.01134,  0.69912, 0.128844, 0.341856, "A", "T", 0.466157,
   "rs7",  0.050145, 0.196447, 0.249215,  0.84104, "T", "A", 0.075183,
   "rs8", -0.041927, 0.160814, 0.386157, 0.640864, "G", "C", 0.193234,
   "rs9", -0.024046, 0.188661, 0.261336,  0.79117, "A", "T",  0.10173,
  "rs10",  -0.03573, 0.301911, 0.220919, 0.783168, "C", "G", 0.106145
  )

MungeSumstats::format_sumstats(
  path = input,

  ref_genome = "GRCh37",

  nThread = 1
)
#> 
#> 
#> ******::NOTE::******
#>  - Formatted results will be saved to `tempdir()` by default.
#>  - This means all formatted summary stats will be deleted upon ending the R session.
#>  - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
#>  ********************
#> Formatted summary statistics will be saved to ==>  /tmp/RtmpPHIiPQ/file119027f32d14f.tsv.gz
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'SNPlocs.Hsapiens.dbSNP155.GRCh37'
#> Standardising column headers.
#> First line of summary statistics file:
#> RSID BETA    SE  CHISQ   P   EA  OA  LOG10P  
#> Summary statistics report:
#>    - 10 rows
#>    - 10 unique variants
#>    - 0 genome-wide significant variants (P<5e-8)
#> Checking for multi-GWAS.
#> Checking for multiple RSIDs on one row.
#> Checking SNP RSIDs.
#> Checking for merged allele column.
#> Checking A2 is uppercase
#> Summary statistics file does not have obvious CHR/BP columns. Checking to see if they are joined in another column.
#> Standardising column headers.
#> First line of summary statistics file:
#> SNP  BETA    SE  CHISQ   P   A2  OA  LOG10P  
#> Loading SNPlocs data.
#> There is no Chromosome or Base Pair Position column found within the data. It must be inferred from other column information.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 10 SNPs using BSgenome::snpsById...
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> BSgenome::snpsById done in 110 seconds.
#> Ensuring all SNPs are on the reference genome.
#> There is no A1 or A2 allele information column found within the data. It must be inferred from  other column information.
#> One of A1/A2 are missing, allele flipping will be tested
#> Deriving A1 from reference genome
#> Checking for correct direction of A1 (reference) and A2 (alternative allele).
#> There are 4 SNPs where A1 doesn't match the reference genome.
#> These will be flipped with their effect columns.
#> Checking for missing data.
#> Checking for duplicate columns.
#> Checking for duplicate SNPs from SNP ID.
#> Checking for SNPs with duplicated base-pair positions.
#> INFO column not available. Skipping INFO score filtering step.
#> Filtering SNPs, ensuring SE>0.
#> Ensuring all SNPs have N<5 std dev above mean.
#> Removing 'chr' prefix from CHR.
#> Making X/Y/MT CHR uppercase.
#> Checking for bi-allelic SNPs.
#> 1 SNPs are non-biallelic. These will be removed.
#> Warning: When method is an integer, must be >0.
#> Sorting coordinates with 'data.table'.
#> Writing in tabular format ==> /tmp/RtmpPHIiPQ/file119027f32d14f.tsv.gz
#> Summary statistics report:
#>    - 6 rows (60% of original 10 rows)
#>    - 6 unique variants
#>    - 0 genome-wide significant variants (P<5e-8)
#>    - 2 chromosomes
#> Done munging in 1.839 minutes.
#> Successfully finished preparing sumstats file, preview:
#> Reading header.
#>    SNP CHR       BP A1 A2      BETA       SE    CHISQ        P OA   LOG10P
#> 1: rs6   7 91747131  A  A -0.011340 0.699120 0.128844 0.341856  T 0.466157
#> 2: rs7   7 91779557  T  T -0.050145 0.196447 0.249215 0.841040  A 0.075183
#> 3: rs5   7 91839110  C  C  0.009978 0.111421 0.148457 0.852646  G 0.069231
#> 4: rs8   7 92408329  C  G -0.041927 0.160814 0.386157 0.640864  C 0.193234
#> Returning path to saved data.
#> [1] "/tmp/RtmpPHIiPQ/file119027f32d14f.tsv.gz"

Created on 2024-05-01 with reprex v2.0.2

Expected behaviour

Expect output from above to:

3. Session info

``` > utils::sessionInfo() R version 4.3.0 (2023-04-21) Platform: x86_64-slackware-linux-gnu (64-bit) Running under: Slackware 14.2 x86_64 (post 14.2 -current) Matrix products: default BLAS: /usr/lib64/R/lib/libRblas.so LAPACK: /usr/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: Europe/London tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] reprex_2.0.2 GenomeInfoDb_1.36.1 IRanges_2.34.1 S4Vectors_0.38.1 BiocGenerics_0.46.0 data.table_1.14.10 lubridate_1.9.2 forcats_1.0.0 [9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 [17] targets_1.2.0 loaded via a namespace (and not attached): [1] BiocIO_1.10.0 bitops_1.0-7 filelock_1.0.3 R.oo_1.25.0 [5] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.24 SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24 XML_3.99-0.16 [9] lifecycle_1.0.4 sf_1.0-15 mixsqp_0.3-54 rprojroot_2.0.4 [13] globals_0.16.2 processx_3.8.3 lattice_0.21-8 vroom_1.6.5 [17] backports_1.4.1 magrittr_2.0.3 rmarkdown_2.25 yaml_2.3.8 [21] DBI_1.2.0 zlibbioc_1.46.0 rvest_1.0.3 GenomicRanges_1.52.0 [25] R.utils_2.12.3 RCurl_1.98-1.13 VariantAnnotation_1.46.0 rappdirs_0.3.3 [29] GenomeInfoDbData_1.2.10 irlba_2.3.5.1 listenv_0.9.0 units_0.8-5 [33] parallelly_1.36.0 codetools_0.2-19 DelayedArray_0.26.3 xml2_1.3.6 [37] tidyselect_1.2.1 farver_2.1.1 viridis_0.6.4 matrixStats_1.0.0 [41] BiocFileCache_2.8.0 GenomicAlignments_1.36.0 jsonlite_1.8.7 e1071_1.7-14 [45] tools_4.3.0 progress_1.2.3 Rcpp_1.0.11 ggVennDiagram_1.2.2 [49] glue_1.7.0 gridExtra_2.3 xfun_0.41 here_1.0.1 [53] MatrixGenerics_1.12.2 withr_3.0.0 fastmap_1.1.1 LDlinkR_1.3.0 [57] fansi_1.0.6 callr_3.7.3 digest_0.6.31 timechange_0.2.0 [61] R6_2.5.1 colorspace_2.1-0 biomaRt_2.56.1 RSQLite_2.3.4 [65] R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3 renv_0.15.5 [69] rtracklayer_1.60.0 class_7.3-22 prettyunits_1.2.0 httr_1.4.7 [73] S4Arrays_1.0.4 pkgconfig_2.0.3 gtable_0.3.4 blob_1.2.4 [77] XVector_0.40.0 tarchetypes_0.7.7 furrr_0.3.1 htmltools_0.5.5 [81] susieR_0.12.35 base64url_1.4 future.callr_0.8.2 scales_1.3.0 [85] Biobase_2.60.0 png_0.1-8 knitr_1.45 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [89] rstudioapi_0.15.0 tzdb_0.4.0 rjson_0.2.21 curl_5.2.1 [93] proxy_0.4-27 cachem_1.0.8 RVenn_1.1.0 KernSmooth_2.23-22 [97] parallel_4.3.0 AnnotationDbi_1.62.1 restfulr_0.0.15 pillar_1.9.0 [101] grid_4.3.0 reshape_0.8.9 vctrs_0.6.5 dbplyr_2.5.0 [105] evaluate_0.23 GenomicFeatures_1.52.1 cli_3.6.2 compiler_4.3.0 [109] Rsamtools_2.16.0 rlang_1.1.3 crayon_1.5.2 labeling_0.4.3 [113] classInt_0.4-10 ps_1.7.5 plyr_1.8.8 fs_1.6.3 [117] stringi_1.8.3 coloc_5.2.3 viridisLite_0.4.2 BiocParallel_1.34.2 [121] assertthat_0.2.1 munsell_0.5.0 Biostrings_2.68.1 Matrix_1.5-4 [125] MungeSumstats_1.8.0 BSgenome_1.68.0 hms_1.1.3 bit64_4.0.5 [129] future_1.33.1 KEGGREST_1.40.0 SummarizedExperiment_1.30.2 googleAuthR_2.0.1 [133] corrcoverage_1.2.1 gargle_1.5.2 igraph_1.6.0 memoise_2.0.1 [137] datapasta_3.1.0 bit_4.0.5 ```
Al-Murphy commented 5 months ago

Hey! This is a simple fix, the issue is that OA is not present in the column header mapping file:

data("sumstatsColHeaders")
#look in this dataframe
sumstatsColHeaders

MSS then tried to input A1 from the reference genome and that's what you are seeing in the A1 column not just duplicates.

I'm surprised I hadn't caught this matching for A1 before but just shows the real lack of standardisation in column naming. Here is the output with this added:

      SNP   CHR       BP     A1     A2      BETA       SE    CHISQ        P   LOG10P
   <char> <int>    <int> <char> <char>     <num>    <num>    <num>    <num>    <num>
1:    rs6     7 91747131      A      T -0.011340 0.699120 0.128844 0.341856 0.466157
2:    rs7     7 91779557      T      A -0.050145 0.196447 0.249215 0.841040 0.075183
3:    rs5     7 91839110      C      G  0.009978 0.111421 0.148457 0.852646 0.069231
4:    rs8     7 92408329      C      G -0.041927 0.160814 0.386157 0.640864 0.193234

I've added this in v 1.13.1, this is the new devel branch (Bioc 3.20) that was just released over the past few days. If yuou don't want to swap to this version just add OA -> A1 to sumstatsColHeaders on your own version, details here: https://github.com/neurogenomics/MungeSumstats/blob/master/R/data.R