`format_sumstats()` removes rows with unneeded columns that contain `NA` values

rmgpanw commented 7 months ago

1. Bug description

Any column that contains NA values will cause those rows to be removed, even if the column is not necessary

Expected behaviour

Remove rows with missing values, but only if those missing values are in essential columns (e.g. "CHR", "POS", "BETA" etc)

2. Reproducible example

Dummy data, including an "EXTRA" column containing NAs:

library(MungeSumstats)

# dummy sumstats - 'EXTRA' column is all `NA`
sumstats <- tibble::tribble(
                  ~CHR,     ~BP,         ~SNP, ~ALLELE0, ~ALLELE1,   ~A1FREQ,    ~INFO,    ~N, ~TEST,     ~BETA,      ~SE,   ~CHISQ,  ~LOG10P, ~EXTRA,       ~P,      ~MAF,
                    1L, 768448L, "rs12562034",      "A",      "G",   0.88078,        1, 1103L, "ADD",  0.031726, 0.073348, 0.271736, 0.272182,     NA, 0.534341,   0.11922,
                    1L, 779322L,  "rs4040617",      "G",      "A",  0.864007,        1, 1103L, "ADD",  0.045622, 0.188145, 0.375913, 0.173862,     NA, 0.670098,  0.135993,
                    1L, 846808L,  "rs4475691",      "T",      "C",  0.787725, 0.999199, 1103L, "ADD", -0.021146, 0.223333, 0.216948,  0.15269,     NA, 0.703574,  0.212275,
                    1L, 854250L,  "rs7537756",      "G",      "A",  0.780543, 0.999645, 1103L, "ADD", -0.038013, 0.329377, 0.181226, 0.536016,     NA, 0.291061,  0.219457,
                    1L, 863124L,  "rs4040604",      "T",      "G", 0.0227028, 0.998387, 1103L, "ADD",  0.005006, 0.357811, 0.528155,  0.91415,     NA, 0.121857, 0.0227028,
                    1L, 870645L, "rs28576697",      "C",      "T",  0.709148, 0.960746, 1103L, "ADD",  0.013644, 0.089277, 0.169717, 0.482062,     NA, 0.329563,  0.290852,
                    1L, 879317L,  "rs7523549",      "T",      "C",     0.976, 0.983244, 1103L, "ADD", -0.077923, 0.139721, 0.197346, 0.538192,     NA, 0.289606,     0.024,
                    1L, 894573L, "rs13303010",      "A",      "G", 0.0965549,        1, 1103L, "ADD",  0.036839, 0.540228, 0.203537, 0.064934,     NA, 0.861125, 0.0965549,
                    1L, 918573L,  "rs2341354",      "G",      "A",  0.398912,        1, 1103L, "ADD", -0.007327, 0.224223, 0.182912, 0.050807,     NA, 0.889597,  0.398912,
                    1L, 940203L, "rs35940137",      "A",      "G",  0.937443,        1, 1103L, "ADD", -0.034904, 0.462789, 0.008603, 0.254472,     NA, 0.556581,  0.062557
                  )

All rows are removed by format_sumstats() due to the NA values, even though this "EXTRA" column is not needed:

# Formatted including 'EXTRA' column of `NA` values
MungeSumstats::format_sumstats(sumstats,
                               ref_genome = "GRCh37", 
                               return_data = TRUE)
#> 
#> 
#> ******::NOTE::******
#>  - Formatted results will be saved to `tempdir()` by default.
#>  - This means all formatted summary stats will be deleted upon ending the R session.
#>  - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
#>  ********************
#> Formatted summary statistics will be saved to ==>  /tmp/RtmpCMNKUv/file191365fac0c81.tsv.gz
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'SNPlocs.Hsapiens.dbSNP155.GRCh37'
#> Standardising column headers.
#> First line of summary statistics file:
#> CHR  BP  SNP ALLELE0 ALLELE1 A1FREQ  INFO    N   TEST    BETA    SE  CHISQ   LOG10P  EXTRA   P   MAF 
#> Summary statistics report:
#>    - 10 rows
#>    - 10 unique variants
#>    - 0 genome-wide significant variants (P<5e-8)
#>    - 1 chromosomes
#> Checking for multi-GWAS.
#> Checking for multiple RSIDs on one row.
#> Checking SNP RSIDs.
#> Checking for merged allele column.
#> Checking A1 is uppercase
#> Checking A2 is uppercase
#> Checking for incorrect base-pair positions
#> Ensuring all SNPs are on the reference genome.
#> Loading SNPlocs data.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 10 SNPs using BSgenome::snpsById...
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> BSgenome::snpsById done in 104 seconds.
#> Checking for correct direction of A1 (reference) and A2 (alternative allele).
#> There are 10 SNPs where A1 doesn't match the reference genome.
#> These will be flipped with their effect columns.
#> Reordering so first three column headers are SNP, CHR and BP in this order.
#> Reordering so the fourth and fifth columns are A1 and A2.
#> Checking for missing data.
#> WARNING: 10 rows in sumstats file are missing data and will be removed.
#> Error in check_miss_data(sumstats_dt = sumstats_return$sumstats_dt, path = path, : All SNPs have been filtered out of  your summary statistics dataset

Works as expected as long as "EXTRA" column is removed:

# Formatted excluding 'EXTRA' column of `NA` values
sumstats |>
  dplyr::select(-EXTRA) |>
  MungeSumstats::format_sumstats(ref_genome = "GRCh37",
                                 return_data = TRUE)
#> 
#> 
#> ******::NOTE::******
#>  - Formatted results will be saved to `tempdir()` by default.
#>  - This means all formatted summary stats will be deleted upon ending the R session.
#>  - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
#>  ********************
#> Formatted summary statistics will be saved to ==>  /tmp/RtmpCMNKUv/file19136306a59cb.tsv.gz
#> Standardising column headers.
#> First line of summary statistics file:
#> CHR  BP  SNP ALLELE0 ALLELE1 A1FREQ  INFO    N   TEST    BETA    SE  CHISQ   LOG10P  P   MAF 
#> Summary statistics report:
#>    - 10 rows
#>    - 10 unique variants
#>    - 0 genome-wide significant variants (P<5e-8)
#>    - 1 chromosomes
#> Checking for multi-GWAS.
#> Checking for multiple RSIDs on one row.
#> Checking SNP RSIDs.
#> Checking for merged allele column.
#> Checking A1 is uppercase
#> Checking A2 is uppercase
#> Checking for incorrect base-pair positions
#> Ensuring all SNPs are on the reference genome.
#> Loading SNPlocs data.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 10 SNPs using BSgenome::snpsById...
#> BSgenome::snpsById done in 48 seconds.
#> Checking for correct direction of A1 (reference) and A2 (alternative allele).
#> There are 10 SNPs where A1 doesn't match the reference genome.
#> These will be flipped with their effect columns.
#> Reordering so first three column headers are SNP, CHR and BP in this order.
#> Reordering so the fourth and fifth columns are A1 and A2.
#> Checking for missing data.
#> Checking for duplicate columns.
#> Checking for duplicate SNPs from SNP ID.
#> Checking for SNPs with duplicated base-pair positions.
#> Filtering SNPs based on INFO score.
#> All rows have INFO>=0.9
#> Filtering SNPs, ensuring SE>0.
#> Ensuring all SNPs have N<5 std dev above mean.
#> Removing 'chr' prefix from CHR.
#> Making X/Y/MT CHR uppercase.
#> Checking for bi-allelic SNPs.
#> N already exists within sumstats_dt.
#> 10 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
#> The FRQ column was mapped from one of the following from the inputted  summary statistics file:
#> FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
#> As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
#> set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
#> Sorting coordinates with 'data.table'.
#> Writing in tabular format ==> /tmp/RtmpCMNKUv/file19136306a59cb.tsv.gz
#> Summary statistics report:
#>    - 10 rows (100% of original 10 rows)
#>    - 10 unique variants
#>    - 0 genome-wide significant variants (P<5e-8)
#>    - 1 chromosomes
#> Done munging in 0.812 minutes.
#> Successfully finished preparing sumstats file, preview:
#> Reading header.
#>           SNP CHR     BP A1 A2   A1FREQ     INFO    N TEST      BETA       SE
#> 1: rs12562034   1 768448  G  A 0.880780 1.000000 1103  ADD -0.031726 0.073348
#> 2:  rs4040617   1 779322  A  G 0.864007 1.000000 1103  ADD -0.045622 0.188145
#> 3:  rs4475691   1 846808  C  T 0.787725 0.999199 1103  ADD  0.021146 0.223333
#> 4:  rs7537756   1 854250  A  G 0.780543 0.999645 1103  ADD  0.038013 0.329377
#>       CHISQ   LOG10P        P      FRQ
#> 1: 0.271736 0.272182 0.534341 0.880780
#> 2: 0.375913 0.173862 0.670098 0.864007
#> 3: 0.216948 0.152690 0.703574 0.787725
#> 4: 0.181226 0.536016 0.291061 0.780543
#> Returning data directly.
#>            SNP CHR     BP A1 A2    A1FREQ     INFO    N TEST      BETA       SE
#>  1: rs12562034   1 768448  G  A 0.8807800 1.000000 1103  ADD -0.031726 0.073348
#>  2:  rs4040617   1 779322  A  G 0.8640070 1.000000 1103  ADD -0.045622 0.188145
#>  3:  rs4475691   1 846808  C  T 0.7877250 0.999199 1103  ADD  0.021146 0.223333
#>  4:  rs7537756   1 854250  A  G 0.7805430 0.999645 1103  ADD  0.038013 0.329377
#>  5:  rs4040604   1 863124  G  T 0.0227028 0.998387 1103  ADD -0.005006 0.357811
#>  6: rs28576697   1 870645  T  C 0.7091480 0.960746 1103  ADD -0.013644 0.089277
#>  7:  rs7523549   1 879317  C  T 0.9760000 0.983244 1103  ADD  0.077923 0.139721
#>  8: rs13303010   1 894573  G  A 0.0965549 1.000000 1103  ADD -0.036839 0.540228
#>  9:  rs2341354   1 918573  A  G 0.3989120 1.000000 1103  ADD  0.007327 0.224223
#> 10: rs35940137   1 940203  G  A 0.9374430 1.000000 1103  ADD  0.034904 0.462789
#>        CHISQ   LOG10P        P       FRQ
#>  1: 0.271736 0.272182 0.534341 0.8807800
#>  2: 0.375913 0.173862 0.670098 0.8640070
#>  3: 0.216948 0.152690 0.703574 0.7877250
#>  4: 0.181226 0.536016 0.291061 0.7805430
#>  5: 0.528155 0.914150 0.121857 0.9772972
#>  6: 0.169717 0.482062 0.329563 0.7091480
#>  7: 0.197346 0.538192 0.289606 0.9760000
#>  8: 0.203537 0.064934 0.861125 0.9034451
#>  9: 0.182912 0.050807 0.889597 0.6010880
#> 10: 0.008603 0.254472 0.556581 0.9374430

^{Created on 2024-04-22 with reprex v2.0.2}

3. Session info

``` R version 4.3.0 (2023-04-21) Platform: x86_64-slackware-linux-gnu (64-bit) Running under: Slackware 14.2 x86_64 (post 14.2 -current) Matrix products: default BLAS: /usr/lib64/R/lib/libRblas.so LAPACK: /usr/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: Europe/London tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] reprex_2.0.2 GenomeInfoDb_1.36.1 IRanges_2.34.1 S4Vectors_0.38.1 BiocGenerics_0.46.0 gtexr_0.0.0.9000 [7] otargen_1.1.0 flextable_0.9.4 targets_1.2.0 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.1 [13] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.4.4 [19] tidyverse_2.0.0 loaded via a namespace (and not attached): [1] later_1.3.2 BiocIO_1.10.0 filelock_1.0.3 [4] bitops_1.0-7 R.oo_1.25.0 SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.24 [7] XML_3.99-0.16 lifecycle_1.0.4 sf_1.0-15 [10] gtsummary_1.7.2 mixsqp_0.3-54 rprojroot_2.0.4 [13] globals_0.16.2 processx_3.8.3 lattice_0.21-8 [16] backports_1.4.1 magrittr_2.0.3 plotly_4.10.3 [19] rmarkdown_2.25 yaml_2.3.8 httpuv_1.6.13 [22] zip_2.3.0 askpass_1.2.0 DBI_1.2.0 [25] zlibbioc_1.46.0 rvest_1.0.3 GenomicRanges_1.52.0 [28] R.utils_2.12.3 RCurl_1.98-1.13 rappdirs_0.3.3 [31] VariantAnnotation_1.46.0 gdtools_0.3.5 reactable_0.4.4 [34] GenomeInfoDbData_1.2.10 ggrepel_0.9.4 irlba_2.3.5.1 [37] listenv_0.9.0 crul_1.4.0 units_0.8-5 [40] parallelly_1.36.0 DelayedArray_0.26.3 codetools_0.2-19 [43] xml2_1.3.6 tidyselect_1.2.1 httpcode_0.3.0 [46] farver_2.1.1 viridis_0.6.4 BiocFileCache_2.8.0 [49] matrixStats_1.0.0 GenomicAlignments_1.36.0 broom.helpers_1.14.0 [52] jsonlite_1.8.8 e1071_1.7-14 ellipsis_0.3.2 [55] systemfonts_1.0.5 progress_1.2.3 tools_4.3.0 [58] ragg_1.2.7 Rcpp_1.0.11 ggVennDiagram_1.2.2 [61] glue_1.7.0 gridExtra_2.3 xfun_0.41 [64] here_1.0.1 MatrixGenerics_1.12.2 withr_3.0.0 [67] fastmap_1.1.1 LDlinkR_1.3.0 fansi_1.0.6 [70] openssl_2.1.1 callr_3.7.3 digest_0.6.33 [73] timechange_0.2.0 R6_2.5.1 mime_0.12 [76] textshaping_0.3.7 colorspace_2.1-0 biomaRt_2.56.1 [79] RSQLite_2.3.4 R.methodsS3_1.8.2 utf8_1.2.4 [82] generics_0.1.3 fontLiberation_0.1.0 renv_0.15.5 [85] data.table_1.14.10 rtracklayer_1.60.0 class_7.3-22 [88] prettyunits_1.2.0 S4Arrays_1.0.4 httr_1.4.7 [91] htmlwidgets_1.6.2 pkgconfig_2.0.3 gtable_0.3.4 [94] blob_1.2.4 XVector_0.40.0 tarchetypes_0.7.7 [97] furrr_0.3.1 htmltools_0.5.5 fontBitstreamVera_0.1.1 [100] susieR_0.12.35 graphql_1.5.1 base64url_1.4 [103] future.callr_0.8.2 Biobase_2.60.0 scales_1.3.0 [106] png_0.1-8 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 knitr_1.45 [109] rstudioapi_0.15.0 rjson_0.2.21 tzdb_0.4.0 [112] ghql_0.1.0 uuid_1.1-1 curl_5.2.0 [115] cachem_1.0.8 proxy_0.4-27 RVenn_1.1.0 [118] KernSmooth_2.23-22 parallel_4.3.0 AnnotationDbi_1.62.1 [121] restfulr_0.0.15 pillar_1.9.0 grid_4.3.0 [124] reshape_0.8.9 vctrs_0.6.5 promises_1.2.1 [127] dbplyr_2.5.0 xtable_1.8-4 evaluate_0.23 [130] GenomicFeatures_1.52.1 cli_3.6.2 compiler_4.3.0 [133] Rsamtools_2.16.0 rlang_1.1.2 crayon_1.5.2 [136] labeling_0.4.3 classInt_0.4-10 ps_1.7.5 [139] plyr_1.8.8 fs_1.6.3 stringi_1.8.3 [142] coloc_5.2.3 viridisLite_0.4.2 BiocParallel_1.34.2 [145] assertthat_0.2.1 munsell_0.5.0 Biostrings_2.68.1 [148] lazyeval_0.2.2 fontquiver_0.2.1 Matrix_1.5-4 [151] MungeSumstats_1.8.0 BSgenome_1.68.0 hms_1.1.3 [154] bit64_4.0.5 future_1.33.1 gfonts_0.2.0 [157] KEGGREST_1.40.0 shiny_1.8.0 SummarizedExperiment_1.30.2 [160] googleAuthR_2.0.1 corrcoverage_1.2.1 gargle_1.5.2 [163] memoise_2.0.1 igraph_1.6.0 gt_0.10.0 [166] datapasta_3.1.0 bit_4.0.5 officer_0.6.3 ```

Al-Murphy commented 7 months ago

This same issue was discussed in the later half of here. I agree this is a sensible idea and can be implemented as follows:

Add a parameter where the user can list the columns they want tested for missing values. By default this will be all SNP (SNP, BP, A1, A2), effect columns (BETA, Z etc) and other more essential columns (FRQ, N). This parameter would be an easy modification to check_miss_data.R which is called in format_sumstats.R (so the parameter would need to be added to that function too).

Unfortunately, I don't have time to implement this myself now, if you have the time to add this functionality and submit a PR, I'll happily review and add it. I think it could be very useful too.

rmgpanw commented 7 months ago

Thanks for getting back to me so quickly, and with clear instructions. I will aim to give this a go later this week (hopefully) and keep you posted.

Al-Murphy / MungeSumstats