Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Column name handling: "NGT" #97

Closed bschilder closed 2 years ago

bschilder commented 2 years ago

1. Bug description

Just came across a situation where "NGT" was a column in one of the PGC GWAS, where it indicates "the number of sub-cohorts in which the variant is genotyped". So all the values are, of course, very small. However, MungeSumstats (MSS) maps "NGT" onto "N", making it seem like the sample sizes are super tiny!

Have you seen cases where "NGT" actually means "N", or was this mapping inherited from elsewhere? If both are possible, then perhaps we could at least add a warning saying "NGT" is ambiguous, or maybe just remove it from the mapping file entirely.
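A rough sketch of what the warning option could look like, using base R only. Note that `warn_ambiguous_cols()` and its `ambiguous` argument are hypothetical names made up for illustration; they are not part of MungeSumstats, and the real check would need to live wherever the header mapping is applied.

```r
# Hypothetical helper (not part of MungeSumstats): warn when a header
# contains column names whose meaning differs between studies.
warn_ambiguous_cols <- function(col_names, ambiguous = c("NGT")) {
  # Compare case-insensitively, since headers vary in capitalisation
  hits <- col_names[toupper(col_names) %in% toupper(ambiguous)]
  if (length(hits) > 0) {
    warning(
      "Ambiguous column name(s) found: ",
      paste(hits, collapse = ", "),
      ". Check their meaning before mapping them to a standard name.",
      call. = FALSE
    )
  }
  # Return the matches invisibly so callers can act on them
  invisible(hits)
}

# Toy header resembling the PGC-style file described above
hits <- warn_ambiguous_cols(c("SNP", "CHR", "BP", "P", "NGT"))
print(hits)
```

Returning the matched names lets the caller decide whether to drop, rename, or keep the column, rather than hard-coding one behaviour.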

Console output

Screenshot 2022-03-24 at 14 16 40

Expected behaviour

Fix all problems in biology.

2. Reproducible example

Code

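# Load the bundled example summary statistics without standardising them
dat <- MungeSumstats::formatted_example(formatted = FALSE)
# Add an "NGT" column (number of sub-cohorts genotyped, not a sample size)
dat$NGT <- 4
# Standardise the header; "NGT" gets renamed to "N"
dat2 <- MungeSumstats::standardise_header(sumstats_dt = dat, return_list = FALSE)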

3. Session info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
  [1] utf8_1.2.2 reticulate_1.24-9000 R.utils_2.11.0 tidyselect_1.1.2
  [5] RSQLite_2.2.11 AnnotationDbi_1.56.2 htmlwidgets_1.5.4 grid_4.1.0
  [9] BiocParallel_1.28.3 XGR_1.1.8 munsell_0.5.0 DT_0.21
 [13] colorspace_2.0-3 OrganismDbi_1.36.0 Biobase_2.54.0 filelock_1.0.2
 [17] knitr_1.37 supraHex_1.32.0 rstudioapi_0.13 stats4_4.1.0
 [21] DescTools_0.99.44 MatrixGenerics_1.6.0 GenomeInfoDbData_1.2.7 mixsqp_0.3-43
 [25] bit64_4.0.5 echoconda_0.99.5 vctrs_0.3.8 generics_0.1.2
 [29] xfun_0.30 biovizBase_1.42.0 BiocFileCache_2.2.1 R6_2.5.1
 [33] GenomeInfoDb_1.30.1 AnnotationFilter_1.18.0 bitops_1.0-7 cachem_1.0.6
 [37] reshape_0.8.8 DelayedArray_0.20.0 assertthat_0.2.1 BiocIO_1.4.0
 [41] scales_1.1.1 nnet_7.3-17 rootSolve_1.8.2.3 gtable_0.3.0
 [45] lmom_2.8 ggbio_1.42.0 ensembldb_2.18.3 rlang_1.0.2
 [49] clisymbols_1.2.0 MungeSumstats_1.3.16 echodata_0.99.7 splines_4.1.0
 [53] lazyeval_0.2.2 rtracklayer_1.54.0 gargle_1.2.0 dichromat_2.0-0
 [57] hexbin_1.28.2 checkmate_2.0.0 BiocManager_1.30.16 yaml_2.3.5
 [61] reshape2_1.4.4 backports_1.4.1 snpStats_1.44.0 GenomicFeatures_1.46.5
 [65] ggnetwork_0.5.10 Hmisc_4.6-0 RBGL_1.70.0 tools_4.1.0
 [69] echoplot_0.99.2 ggplot2_3.3.5 ellipsis_0.3.2 RColorBrewer_1.1-2
 [73] proxy_0.4-26 BiocGenerics_0.40.0 coloc_5.1.2 Rcpp_1.0.8.3
 [77] plyr_1.8.6 base64enc_0.1-3 progress_1.2.2 zlibbioc_1.40.0
 [81] purrr_0.3.4 RCurl_1.98-1.6 prettyunits_1.1.1 rpart_4.1.16
 [85] viridis_0.6.2 S4Vectors_0.32.3 SummarizedExperiment_1.24.0 ggrepel_0.9.1
 [89] cluster_2.1.2 fs_1.5.2 crul_1.2.0 magrittr_2.0.2
 [93] data.table_1.14.2 echotabix_0.99.5 dnet_1.1.7 openxlsx_4.2.5
 [97] gh_1.3.0 mvtnorm_1.1-3 ProtGenerics_1.26.0 matrixStats_0.61.0
[101] evaluate_0.15 patchwork_1.1.1 hms_1.1.1 XML_3.99-0.9
[105] jpeg_0.1-9 IRanges_2.28.0 gridExtra_2.3 compiler_4.1.0
[109] biomaRt_2.50.3 tibble_3.1.6 crayon_1.5.0 R.oo_1.24.0
[113] htmltools_0.5.2 echoannot_0.99.4 tzdb_0.2.0 Formula_1.2-4
[117] tidyr_1.2.0 expm_0.999-6 Exact_3.1 lubridate_1.8.0
[121] DBI_1.1.2 dbplyr_2.1.1 MASS_7.3-55 rappdirs_0.3.3
[125] boot_1.3-28 Matrix_1.4-0 readr_2.1.2 piggyback_0.1.1
[129] cli_3.2.0 R.methodsS3_1.8.1 parallel_4.1.0 echofinemap_0.99.0
[133] igraph_1.2.11 GenomicRanges_1.46.1 pkgconfig_2.0.3 GenomicAlignments_1.30.0
[137] RCircos_1.2.2 foreign_0.8-82 osfr_0.2.8 xml2_1.3.3
[141] XVector_0.34.0 echoLD_0.99.1 stringr_1.4.0 VariantAnnotation_1.40.0
[145] digest_0.6.29 graph_1.72.0 httpcode_0.3.0 Biostrings_2.62.0
[149] rmarkdown_2.13 htmlTable_2.4.0 gld_2.6.4 restfulr_0.0.13
[153] curl_4.3.2 Rsamtools_2.10.0 rjson_0.2.21 lifecycle_1.0.1
[157] nlme_3.1-155 jsonlite_1.8.0 viridisLite_0.4.0 BSgenome_1.62.0
[161] fansi_1.0.2 pillar_1.7.0 susieR_0.11.97 lattice_0.20-45
[165] GGally_2.1.2 KEGGREST_1.34.0 fastmap_1.1.0 httr_1.4.2
[169] survival_3.3-1 googleAuthR_2.0.0 glue_1.6.2 zip_2.2.0
[173] png_0.1-7 bit_4.0.4 Rgraphviz_2.38.0 class_7.3-20
[177] stringi_1.7.6 blob_1.2.2 latticeExtra_0.6-29 memoise_2.0.1
[181] dplyr_1.0.8 irlba_2.3.5 e1071_1.7-9 ape_5.6-2
```

Al-Murphy commented 2 years ago

Could have been inherited. @NathanSkene, have you seen NGT before?

bschilder commented 2 years ago

Hey @Al-Murphy, meant to tell you: @NathanSkene and I discussed this in our last meeting, and he said he doesn't recall coming across this column before. So we think it makes sense to remove it from the colmap.
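To illustrate what removing the entry changes, here is a minimal base R sketch on a toy mapping table. The column names `Uncorrected` and `Corrected` are assumed to mirror the layout of the package's `sumstatsColHeaders` object, the rows are illustrative only, and `rename_header()` is a hypothetical stand-in for the real header standardisation, not the package's implementation.

```r
# Toy stand-in for a column-name mapping table (rows are illustrative only;
# the two column names are an assumption about the real mapping's layout).
colmap <- data.frame(
  Uncorrected = c("SNPID", "SAMPLESIZE", "NGT", "PVAL"),
  Corrected   = c("SNP", "N", "N", "P"),
  stringsAsFactors = FALSE
)

# Drop the ambiguous entry so "NGT" is no longer renamed to "N"
colmap_fixed <- colmap[colmap$Uncorrected != "NGT", ]

# Hypothetical minimal renamer: look each header name up in the mapping
# and keep the original name when there is no match.
rename_header <- function(col_names, mapping) {
  idx <- match(toupper(col_names), mapping$Uncorrected)
  ifelse(is.na(idx), col_names, mapping$Corrected[idx])
}

header <- c("SNPID", "NGT", "PVAL")
before <- rename_header(header, colmap)        # "NGT" is mapped to "N"
after  <- rename_header(header, colmap_fixed)  # "NGT" is left untouched
print(before)
print(after)
```

On package versions that still contain the entry, a filtered copy of the mapping could presumably be supplied through the `mapping_file` argument of `standardise_header()`; that is an assumption about the API and is not tested here.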

Al-Murphy commented 2 years ago

Removed!