Column name handling: Freq vs. MAF

bschilder commented 2 years ago

I noticed that if the data has both the columns "frequency" and "MAF", the latter is renamed to "FRQ" while the former is just made uppercase.

I think this is because both are technically mapped to "FRQ", but it makes sense to prioritize cols whose names are closer to the term "frequency" as the "FRQ" col. Only when none of these are available should we go ahead and rename MAF --> FRQ.

I can't recall, but perhaps this is already done within the full pipeline format_sumstats. So maybe I just need to update the standardise_header function? Not sure if that will screw anything up for the full pipeline.

Reprex

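# Load the unformatted example summary statistics
dat <- MungeSumstats::formatted_example(formatted = FALSE)
# Create two columns that both map to "FRQ": "frequency" and "maf"
data.table::setnames(dat, "EAF", "frequency")
dat$maf <- runif(nrow(dat), min = 0, max = 1)

dat2 <- MungeSumstats::standardise_header(dat, return_list = FALSE)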

Session info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] echodata_0.99.7

loaded via a namespace (and not attached):
  [1] fs_1.5.2                    bitops_1.0-7                matrixStats_0.61.0
  [4] lubridate_1.8.0             bit64_4.0.5                 filelock_1.0.2
  [7] progress_1.2.2              httr_1.4.2                  googleAuthR_2.0.0
 [10] rprojroot_2.0.2             GenomeInfoDb_1.30.1         gh_1.3.0
 [13] tools_4.1.0                 utf8_1.2.2                  R6_2.5.1
 [16] DT_0.21                     DBI_1.1.2                   BiocGenerics_0.40.0
 [19] withr_2.5.0                 tidyselect_1.1.2            prettyunits_1.1.1
 [22] bit_4.0.4                   curl_4.3.2                  compiler_4.1.0
 [25] cli_3.2.0                   Biobase_2.54.0              xml2_1.3.3
 [28] desc_1.4.1                  DelayedArray_0.20.0         rtracklayer_1.54.0
 [31] readr_2.1.2                 rappdirs_0.3.3              stringr_1.4.0
 [34] digest_0.6.29               Rsamtools_2.10.0            piggyback_0.1.1
 [37] R.utils_2.11.0              XVector_0.34.0              pkgconfig_2.0.3
 [40] htmltools_0.5.2             MatrixGenerics_1.6.0        dbplyr_2.1.1
 [43] fastmap_1.1.0               BSgenome_1.62.0             htmlwidgets_1.5.4
 [46] rlang_1.0.2                 rstudioapi_0.13             RSQLite_2.2.10
 [49] BiocIO_1.4.0                generics_0.1.2              jsonlite_1.8.0
 [52] BiocParallel_1.28.3         zip_2.2.0                   dplyr_1.0.8
 [55] R.oo_1.24.0                 VariantAnnotation_1.40.0    RCurl_1.98-1.6
 [58] magrittr_2.0.2              GenomeInfoDbData_1.2.7      Matrix_1.4-0
 [61] Rcpp_1.0.8.3                S4Vectors_0.32.3            fansi_1.0.2
 [64] lifecycle_1.0.1             R.methodsS3_1.8.1           stringi_1.7.6
 [67] yaml_2.3.5                  SummarizedExperiment_1.24.0 zlibbioc_1.40.0
 [70] brio_1.1.3                  BiocFileCache_2.2.1         grid_4.1.0
 [73] blob_1.2.2                  parallel_4.1.0              crayon_1.5.0
 [76] lattice_0.20-45             Biostrings_2.62.0           GenomicFeatures_1.46.5
 [79] hms_1.1.1                   KEGGREST_1.34.0             MungeSumstats_1.3.14
 [82] pillar_1.7.0                GenomicRanges_1.46.1        rjson_0.2.21
 [85] clisymbols_1.2.0            biomaRt_2.50.3              stats4_4.1.0
 [88] pkgload_1.2.4               XML_3.99-0.9                glue_1.6.2
 [91] data.table_1.14.2           tzdb_0.2.0                  png_0.1-7
 [94] vctrs_0.3.8                 testthat_3.1.2              tidyr_1.2.0
 [97] purrr_0.3.4                 assertthat_0.2.1            cachem_1.0.6
[100] openxlsx_4.2.5              restfulr_0.0.13             gargle_1.2.0
[103] tibble_3.1.6                GenomicAlignments_1.30.0    AnnotationDbi_1.56.2
[106] memoise_2.0.1               IRanges_2.28.0              ellipsis_0.3.2
```

Al-Murphy commented 2 years ago

I'm not sure about that approach. The order is defined by the mapping file. Maybe a better option would be to warn the user if they have multiple columns that match the same mapped value, tell them which one will be taken, and tell them to edit the mapping file if this is not the correct one? I don't think we can assume an order of priority here.
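For illustration only, a minimal base R sketch of what such a warning could look like. It assumes a toy mapping table; the names `colmap`, `uncorrected`, `corrected`, and `find_ambiguous` are hypothetical and not taken from MungeSumstats.

```r
# Toy mapping table (hypothetical, NOT the package's mapping file):
# several raw header names map to the same standard name.
colmap <- data.frame(
  uncorrected = c("MAF", "FREQUENCY", "FRQ", "SNP", "P"),
  corrected   = c("FRQ", "FRQ", "FRQ", "SNP", "P"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find standard names claimed by more than one input
# column, warn about each, and return the competing raw names.
find_ambiguous <- function(headers, colmap) {
  # Keep only mapping rows whose raw name is present in the data
  hits <- colmap[colmap$uncorrected %in% toupper(headers), ]
  # Group raw names by target; within a group, mapping-file order is kept
  groups <- split(hits$uncorrected, hits$corrected)
  groups <- groups[lengths(groups) > 1]
  for (target in names(groups)) {
    warning(sprintf(
      "Columns %s all map to '%s'; '%s' will be used. Edit the mapping file if this is not the correct one.",
      paste(groups[[target]], collapse = ", "), target, groups[[target]][1]
    ), call. = FALSE)
  }
  groups
}

# Emits one warning, for the "FRQ" target
ambiguous <- find_ambiguous(c("snp", "frequency", "maf", "p"), colmap)
```

The returned list makes the ambiguity inspectable in code, while the warning tells the user which column wins under the current mapping order.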

bschilder commented 2 years ago

> I'm not sure on that approach. The order is defined by the mapping file, maybe a better option would be to warn the user if they have multiple values that match the mapped value, tell them which one will be taken and inform them to edit the mapping file if this is not the correct one? I don't think we can assume an order of priority here

True, but aren't we already kind of assuming a priority simply by ordering them (albeit arbitrarily) in the colmap file? Perhaps the best thing would be to just move up "FREQUENCY" in the colmap so that it's the first hit by default, but still provide the user a warning message about the ambiguity?
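To make the effect of the ordering concrete, here is a rough base R sketch, assuming a "first hit in mapping order wins" rule. The helper `standardise_first_hit` and the toy tables are hypothetical and only mimic the behaviour described in this thread; this is not the package's implementation.

```r
# Hypothetical helper: for each standard name, rename only the first input
# column (in mapping order) that maps to it; other columns are just upper-cased.
standardise_first_hit <- function(headers, colmap) {
  upper <- toupper(headers)
  out <- upper
  taken <- character(0)  # standard names already assigned
  for (i in seq_len(nrow(colmap))) {
    idx <- match(colmap$uncorrected[i], upper)
    target <- colmap$corrected[i]
    if (!is.na(idx) && !(target %in% taken)) {
      out[idx] <- target
      taken <- c(taken, target)
    }
  }
  out
}

headers <- c("SNP", "frequency", "maf")

# Toy mapping with MAF listed before FREQUENCY
maf_first <- data.frame(
  uncorrected = c("MAF", "FREQUENCY"),
  corrected   = c("FRQ", "FRQ"),
  stringsAsFactors = FALSE
)
# Same mapping with FREQUENCY moved up
freq_first <- maf_first[c(2, 1), ]

res_maf_first  <- standardise_first_hit(headers, maf_first)
res_freq_first <- standardise_first_hit(headers, freq_first)
```

With `maf_first`, the "maf" column becomes "FRQ" and "frequency" is only upper-cased, as in the reprex; with `freq_first`, "frequency" becomes "FRQ" and "maf" stays "MAF".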

Al-Murphy commented 2 years ago

Yeah that sounds reasonable!

Al-Murphy commented 2 years ago

Updated order of FREQUENCY and MAF

Al-Murphy / MungeSumstats

Column name handling: Freq vs. MAF #95