Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

MT chromosome value converted to NA #116

Closed ilevantis closed 2 years ago

ilevantis commented 2 years ago

1. Bug description

Chromosome value for SNPs on MT chromosome are converted to NA at some point during format_sumstats.

Console output

Using MungeSumstats_1.5.8:

> mini_df
       rsid chrom   pos other_allele effect_allele   beta
1 rs2853502    MT 13276            A             G 0.0168
  effect_allele_frequency standard_error p_value
1                   0.563         0.0507  0.7409
> MSS_results2 <- MungeSumstats::format_sumstats(path=mini_df,
                                                  ref_genome = "GRCH38",
                                                  INFO_filter = 0,
                                                  pos_se = FALSE,
                                                  compute_n = 92355,
                                                  N_dropNA = FALSE,
                                                  bi_allelic_filter = FALSE,
                                                  allele_flip_frq = FALSE,
                                                  rmv_chr = NULL,
                                                  rmv_chrPrefix = TRUE,
                                                  dbSNP = 155,
                                                  return_data = TRUE,
                                                  return_format = "data.table",
                                                  imputation_ind = TRUE,
                                                  nThread = 4,
                                                  log_folder='mss2_log',
                                                  log_mungesumstats_msgs=TRUE,
                                                  log_folder_ind=TRUE)
Returning data directly.
> MSS_results2$sumstats
         SNP  CHR    BP A1 A2   BETA   FRQ     SE      P     N
1: rs2853502 <NA> 13276  A  G 0.0168 0.563 0.0507 0.7409 92355

Expected behaviour

CHR column output is MT for mitochondrial SNPs.

2. Example

mini_df <- data.frame(rsid='rs2853502', chrom='MT', pos=13276, other_allele='A', effect_allele='G', beta=0.0168, effect_allele_frequency=0.563, standard_error=0.0507, p_value=0.7409)
MSS_results2 <- MungeSumstats::format_sumstats(path=mini_df,
                                                  ref_genome = "GRCH38",
                                                  INFO_filter = 0,
                                                  pos_se = FALSE,
                                                  compute_n = 92355,
                                                  N_dropNA = FALSE,
                                                  bi_allelic_filter = FALSE,
                                                  allele_flip_frq = FALSE,
                                                  rmv_chr = NULL,
                                                  rmv_chrPrefix = TRUE,
                                                  dbSNP = 155,
                                                  return_data = TRUE,
                                                  return_format = "data.table",
                                                  imputation_ind = TRUE,
                                                  nThread = 4,
                                                  log_folder='mss2_log',
                                                  log_mungesumstats_msgs=TRUE,
                                                  log_folder_ind=TRUE)
Returning data directly.

3. Session info

``` > sessionInfo() R version 4.2.0 (2022-04-22) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Debian GNU/Linux 10 (buster) Matrix products: default BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.20.so locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] dplyr_1.0.9 VariantAnnotation_1.43.2 [3] Rsamtools_2.13.3 Biostrings_2.65.1 [5] XVector_0.37.0 SummarizedExperiment_1.27.1 [7] Biobase_2.57.1 GenomicRanges_1.49.0 [9] GenomeInfoDb_1.33.3 IRanges_2.31.0 [11] S4Vectors_0.35.1 MatrixGenerics_1.9.1 [13] matrixStats_0.62.0 BiocGenerics_0.43.1 [15] MungeSumstats_1.5.8 data.table_1.14.2 [17] optparse_1.7.3 loaded via a namespace (and not attached): [1] httr_1.4.3 [2] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [3] jsonlite_1.8.0 [4] bit64_4.0.5 [5] R.utils_2.12.0 [6] assertthat_0.2.1 [7] BiocFileCache_2.5.0 [8] blob_1.2.3 [9] BSgenome_1.65.2 [10] GenomeInfoDbData_1.2.8 [11] yaml_2.3.5 [12] progress_1.2.2 [13] pillar_1.8.0 [14] RSQLite_2.2.15 [15] lattice_0.20-45 [16] glue_1.6.2 [17] digest_0.6.29 [18] googleAuthR_2.0.0 [19] Matrix_1.4-1 [20] R.oo_1.25.0 [21] XML_3.99-0.9 [22] pkgconfig_2.0.3 [23] biomaRt_2.53.2 [24] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 [25] zlibbioc_1.43.0 [26] purrr_0.3.4 [27] getopt_1.20.3 [28] BiocParallel_1.31.12 [29] tibble_3.1.8 [30] KEGGREST_1.37.3 [31] generics_0.1.3 [32] ellipsis_0.3.2 [33] cachem_1.0.6 [34] GenomicFeatures_1.49.5 [35] cli_3.3.0 [36] magrittr_2.0.3 [37] crayon_1.5.1 [38] memoise_2.0.1 [39] R.methodsS3_1.8.2 [40] fs_1.5.2 [41] fansi_1.0.3 [42] xml2_1.3.3 [43] tools_4.2.0 [44] prettyunits_1.1.1 [45] hms_1.1.1 [46] BiocIO_1.7.1 [47] gargle_1.2.0 [48] lifecycle_1.0.1 [49] stringr_1.4.0 [50] DelayedArray_0.23.1 [51] AnnotationDbi_1.59.1 [52] compiler_4.2.0 [53] rlang_1.0.4 [54] grid_4.2.0 [55] RCurl_1.98-1.8 [56] rjson_0.2.21 [57] rappdirs_0.3.3 [58] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22 [59] bitops_1.0-7 [60] SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.22 [61] restfulr_0.0.15 [62] codetools_0.2-18 [63] DBI_1.1.3 [64] curl_4.3.2 [65] R6_2.5.1 [66] GenomicAlignments_1.33.1 [67] rtracklayer_1.57.0 [68] fastmap_1.1.0 [69] bit_4.0.4 [70] utf8_1.2.2 [71] filelock_1.0.2 [72] stringi_1.7.6 [73] parallel_4.2.0 [74] Rcpp_1.0.9 [75] vctrs_0.4.1 [76] png_0.1-7 [77] dbplyr_2.2.1 [78] tidyselect_1.1.2 ```
Al-Murphy commented 2 years ago

Thanks for this, the issue was with the sort_coords function which didn't include MT as a factor. Surprisingly you are the first to notice (or maybe just mention) this. I guess it's rare that people are working with mitochondrial SNPs.

I've made a fix in version 1.5.12.

Cheers, Alan.