Closed AMCalejandro closed 2 years ago
Ah so there is a parameter for cases where the MarkerName here i.e. the SNP ID isn't an rs id. So just run with the following parameter to avoid the error:
#using the data you provided
reformatted <- MungeSumstats::format_sumstats(path=fullSS_path,
ref_genome="GRCh37",
snp_ids_are_rs_ids=FALSE)
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp44rgjJ/file16a0561be1ce4.tsv.gz
Importing tabular file: ~/Downloads/test.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file:
V1 MarkerName CHR BP Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P-value Direction HetISq HetChiSq HetDf HetPVal TotalSampleSize
Summary statistics report:
- 6 rows
- 6 unique variants
- 4 genome-wide significant variants (P<5e-8)
- 3 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 0 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 0 seconds.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
1 SNPs (16.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp44rgjJ/file16a0561be1ce4.tsv.gz
Summary statistics report:
- 6 rows (100% of original 6 rows)
- 6 unique variants
- 4 genome-wide significant variants (P<5e-8)
- 3 chromosomes
Done munging in 0.047 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 V1 ID FRQ FRQSE FRQMIN FRQMAX BETA SE P DIRECTION HETISQT
1: rs72673189 1 53778300 A G 4: 1:53778300 0.0233 0.0068 0.0184 0.0383 1.0163 0.1796 1.527e-08 ++++? 0.0
2: rs79432789 1 168645690 A G 5: 1:168645690 0.0466 0.0052 0.0333 0.0548 0.7671 0.1423 7.037e-08 +++++ 6.6
3: rs75422764 1 168645707 A G 6: 1:168645707 0.0465 0.0051 0.0336 0.0548 0.7661 0.1423 7.353e-08 +++++ 7.5
4: rs150874134 4 32376657 A T 2: 4:32376657 0.0188 0.0026 0.0168 0.0233 1.1262 0.1895 2.812e-09 +++++ 17.2
HETCHISQ HETDF HETPVAL NSTUDY
1: 2.359 3 0.5013 2610
2: 4.284 4 0.3690 2687
3: 4.325 4 0.3638 2687
4: 4.829 4 0.3053 2687
Returning path to saved data.
I'm adding a fix now so that the code will run up to the point where an appropriate error is produced in future. Thanks for the note.
Alan.
1. Bug description
The tool fails to perform the munging over chr:bp input data
Console output
2. Reproducible example
Code
Data
3. Request
I think it would be quite useful being able to specify the format of the SNP columnd ( either chr:bp or rsid), and internally perform the conversion
4. Session info
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base
other attached packages: [1] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 BSgenome_1.65.2 rtracklayer_1.57.0 Biostrings_2.65.3
[5] XVector_0.37.1 GenomicRanges_1.49.1 GenomeInfoDb_1.33.5 IRanges_2.31.2
[9] S4Vectors_0.35.3 BiocGenerics_0.43.1 forcats_0.5.2 stringr_1.4.1
[13] dplyr_1.0.10 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[17] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2 data.table_1.14.2
[21] echolocatoR_2.0.1
loaded via a namespace (and not attached): [1] rappdirs_0.3.3 GGally_2.1.2 R.methodsS3_1.8.2
[4] echoLD_0.99.7 bit64_4.0.5 knitr_1.40
[7] irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0
[10] rpart_4.1.16 KEGGREST_1.37.3 RCurl_1.98-1.8
[13] AnnotationFilter_1.21.0 generics_0.1.3 GenomicFeatures_1.49.6
[16] callr_3.7.2 usethis_2.1.6 RSQLite_2.2.16
[19] proxy_0.4-27 chron_2.3-57 bit_4.0.4
[22] tzdb_0.3.0 httpuv_1.6.5 xml2_1.3.3
[25] lubridate_1.8.0 SummarizedExperiment_1.27.2 assertthat_0.2.1
[28] viridis_0.6.2 gargle_1.2.0 xfun_0.32
[31] hms_1.1.2 promises_1.2.0.1 evaluate_0.16
[34] fansi_1.0.3 restfulr_0.0.15 progress_1.2.2
[37] dbplyr_2.2.1 readxl_1.4.1 Rgraphviz_2.41.1
[40] igraph_1.3.4 DBI_1.1.3 htmlwidgets_1.5.4
[43] reshape_0.8.9 downloadR_0.99.4 googledrive_2.0.0
[46] ellipsis_0.3.2 backports_1.4.1 biomaRt_2.53.2
[49] deldir_1.0-6 MatrixGenerics_1.9.1 MungeSumstats_1.5.13
[52] vctrs_0.4.1 Biobase_2.57.1 remotes_2.4.2
[55] ensembldb_2.21.4 cachem_1.0.6 withr_2.5.0
[58] checkmate_2.1.0 GenomicAlignments_1.33.1 prettyunits_1.1.1
[61] cluster_2.1.3 ape_5.6-2 dir.expiry_1.5.0
[64] lazyeval_0.2.2 crayon_1.5.1 basilisk.utils_1.9.2
[67] crul_1.2.0 pkgconfig_2.0.3 pkgload_1.3.0
[70] nlme_3.1-159 ProtGenerics_1.29.0 XGR_1.1.8
[73] devtools_2.4.4 nnet_7.3-17 rlang_1.0.5
[76] miniUI_0.1.1.1 lifecycle_1.0.1 filelock_1.0.2
[79] httpcode_0.3.0 BiocFileCache_2.5.0 modelr_0.1.9
[82] echotabix_0.99.8 dichromat_2.0-0.1 rprojroot_2.0.3
[85] cellranger_1.1.0 coloc_5.1.0 matrixStats_0.62.0
[88] graph_1.75.0 Matrix_1.4-1 osfr_0.2.8
[91] boot_1.3-28 reprex_2.0.2 base64enc_0.1-3
[94] processx_3.7.0 googlesheets4_1.0.1 png_0.1-7
[97] viridisLite_0.4.1 rjson_0.2.21 rootSolve_1.8.2.3
[100] bitops_1.0-7 R.oo_1.25.0 ggnetwork_0.5.10
[103] blob_1.2.3 mixsqp_0.3-43 echoplot_0.99.5
[106] dnet_1.1.7 jpeg_0.1-9 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [109] echodata_0.99.12 scales_1.2.1 memoise_2.0.1
[112] magrittr_2.0.3 plyr_1.8.7 hexbin_1.28.2
[115] zlibbioc_1.43.0 compiler_4.2.0 echoconda_0.99.7
[118] BiocIO_1.7.1 RColorBrewer_1.1-3 catalogueR_1.0.0
[121] Rsamtools_2.13.4 cli_3.3.0 urlchecker_1.0.1
[124] echoannot_0.99.7 ps_1.7.1 patchwork_1.1.2
[127] htmlTable_2.4.1 Formula_1.2-4 MASS_7.3-58.1
[130] tidyselect_1.1.2 stringi_1.7.8 yaml_2.3.5
[133] supraHex_1.35.0 latticeExtra_0.6-30 ggrepel_0.9.1
[136] grid_4.2.0 VariantAnnotation_1.43.3 tools_4.2.0
[139] lmom_2.9 parallel_4.2.0 rstudioapi_0.14
[142] foreign_0.8-82 piggyback_0.1.3 gridExtra_2.3
[145] gld_2.6.5 digest_0.6.29 snpStats_1.47.1
[148] BiocManager_1.30.18 shiny_1.7.2 proto_1.0.0
[151] Rcpp_1.0.9 broom_1.0.1 later_1.3.0
[154] OrganismDbi_1.39.1 httr_1.4.4 AnnotationDbi_1.59.1
[157] RCircos_1.2.2 ggbio_1.45.0 biovizBase_1.45.0
[160] colorspace_2.0-3 brio_1.1.3 rvest_1.0.3
[163] XML_3.99-0.10 fs_1.5.2 reticulate_1.26
[166] splines_4.2.0 RBGL_1.73.0 expm_0.999-6
[169] echofinemap_0.99.3 basilisk_1.9.2 Exact_3.1
[172] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22 sessioninfo_1.2.2 xtable_1.8-4
[175] jsonlite_1.8.0 testthat_3.1.4 susieR_0.12.27
[178] R6_2.5.1 profvis_0.3.7 Hmisc_4.7-1
[181] gsubfn_0.7 mime_0.12 pillar_1.8.1
[184] htmltools_0.5.3 glue_1.6.2 fastmap_1.1.0
[187] DT_0.24 BiocParallel_1.31.12 class_7.3-20
[190] codetools_0.2-18 pkgbuild_1.3.1 mvtnorm_1.1-3
[193] utf8_1.2.2 lattice_0.20-45 sqldf_0.4-11
[196] curl_4.3.2 DescTools_0.99.46 zip_2.2.0
[199] openxlsx_4.2.5 interp_1.1-3 survival_3.3-1
[202] rmarkdown_2.16 desc_1.4.1 googleAuthR_2.0.0
[205] munsell_0.5.0 e1071_1.7-11 GenomeInfoDbData_1.2.8
[208] haven_2.5.1 reshape2_1.4.4 gtable_0.3.1