Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Error in ids2rowids(ids) : cannot extract the digital part of some SNP ids in 'ids' #123

Closed AMCalejandro closed 2 years ago

AMCalejandro commented 2 years ago

1. Bug description

The tool fails to perform the munging over chr:bp input data

Console output

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /tmp/Rtmp4OEijF/file2efc20555fbc.tsv.gz
Reading header.
Tabular format detected.
Importing tabular file: ~/echolocatoR/echolocatoR_LID/QC_metal_tpd.opdc.ppmi.pdstat.pdbp_BESTMODEL_MOTORBL_stderr_1.tbl
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
MarkerName  CHR BP  Allele1 Allele2 Freq1   FreqSE  MinFreq MaxFreq Effect  StdErr  P-value Direction   HetISq  HetChiSq    HetDf   HetPVal TotalSampleSize MAF_variability 
Summary statistics report:
   - 6,548,100 rows
   - 6,548,100 unique variants
   - 4 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Warning in withCallingHandlers(expr, message = function(c) if (inherits(c,  :
  NAs introduced by coercion
Validating RSIDs of 3,922,769 SNPs using BSgenome::snpsById...
Error in ids2rowids(ids) : 
  cannot extract the digital part of some SNP ids in 'ids'
Timing stopped at: 0.915 0.012 0.926

2. Reproducible example

Code

> reformatted <- 
+   MungeSumstats::format_sumstats(path=fullSS_path,
+                                  ref_genome="GRCh37")

Data

> head(fread(fullSS_path))
|--------------------------------------------------|
|==================================================|
    MarkerName CHR        BP Allele1 Allele2  Freq1 FreqSE MinFreq MaxFreq  Effect StdErr   P-value Direction HetISq HetChiSq HetDf HetPVal TotalSampleSize
1:  4:32435284   4  32435284       a       g 0.0194 0.0025  0.0168  0.0250  1.1245 0.1866 1.673e-09     +++++   21.4    5.087     4  0.2784            2687
2:  4:32376657   4  32376657       a       t 0.0188 0.0026  0.0168  0.0233  1.1262 0.1895 2.812e-09     +++++   17.2    4.829     4  0.3053            2687
3: 16:17044975  16  17044975       a       g 0.9764 0.0029  0.9664  0.9789 -1.1400 0.1962 6.265e-09     ---++    0.0    0.767     4  0.9428            2687
4:  1:53778300   1  53778300       a       g 0.0233 0.0068  0.0184  0.0383  1.0163 0.1796 1.527e-08     ++++?    0.0    2.359     3  0.5013            2610
5: 1:168645690   1 168645690       a       g 0.0466 0.0052  0.0333  0.0548  0.7671 0.1423 7.037e-08     +++++    6.6    4.284     4  0.3690            2687
6: 1:168645707   1 168645707       a       g 0.0465 0.0051  0.0336  0.0548  0.7661 0.1423 7.353e-08     +++++    7.5    4.325     4  0.3638            2687
   MAF_variability
1:          0.0082
2:          0.0065
3:          0.0125
4:          0.0199
5:          0.0215
6:          0.0212

3. Request

I think it would be quite useful being able to specify the format of the SNP columnd ( either chr:bp or rsid), and internally perform the conversion

4. Session info

sessionInfo() R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 BSgenome_1.65.2 rtracklayer_1.57.0 Biostrings_2.65.3
[5] XVector_0.37.1 GenomicRanges_1.49.1 GenomeInfoDb_1.33.5 IRanges_2.31.2
[9] S4Vectors_0.35.3 BiocGenerics_0.43.1 forcats_0.5.2 stringr_1.4.1
[13] dplyr_1.0.10 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[17] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2 data.table_1.14.2
[21] echolocatoR_2.0.1

loaded via a namespace (and not attached): [1] rappdirs_0.3.3 GGally_2.1.2 R.methodsS3_1.8.2
[4] echoLD_0.99.7 bit64_4.0.5 knitr_1.40
[7] irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0
[10] rpart_4.1.16 KEGGREST_1.37.3 RCurl_1.98-1.8
[13] AnnotationFilter_1.21.0 generics_0.1.3 GenomicFeatures_1.49.6
[16] callr_3.7.2 usethis_2.1.6 RSQLite_2.2.16
[19] proxy_0.4-27 chron_2.3-57 bit_4.0.4
[22] tzdb_0.3.0 httpuv_1.6.5 xml2_1.3.3
[25] lubridate_1.8.0 SummarizedExperiment_1.27.2 assertthat_0.2.1
[28] viridis_0.6.2 gargle_1.2.0 xfun_0.32
[31] hms_1.1.2 promises_1.2.0.1 evaluate_0.16
[34] fansi_1.0.3 restfulr_0.0.15 progress_1.2.2
[37] dbplyr_2.2.1 readxl_1.4.1 Rgraphviz_2.41.1
[40] igraph_1.3.4 DBI_1.1.3 htmlwidgets_1.5.4
[43] reshape_0.8.9 downloadR_0.99.4 googledrive_2.0.0
[46] ellipsis_0.3.2 backports_1.4.1 biomaRt_2.53.2
[49] deldir_1.0-6 MatrixGenerics_1.9.1 MungeSumstats_1.5.13
[52] vctrs_0.4.1 Biobase_2.57.1 remotes_2.4.2
[55] ensembldb_2.21.4 cachem_1.0.6 withr_2.5.0
[58] checkmate_2.1.0 GenomicAlignments_1.33.1 prettyunits_1.1.1
[61] cluster_2.1.3 ape_5.6-2 dir.expiry_1.5.0
[64] lazyeval_0.2.2 crayon_1.5.1 basilisk.utils_1.9.2
[67] crul_1.2.0 pkgconfig_2.0.3 pkgload_1.3.0
[70] nlme_3.1-159 ProtGenerics_1.29.0 XGR_1.1.8
[73] devtools_2.4.4 nnet_7.3-17 rlang_1.0.5
[76] miniUI_0.1.1.1 lifecycle_1.0.1 filelock_1.0.2
[79] httpcode_0.3.0 BiocFileCache_2.5.0 modelr_0.1.9
[82] echotabix_0.99.8 dichromat_2.0-0.1 rprojroot_2.0.3
[85] cellranger_1.1.0 coloc_5.1.0 matrixStats_0.62.0
[88] graph_1.75.0 Matrix_1.4-1 osfr_0.2.8
[91] boot_1.3-28 reprex_2.0.2 base64enc_0.1-3
[94] processx_3.7.0 googlesheets4_1.0.1 png_0.1-7
[97] viridisLite_0.4.1 rjson_0.2.21 rootSolve_1.8.2.3
[100] bitops_1.0-7 R.oo_1.25.0 ggnetwork_0.5.10
[103] blob_1.2.3 mixsqp_0.3-43 echoplot_0.99.5
[106] dnet_1.1.7 jpeg_0.1-9 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [109] echodata_0.99.12 scales_1.2.1 memoise_2.0.1
[112] magrittr_2.0.3 plyr_1.8.7 hexbin_1.28.2
[115] zlibbioc_1.43.0 compiler_4.2.0 echoconda_0.99.7
[118] BiocIO_1.7.1 RColorBrewer_1.1-3 catalogueR_1.0.0
[121] Rsamtools_2.13.4 cli_3.3.0 urlchecker_1.0.1
[124] echoannot_0.99.7 ps_1.7.1 patchwork_1.1.2
[127] htmlTable_2.4.1 Formula_1.2-4 MASS_7.3-58.1
[130] tidyselect_1.1.2 stringi_1.7.8 yaml_2.3.5
[133] supraHex_1.35.0 latticeExtra_0.6-30 ggrepel_0.9.1
[136] grid_4.2.0 VariantAnnotation_1.43.3 tools_4.2.0
[139] lmom_2.9 parallel_4.2.0 rstudioapi_0.14
[142] foreign_0.8-82 piggyback_0.1.3 gridExtra_2.3
[145] gld_2.6.5 digest_0.6.29 snpStats_1.47.1
[148] BiocManager_1.30.18 shiny_1.7.2 proto_1.0.0
[151] Rcpp_1.0.9 broom_1.0.1 later_1.3.0
[154] OrganismDbi_1.39.1 httr_1.4.4 AnnotationDbi_1.59.1
[157] RCircos_1.2.2 ggbio_1.45.0 biovizBase_1.45.0
[160] colorspace_2.0-3 brio_1.1.3 rvest_1.0.3
[163] XML_3.99-0.10 fs_1.5.2 reticulate_1.26
[166] splines_4.2.0 RBGL_1.73.0 expm_0.999-6
[169] echofinemap_0.99.3 basilisk_1.9.2 Exact_3.1
[172] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22 sessioninfo_1.2.2 xtable_1.8-4
[175] jsonlite_1.8.0 testthat_3.1.4 susieR_0.12.27
[178] R6_2.5.1 profvis_0.3.7 Hmisc_4.7-1
[181] gsubfn_0.7 mime_0.12 pillar_1.8.1
[184] htmltools_0.5.3 glue_1.6.2 fastmap_1.1.0
[187] DT_0.24 BiocParallel_1.31.12 class_7.3-20
[190] codetools_0.2-18 pkgbuild_1.3.1 mvtnorm_1.1-3
[193] utf8_1.2.2 lattice_0.20-45 sqldf_0.4-11
[196] curl_4.3.2 DescTools_0.99.46 zip_2.2.0
[199] openxlsx_4.2.5 interp_1.1-3 survival_3.3-1
[202] rmarkdown_2.16 desc_1.4.1 googleAuthR_2.0.0
[205] munsell_0.5.0 e1071_1.7-11 GenomeInfoDbData_1.2.8
[208] haven_2.5.1 reshape2_1.4.4 gtable_0.3.1

Al-Murphy commented 2 years ago

Ah so there is a parameter for cases where the MarkerName here i.e. the SNP ID isn't an rs id. So just run with the following parameter to avoid the error:

#using the data you provided
reformatted <- MungeSumstats::format_sumstats(path=fullSS_path,
                                              ref_genome="GRCh37",
                                              snp_ids_are_rs_ids=FALSE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp44rgjJ/file16a0561be1ce4.tsv.gz
Importing tabular file: ~/Downloads/test.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
V1  MarkerName  CHR BP  Allele1 Allele2 Freq1   FreqSE  MinFreq MaxFreq Effect  StdErr  P-value Direction   HetISq  HetChiSq    HetDf   HetPVal TotalSampleSize 
Summary statistics report:
   - 6 rows
   - 6 unique variants
   - 4 genome-wide significant variants (P<5e-8)
   - 3 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 0 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 0 seconds.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
1 SNPs (16.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp44rgjJ/file16a0561be1ce4.tsv.gz
Summary statistics report:
   - 6 rows (100% of original 6 rows)
   - 6 unique variants
   - 4 genome-wide significant variants (P<5e-8)
   - 3 chromosomes
Done munging in 0.047 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR        BP A1 A2 V1          ID    FRQ  FRQSE FRQMIN FRQMAX   BETA     SE         P DIRECTION HETISQT
1:  rs72673189   1  53778300  A  G 4:  1:53778300 0.0233 0.0068 0.0184 0.0383 1.0163 0.1796 1.527e-08     ++++?     0.0
2:  rs79432789   1 168645690  A  G 5: 1:168645690 0.0466 0.0052 0.0333 0.0548 0.7671 0.1423 7.037e-08     +++++     6.6
3:  rs75422764   1 168645707  A  G 6: 1:168645707 0.0465 0.0051 0.0336 0.0548 0.7661 0.1423 7.353e-08     +++++     7.5
4: rs150874134   4  32376657  A  T 2:  4:32376657 0.0188 0.0026 0.0168 0.0233 1.1262 0.1895 2.812e-09     +++++    17.2
   HETCHISQ HETDF HETPVAL NSTUDY
1:    2.359     3  0.5013   2610
2:    4.284     4  0.3690   2687
3:    4.325     4  0.3638   2687
4:    4.829     4  0.3053   2687
Returning path to saved data.

I'm adding a fix now so that the code will run up to the point where an appropriate error is produced in future. Thanks for the note.

Alan.