Crash when imputation_ind = TRUE and chr:bp identifiers are present

1. Bug description

The format_sumstats() function crashes when the imputation_ind parameter is set to TRUE and the GWAS summary statistics file contains some SNP identifiers coded as chr:bp.

I narrowed down the issue to lines 275-276 in check_no_rs_snp.R and inserted browser() right before that to investigate the cause. It turns out that at this point in the code, the miss_rs_chr_bp data table has an IMPUTATION_SNP column while the sumstats_dt data table does not have that column. Thus, the two data tables cannot be concatenated through rbindlist().

This should be an easy fix: either adding the IMPUTATION_SNP column to sumstats_dt or removing it from miss_rs_chr_bp. Since the IMPUTATION_SNP column gets added to miss_rs_chr_bp right before the function errors out and only contains NAs at this point (see lines 269-272), I feel that perhaps it should not be added in the first place. However, I am not familiar with all the logic in check_no_rs_snp() so I decided to open a bug report and let you identify the best solution.
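To make the two options concrete, here is a minimal sketch using toy stand-ins for the two tables. The values and the three-column layout are made up for illustration; only the table names and the IMPUTATION_SNP column come from the actual code. It assumes nothing beyond data.table itself.

```r
library(data.table)

# Toy stand-ins for the two tables (illustrative values, not the package's real data)
sumstats_dt    <- data.table(SNP = "rs10000075", CHR = 4L, BP = 179488911L)
miss_rs_chr_bp <- data.table(SNP = "1:106365417", CHR = 1L, BP = 106365417L,
                             IMPUTATION_SNP = NA)

# The column counts differ, so a plain rbindlist() raises an error;
# tryCatch() captures the message instead of stopping the script
res <- tryCatch(
  rbindlist(list(sumstats_dt, miss_rs_chr_bp)),
  error = function(e) conditionMessage(e)
)
print(res)

# Option A: drop the extra column from miss_rs_chr_bp before binding
keep  <- setdiff(names(miss_rs_chr_bp), "IMPUTATION_SNP")
opt_a <- rbindlist(list(sumstats_dt, miss_rs_chr_bp[, keep, with = FALSE]))

# Option B: add the column to (a copy of) sumstats_dt before binding
sumstats_b <- copy(sumstats_dt)[, IMPUTATION_SNP := NA]
opt_b      <- rbindlist(list(sumstats_b, miss_rs_chr_bp))
```

Passing fill = TRUE to rbindlist(), as the error message suggests, would also make the call succeed, but it would hide any other unintended column mismatch.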

2. Reproducible example

Code

Here is an example demonstrating the error message:

> MungeSumstats::format_sumstats(path = "sample.txt", ref_genome = "GRCH37", imputation_ind = TRUE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /tmp/Rtmpw8PDKm/file288a3614734e.tsv.gz
Importing tabular file: sample.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A2  A1  FRQ FRQSE   FRQMIN  FRQMAX  BETA    SE  P   DIRECTION   N   
Summary statistics report:
   - 10 rows
   - 10 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 4 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
5 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Loading SNPlocs data.
Error in data.table::rbindlist(list(sumstats_dt, miss_rs_chr_bp)) : 
  Item 2 has 16 columns, inconsistent with item 1 which has 15 columns. To fill missing columns use fill=TRUE.
In addition: Warning message:
replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP155.GRCh37’

This shows that the issue is caused by the IMPUTATION_SNP column being present in miss_rs_chr_bp but not in sumstats_dt:

Browse[1]> names(sumstats_dt)
 [1] "SNP"          "CHR"          "BP"           "A2"           "A1"           "FRQ"          "FRQSE"        "FRQMIN"       "FRQMAX"       "BETA"        
[11] "SE"           "P"            "DIRECTION"    "N"            "SNP_old_temp"
Browse[1]> names(miss_rs_chr_bp)
 [1] "SNP"            "CHR"            "BP"             "A2"             "A1"             "FRQ"            "FRQSE"          "FRQMIN"         "FRQMAX"        
[10] "BETA"           "SE"             "P"              "DIRECTION"      "N"              "SNP_old_temp"   "IMPUTATION_SNP"

Data

Here is the sample.txt file used in my example above, which I created by extracting 10 lines from a real GWAS summary statistics file:

SNP CHR BP A2 A1 FRQ FRQSE FRQMIN FRQMAX BETA SE P DIRECTION N
1:106365417 1 106365417 t c 0.0062 0.0019 0.0032 0.0074 0.0451 0.1491 0.7623 ++ 291
2:52080491 2 52080491 a g 0.0119 9e-04 0.011 0.0129 -0.016 0.1026 0.876 -+ 291
3:105018222 3 105018222 t c 0.0053 0.0014 0.0037 0.0065 0.0225 0.1608 0.8889 +- 291
3:133492768 3 133492768 t c 0.0035 2e-04 0.0032 0.0037 -0.0262 0.1845 0.8872 -+ 291
3:81770380 3 81770380 a t 0.9966 2e-04 0.9963 0.9968 0.2256 0.2415 0.3501 ++ 291
rs10000075 4 179488911 t c 0.1044 0.0052 0.0993 0.1097 -0.0757 0.0338 0.02509 -- 291
rs10000076 4 183288360 t g 0.0241 0.0019 0.0221 0.0258 0.0024 0.0761 0.975 +- 291
rs10000078 4 81654981 a g 0.7852 0.002 0.7831 0.7871 -0.0224 0.0276 0.4184 -- 291
rs10000081 4 17348363 t c 0.7479 0.0024 0.7452 0.75 -0.0084 0.0244 0.7316 +- 291
rs10000082 4 167310192 t c 0.0418 0.0027 0.0387 0.0441 -0.0067 0.0568 0.906 +- 291

3. Session info

```R
> utils::sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MungeSumstats_1.9.14

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0                            dplyr_1.1.2               blob_1.2.4
 [4] filelock_1.0.2                              R.utils_2.12.2            Biostrings_2.68.1
 [7] bitops_1.0-7                                fastmap_1.1.1             RCurl_1.98-1.12
[10] BiocFileCache_2.8.0                         VariantAnnotation_1.46.0  GenomicAlignments_1.36.0
[13] XML_3.99-0.14                               digest_0.6.33             lifecycle_1.0.3
[16] KEGGREST_1.40.0                             RSQLite_2.3.1             magrittr_2.0.3
[19] googleAuthR_2.0.1                           compiler_4.3.0            rlang_1.1.1
[22] progress_1.2.2                              tools_4.3.0               utf8_1.2.3
[25] yaml_2.3.7                                  data.table_1.14.8         rtracklayer_1.60.0
[28] prettyunits_1.1.1                           S4Arrays_1.0.5            bit_4.0.5
[31] curl_5.0.1                                  DelayedArray_0.26.6       xml2_1.3.5
[34] abind_1.4-5                                 BiocParallel_1.34.2       BiocGenerics_0.46.0
[37] R.oo_1.25.0                                 grid_4.3.0                stats4_4.3.0
[40] fansi_1.0.4                                 biomaRt_2.56.1            SummarizedExperiment_1.30.2
[43] cli_3.6.1                                   crayon_1.5.2              generics_0.1.3
[46] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 rstudioapi_0.15.0         httr_1.4.6
[49] rjson_0.2.21                                DBI_1.1.3                 cachem_1.0.8
[52] stringr_1.5.0                               zlibbioc_1.46.0           assertthat_0.2.1
[55] parallel_4.3.0                              AnnotationDbi_1.62.2      XVector_0.40.0
[58] restfulr_0.0.15                             matrixStats_1.0.0         vctrs_0.6.3
[61] Matrix_1.6-0                                jsonlite_1.8.7            IRanges_2.34.1
[64] hms_1.1.3                                   S4Vectors_0.38.1          bit64_4.0.5
[67] GenomicFeatures_1.52.1                      glue_1.6.2                codetools_0.2-19
[70] stringi_1.7.12                              GenomeInfoDb_1.36.1       BiocIO_1.10.0
[73] GenomicRanges_1.52.0                        tibble_3.2.1              pillar_1.9.0
[76] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.24    rappdirs_0.3.3            GenomeInfoDbData_1.2.10
[79] BSgenome_1.68.0                             R6_2.5.1                  dbplyr_2.3.3
[82] lattice_0.21-8                              Biobase_2.60.0            R.methodsS3_1.8.2
[85] png_0.1-8                                   Rsamtools_2.16.0          memoise_2.0.1
[88] gargle_1.5.2                                MatrixGenerics_1.12.2     fs_1.6.3
[91] pkgconfig_2.0.3
```

Hey, thanks for digging into this. There are other cases checked for in check_no_rs_snp() that would cause the IMPUTATION_SNP column to be added to sumstats_dt, so I don't want to fully remove the step of adding it to miss_rs_chr_bp. I think it's best just to add a check to see if the column is present in sumstats_dt first. See below for running your example data with the updated version:

> MungeSumstats::format_sumstats(path = "~/Downloads/sample.txt", 
+                                ref_genome = "GRCH37", 
+                                imputation_ind = TRUE,
+                                return_data = TRUE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec037f235f2d.tsv.gz
Importing tabular file: ~/Downloads/sample.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A2  A1  FRQ FRQSE   FRQMIN  FRQMAX  BETA    SE  P   DIRECTION   N   
Summary statistics report:
   - 10 rows
   - 10 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 4 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
5 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 75 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 3 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
4 SNPs are non-biallelic. These will be removed.
1 SNPs (16.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmpq79fRN/fileec037f235f2d.tsv.gz
Summary statistics report:
   - 6 rows (60% of original 10 rows)
   - 6 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 4 chromosomes
Done munging in 1.269 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
            SNP CHR        BP A1 A2    FRQ  FRQSE FRQMIN FRQMAX    BETA     SE      P DIRECTION   N IMPUTATION_SNP flipped
1: rs1419981255   1 106365417  C  T 0.0062 0.0019 0.0032 0.0074  0.0451 0.1491 0.7623        ++ 291           TRUE      NA
2: rs1301730354   2  52080491  G  A 0.0119 0.0009 0.0110 0.0129 -0.0160 0.1026 0.8760        -+ 291           TRUE      NA
3: rs1231570724   3  81770380  A  T 0.0034 0.0002 0.9963 0.9968 -0.2256 0.2415 0.3501        ++ 291           TRUE    TRUE
4:   rs10000078   4  81654981  A  G 0.2148 0.0020 0.7831 0.7871  0.0224 0.0276 0.4184        -- 291             NA    TRUE
Returning data directly.
            SNP CHR        BP A1 A2    FRQ  FRQSE FRQMIN FRQMAX    BETA     SE      P DIRECTION   N IMPUTATION_SNP flipped
1: rs1419981255   1 106365417  C  T 0.0062 0.0019 0.0032 0.0074  0.0451 0.1491 0.7623        ++ 291           TRUE      NA
2: rs1301730354   2  52080491  G  A 0.0119 0.0009 0.0110 0.0129 -0.0160 0.1026 0.8760        -+ 291           TRUE      NA
3: rs1231570724   3  81770380  A  T 0.0034 0.0002 0.9963 0.9968 -0.2256 0.2415 0.3501        ++ 291           TRUE    TRUE
4:   rs10000078   4  81654981  A  G 0.2148 0.0020 0.7831 0.7871  0.0224 0.0276 0.4184        -- 291             NA    TRUE
5:   rs10000082   4 167310192  C  T 0.0418 0.0027 0.0387 0.0441 -0.0067 0.0568 0.9060        +- 291             NA      NA
6:   rs10000076   4 183288360  T  G 0.9759 0.0019 0.0221 0.0258 -0.0024 0.0761 0.9750        +- 291             NA    TRUE

This is available in v1.9.15. Let me know if there are any other issues!
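For readers who want a feel for the kind of check described in the reply, here is a rough sketch of one way it could look. This is only a guess at the idea, not the code that shipped in v1.9.15: the helper name bind_chr_bp_rows() is hypothetical, and the toy tables use made-up values with only three base columns.

```r
library(data.table)

# Hypothetical helper illustrating the guard; not the actual MungeSumstats patch
bind_chr_bp_rows <- function(sumstats_dt, miss_rs_chr_bp) {
  miss <- copy(miss_rs_chr_bp)  # avoid modifying the caller's table by reference
  # Only add the placeholder column when sumstats_dt already carries it,
  # so both tables have the same set of columns before binding
  if ("IMPUTATION_SNP" %in% names(sumstats_dt)) {
    miss[, IMPUTATION_SNP := NA]
  }
  rbindlist(list(sumstats_dt, miss), use.names = TRUE)
}

# Case 1: sumstats_dt has no IMPUTATION_SNP column (the situation in the report)
a <- data.table(SNP = "rs10000075", CHR = 4L, BP = 179488911L)
b <- data.table(SNP = "1:106365417", CHR = 1L, BP = 106365417L)
out1 <- bind_chr_bp_rows(a, b)

# Case 2: an earlier check has already added the column to sumstats_dt
a2   <- copy(a)[, IMPUTATION_SNP := TRUE]
out2 <- bind_chr_bp_rows(a2, b)
```

In both cases the two tables reach rbindlist() with matching columns, which is the property the original call was missing.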
