Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 15 forks source link

`NA` values lead to `rbindlist` error when `imputation_ind = TRUE` #150

Closed JustinCope closed 1 year ago

JustinCope commented 1 year ago

1. Bug description

When data contains NA values for chromosome and position, these are successfully imputed. However, when imputation_ind is TRUE, munging fails with error.

Console output

Success:

Formatted summary statistics will be saved to ==>  test_1.tsv
Standardising column headers.
First line of summary statistics file:
SNP CHR BP  EFFECT_ALLELE   REFERENCE_ALLELE    FREQ    BETA    SE  P
Summary statistics report:
   - 2 rows
   - 2 unique variants
   - 2 genome-wide significant variants (P<5e-8)
   - 2 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Loading SNPlocs data.
There is no Chromosome or Base Pair Position column found within the data. It must be inferred from other column information.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 11 seconds.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 7 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 1 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
1 SNPs (50%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> test_1.tsv
Summary statistics report:
   - 2 rows (100% of original 2 rows)
   - 2 unique variants
   - 2 genome-wide significant variants (P<5e-8)
   - 2 chromosomes
Done munging in 0.307 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
         SNP CHR        BP A1 A2 FRQ  BETA   SE     P
1: rs9442372   1   1018704  A  G 0.6  0.01 0.01 1e-08
2: rs1000000  12 126890980  G  A 0.4 -0.01 0.01 1e-08
Returning path to saved data.
[1] "test_1.tsv"

Failure:

Formatted summary statistics will be saved to ==>  test_2.tsv
Standardising column headers.
First line of summary statistics file:
SNP CHR BP  EFFECT_ALLELE   REFERENCE_ALLELE    FREQ    BETA    SE  P
Summary statistics report:
   - 2 rows
   - 2 unique variants
   - 2 genome-wide significant variants (P<5e-8)
   - 2 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Loading SNPlocs data.
There is no Chromosome or Base Pair Position column found within the data. It must be inferred from other column information.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 1 seconds.
Error in rbindlist(list(sumstats_dt, na_chr_bp$sumstats_dt)) :
  Item 2 has 11 columns, inconsistent with item 1 which has 9 columns. To fill missing columns use fill=TRUE.

Expected behaviour

Both tests should succeed, and the output for the second should include additional columns indicating that CHR and BP were imputed for the row with NAs.

2. Reproducible example

Code

d <- data.frame(
    SNP = c("rs1000000", "rs9442372"),
    CHR = c(12, NA),
    BP = c(126890980, NA),
    EFFECT_ALLELE = "A",
    REFERENCE_ALLELE = "G",
    FREQ = 0.4,
    BETA = -0.01,
    SE = 0.01,
    P = 1e-8
)

MungeSumstats::format_sumstats(
    path = d,
    ref_genome = "GRCh37",
    dbSNP = 144,
    save_path = "test_1.tsv",
)

MungeSumstats::format_sumstats(
    path = d,
    ref_genome = "GRCh37",
    dbSNP = 144,
    save_path = "test_2.tsv",
    imputation_ind = TRUE
)

Data

See dummy data above.

3. Session info

``` R version 4.2.2 (2022-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2 [4] BiocGenerics_0.44.0 MungeSumstats_1.7.8 loaded via a namespace (and not attached): [1] MatrixGenerics_1.10.0 [2] Biobase_2.58.0 [3] httr_1.4.5 [4] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [5] bit64_4.0.5 [6] jsonlite_1.8.4 [7] R.utils_2.12.2 [8] assertthat_0.2.1 [9] BiocFileCache_2.6.1 [10] blob_1.2.4 [11] BSgenome_1.66.3 [12] GenomeInfoDbData_1.2.9 [13] Rsamtools_2.14.0 [14] yaml_2.3.7 [15] progress_1.2.2 [16] pillar_1.9.0 [17] RSQLite_2.3.1 [18] lattice_0.20-45 [19] glue_1.6.2 [20] munger_1.0.0 [21] digest_0.6.31 [22] GenomicRanges_1.50.2 [23] XVector_0.38.0 [24] googleAuthR_2.0.1 [25] Matrix_1.5-1 [26] R.oo_1.25.0 [27] XML_3.99-0.14 [28] pkgconfig_2.0.3 [29] logger_0.2.2 [30] biomaRt_2.54.1 [31] zlibbioc_1.44.0 [32] BiocParallel_1.32.6 [33] tibble_3.2.1 [34] KEGGREST_1.38.0 [35] generics_0.1.3 [36] cachem_1.0.7 [37] SummarizedExperiment_1.28.0 [38] GenomicFeatures_1.50.4 [39] cli_3.6.1 [40] magrittr_2.0.3 [41] crayon_1.5.2 [42] memoise_2.0.1 [43] R.methodsS3_1.8.2 [44] fs_1.6.1 [45] fansi_1.0.4 [46] xml2_1.3.3 [47] tools_4.2.2 [48] data.table_1.14.8 [49] prettyunits_1.1.1 [50] hms_1.1.3 [51] BiocIO_1.8.0 [52] gargle_1.4.0 [53] lifecycle_1.0.3 [54] matrixStats_0.63.0 [55] stringr_1.5.0 [56] DelayedArray_0.24.0 [57] AnnotationDbi_1.60.2 [58] Biostrings_2.66.0 [59] compiler_4.2.2 [60] rlang_1.1.0 [61] grid_4.2.2 [62] RCurl_1.98-1.12 [63] VariantAnnotation_1.44.1 [64] rjson_0.2.21 [65] rappdirs_0.3.3 [66] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 [67] bitops_1.0-7 [68] restfulr_0.0.15 [69] codetools_0.2-18 [70] DBI_1.1.3 [71] curl_5.0.0 [72] R6_2.5.1 [73] GenomicAlignments_1.34.1 [74] dplyr_1.1.1 [75] rtracklayer_1.58.0 [76] fastmap_1.1.1 [77] bit_4.0.5 [78] utf8_1.2.3 [79] filelock_1.0.2 [80] stringi_1.7.12 [81] parallel_4.2.2 [82] vctrs_0.6.1 [83] png_0.1-8 [84] dbplyr_2.3.2 [85] tidyselect_1.2.0 ```
Al-Murphy commented 1 year ago

Hey!

Thanks for identifying this bug, I've made a fix and pushed it to the dev version of MSS (>=1.9.8). You can get it from bioconductor devel in the next few days our by installing the master branch now. Please reopen if this does not resolve the issue.

Cheers, Alan.