`format_sumstats()` fails to infer genome build

mykmal commented 1 year ago

1. Bug description

The get_genome_build() function assumes that the CHR, BP, and SNP columns are already properly formatted when attempting to match them with the GRCh37 and GRCh38 reference genomes. However, get_genome_build() gets called by format_sumstats() before those columns are standardized, causing assembly inference to fail for GWAS summary statistics files that don't already match the MungeSumstats format.

As an illustration of this issue, the example below shows that format_sumstats() incorrectly concludes the reference genome for a GWAS summary statistics file that has chromosome numbers coded with the "chr" prefix. In this example, get_genome_build() actually fails to find any matching SNPs for either one of the reference genomes.

Console output

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /tmp/Rtmpd9Xwka/file4bc5e38140f.tsv.gz
Reading header.
Tabular format detected.
Importing tabular bgz file: gwas_examples/ALS_Nicolas.tab.bgz
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Removing 1 empty columns.
Standardising column headers.
First line of summary statistics file: 
seqnames    start   end width   strand  P   SNP Beta    SE  A1  A2  MAF 
Summary statistics report:
   - 10,031,417 rows
   - 8,900,164 unique variants
   - 182 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10,000 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:utils’:

    findMatches

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

BSgenome::snpsById done in 208 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 239 seconds.
Inferred genome build: GRCH38
WARNING: Less than 10% of your sampled SNPs matched that of either reference genome, this may question the quality of your summary statistics file.
Checking SNP RSIDs.
1,131,254 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 8,919,160 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 146 seconds.
Found 1,947 Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
11,876 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 8,908,219 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 152 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 22,135 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 4,781,751 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Found 169,247 Indels. These won't be checked for duplicates based on RS ID as there can be multiples.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
192 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
Found 169,247 Indels. These won't be checked for duplicates based on base-pair position as there can be multiples.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Checking for duplicated rows.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
4,084,002 SNPs are non-biallelic. These will be removed.
Warning: When method is an integer, must be >0.
4,318,323 SNPs (89.9%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /tmp/Rtmpd9Xwka/file4bc5e38140f.tsv.gz
Summary statistics report:                                                                                                           
   - 4,802,274 rows (47.9% of original 10,031,417 rows)
   - 4,802,272 unique variants
   - 88 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 24.299 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR    BP A1 A2   END WIDTH STRAND       P    BETA     SE    FRQ
1:  rs58108140   1 10583  G  A 10583     1      * 0.42730 -0.0312 0.0393 0.8411
2: rs116400033   1 51479  T  A 51479     1      * 0.05465 -0.0711 0.0370 0.8171
3: rs146477069   1 54421  A  G 54421     1      * 0.77250  0.0240 0.0830 0.9648
4:   rs2462492   1 54676  C  T 54676     1      * 0.16360  0.0567 0.0407 0.8457
Returning path to saved data.
[1] "/tmp/Rtmpd9Xwka/file4bc5e38140f.tsv.gz"
Warning messages:
1: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP155.GRCh37’ 
2: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP155.GRCh38’

Expected behaviour

I manually confirmed that my GWAS file adheres to the GRCh37 genome assembly. MSS should be able to identify this file as GRCh37, and then perform liftover to convert it to GRCh38.

2. Reproducible example

Code

MungeSumstats::format_sumstats("gwas_examples/ALS_Nicolas.tab.bgz", convert_ref_genome = "GRCh38")

Data

Head of the original GWAS summary statistics file:

seqnames        start   end     width   strand  P       SNP     Beta    SE      A1      A2      N       MAF
chr1    10583   10583   1       *       0.4273  rs58108140      0.0312  0.0393  A       G       NA      0.1589
chr1    13302   13302   1       *       0.8982  rs180734498     0.0065  0.0508  T       C       NA      0.0934
chr1    13327   13327   1       *       0.1626  rs2691329       0.1265  0.0906  C       G       NA      0.0297
chr1    30923   30923   1       *       0.7467  rs806731        -0.0114 0.0353  T       G       NA      0.7843
chr1    51479   51479   1       *       0.05465 rs116400033     0.0711  0.037   A       T       NA      0.1829
chr1    52058   52058   1       *       0.6303  rs62637813      -0.0436 0.0906  C       G       NA      0.0282
chr1    54421   54421   1       *       0.7725  rs146477069     0.024   0.083   A       G       NA      0.9648
chr1    54490   54490   1       *       0.02281 rs141149254     0.0963  0.0423  A       G       NA      0.1353
chr1    54676   54676   1       *       0.1636  rs2462492       -0.0567 0.0407  T       C       NA      0.1543
chr1    54753   54753   1       *       0.4894  rs143174675     0.0766  0.1108  T       G       NA      0.9812
chr1    55249   55249   1       *       0.2834  NA      -0.096  0.0895  CTATGG  C       NA      0.0306
chr1    55299   55299   1       *       0.3212  rs10399749      -0.0371 0.0374  T       C       NA      0.1784
chr1    58814   58814   1       *       0.8672  rs114420996     0.0096  0.0574  A       G       NA      0.0693
chr1    59040   59040   1       *       0.655   rs62637815      -0.0265 0.0593  T       C       NA      0.9363
chr1    60726   60726   1       *       0.8935  rs192328835     0.0167  0.1247  A       C       NA      0.0177000000000001
chr1    61462   61462   1       *       0.2692  rs56992750      0.1307  0.1183  A       T       NA      0.0141
chr1    61987   61987   1       *       0.1365  rs76735897      0.0472  0.0317  A       G       NA      0.6973
chr1    61989   61989   1       *       0.1365  rs77573425      -0.0472 0.0317  C       G       NA      0.3027
chr1    63671   63671   1       *       0.3763  rs80011619      0.0361  0.0408  A       G       NA      0.1425
chr1    63735   63735   1       *       0.4454  NA      -0.0219 0.0287  CCTA    C       NA      0.5041
chr1    64649   64649   1       *       0.7888  rs181431124     -0.0284 0.106   A       C       NA      0.9776
chr1    66162   66162   1       *       0.2149  rs62639105      0.0382  0.0308  A       T       NA      0.67
chr1    66507   66507   1       *       0.5922  rs12401368      0.0549  0.1025  A       T       NA      0.0232
chr1    69511   69511   1       *       0.7062  rs2691305       0.0141  0.0374  A       G       NA      0.1869

Additional troubleshooting

To better understand the issue, I injected browser() right after the following lines: https://github.com/neurogenomics/MungeSumstats/blob/329185602f294413c02b06452e4c0f72171dd5e5/R/get_genome_build.R#L140-L150

Surprisingly, zero matches were found on either reference genome:

Browse[1]> num_37
[1] 0
Browse[1]> num_38
[1] 0

But after removing the "chr" prefix from each entry in the CHR column of sumstats, we get the expected behavior:

Browse[1]> sumstats$CHR <- gsub("chr", "", sumstats$CHR)
Browse[1]> nrow(snp_loc_data_37[sumstats, ,
+                      on = c("SNP" = "SNP", "pos" = "BP", "seqnames" = "CHR"),
+                      nomatch = FALSE])
[1] 9981
Browse[1]> nrow(snp_loc_data_38[sumstats, ,
+                      on = c("SNP" = "SNP", "pos" = "BP", "seqnames" = "CHR"),
+                      nomatch = FALSE])
[1] 87

3. Session info

``` R version 4.3.0 (2023-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: Etc/UTC tzcode source: system (glibc) attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] MungeSumstats_1.8.0 loaded via a namespace (and not attached): [1] KEGGREST_1.40.0 SummarizedExperiment_1.30.2 rjson_0.2.21 gargle_1.5.1 Biobase_2.60.0 lattice_0.21-8 [7] generics_0.1.3 vctrs_0.6.2 tools_4.3.0 bitops_1.0-7 curl_5.0.1 stats4_4.3.0 [13] parallel_4.3.0 tibble_3.2.1 fansi_1.0.4 AnnotationDbi_1.62.1 RSQLite_2.3.1 blob_1.2.4 [19] pkgconfig_2.0.3 R.oo_1.25.0 Matrix_1.5-4 data.table_1.14.8 BSgenome_1.68.0 dbplyr_2.3.2 [25] S4Vectors_0.38.1 assertthat_0.2.1 lifecycle_1.0.3 GenomeInfoDbData_1.2.10 compiler_4.3.0 stringr_1.5.0 [31] Rsamtools_2.16.0 Biostrings_2.68.1 progress_1.2.2 codetools_0.2-19 GenomeInfoDb_1.36.1 RCurl_1.98-1.12 [37] yaml_2.3.7 pillar_1.9.0 crayon_1.5.2 R.utils_2.12.2 BiocParallel_1.34.2 DelayedArray_0.26.3 [43] cachem_1.0.7 googleAuthR_2.0.1 tidyselect_1.2.0 digest_0.6.31 stringi_1.7.12 dplyr_1.1.2 [49] restfulr_0.0.15 VariantAnnotation_1.46.0 biomaRt_2.56.1 fastmap_1.1.1 grid_4.3.0 cli_3.6.1 [55] magrittr_2.0.3 S4Arrays_1.0.4 GenomicFeatures_1.52.0 utf8_1.2.3 XML_3.99-0.14 rappdirs_0.3.3 [61] filelock_1.0.2 prettyunits_1.1.1 bit64_4.0.5 XVector_0.40.0 httr_1.4.6 matrixStats_1.0.0 [67] bit_4.0.5 png_0.1-8 R.methodsS3_1.8.2 hms_1.1.3 memoise_2.0.1 GenomicRanges_1.52.0 [73] IRanges_2.34.0 BiocIO_1.10.0 BiocFileCache_2.8.0 rtracklayer_1.60.0 rlang_1.1.1 glue_1.6.2 [79] DBI_1.1.3 xml2_1.3.4 BiocGenerics_0.46.0 jsonlite_1.8.5 rstudioapi_0.14 R6_2.5.1 [85] fs_1.6.2 MatrixGenerics_1.12.2 GenomicAlignments_1.36.0 zlibbioc_1.46.0 ```

Al-Murphy commented 1 year ago

Hey! Thanks for the detailed report. Some things to note:

Moreover, I think that get_genome_build() should fail if no matches are found, rather than defaulting to GRCh38 (which is the current behavior). I agree, I have added this fix - if no match is found in either genome build MSS will now fail.

incorrectly concludes the reference genome for a GWAS summary statistics file that has chromosome numbers coded with the "chr" prefix I have added a check for this in the check genome build function so it should work as expected now but let me know if it doesn't?

Instead, I suggest rearranging the QC steps in format_sumstats() to put all QC steps that don't rely on the GWAS assembly before the call to get_genome_build() I had a look through all our 60+ checks and the vast majority require a genome reference. Moreover, these checks are interdependent so even if a check itself doesn't require a genome reference, a check it relies on might. This gets quite complicated so I have avoided moving any checks for SNP, CHR and BP to above the inference of genome build for now. However, do let me know if you come across any other examples other than the 'CHR' prefix which causes an issue with inferring the genome build and I'll be happy to update.

These changes have been added to MSS v1.9.9. You can install this from Github now or wait a few days for it to go up on the devel version of Bioconductor. Let me know if your issue persists (feel free to reopen this issue if so).

Cheers, Alan.

mykmal commented 1 year ago

Thank you for fixing this bug so quickly! The updated version now works correctly on my example GWAS file.

As for rearranging the QC checks, I can see why doing so would be more complicated than I initially imagined. I'll let you know if I run into any other GWAS formats that break get_genome_build().

Al-Murphy / MungeSumstats