`get_genome_build`: Can't find SNP, CHR, BP in PGC VCFs

bschilder commented 2 years ago

1. Bug description

There seems to be some issues when trying to munge the Psychiatric Genomics Consortium (PGC) sumstats format, which is a bit different from the OpenGWAS format.

FIrst of all, the file names end in ".vcf.tsv.gz" which is confusing and might be tripping up our code that infers file type by extension names. Wondering if this is happening due to a slight discrepancy between how read_sumstats and read_header are inferring file type, because format_sumstats does manage to get partway through before hitting an error (it even correctly counts the number of rows!).

Here's part of the header from one of these files: Screenshot 2022-03-16 at 22 48 08

Console output

Error from the first reprex below.

Error in get_genome_build(sumstats = sumstats_return$sumstats_dt, standardise_headers = FALSE,  : 
  SNP ID column (RS ID), CHR and BP (POSITION) columns are needed to infer the genome build. These could not be
found in your dataset. Please specify the genome build manually to run format_sumstats().

Also including the full message output. MungeSumstats_log_msg.txt

Expected behaviour

format_sumstats is able to run the full pipeline and produce a munged tsv.

2. Reproducible example

Code

This produces an error

fullSS_path_vcf <- "~/Downloads/pgc-bip2021-all.vcf.tsv.gz" 
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path_vcf, 
                                              sort_coordinates = TRUE,
                                              log_folder = "~/Downloads/logs", 
                                              log_mungesumstats_msgs = TRUE, 
                                              log_folder_ind = TRUE)

But this works ok

I tried reformatting the file manually so it was undoubtedly a regular tsv file.

#### Set up paths #### 
fullSS_path_tsv <- gsub("\\.vcf","",fullSS_path_vcf) 
fullSS_path_munged <- gsub("-all","-all_munged",fullSS_path_tsv)
#### Edit ####
dat <- data.table::fread(fullSS_path_vcf, 
                         skip = "#CHROM")
colnames(dat) <- gsub("#","",colnames(dat))
#### Save ####
data.table::fwrite(x = dat, 
                   file = fullSS_path_tsv, 
                   sep="\t")
#### Munge ####
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path_tsv, 
                                              save_path = fullSS_path_munged, 
                                              sort_coordinates = TRUE,
                                              log_folder = "~/Downloads/logs", 
                                              log_mungesumstats_msgs = TRUE, 
                                              log_folder_ind = TRUE)

Data

The data can be downloaded here.

Session info

``` R version 4.1.0 (2021-05-18) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 11.4 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] arrow_7.0.0 ggimage_0.3.0 ggplot2_3.3.5 dplyr_1.0.8 hexSticker_0.4.9 [6] echotabix_0.99.3 loaded via a namespace (and not attached): [1] AnnotationHub_3.2.2 BiocFileCache_2.2.1 systemfonts_1.0.4 [4] igraph_1.2.11 BiocParallel_1.28.3 GenomeInfoDb_1.30.1 [7] digest_0.6.29 yulab.utils_0.0.4 htmltools_0.5.2 [10] magick_2.7.3 fansi_1.0.2 magrittr_2.0.2 [13] memoise_2.0.1 BSgenome_1.62.0 echoverseTemplate_0.99.0 [16] ontologyPlot_1.6 openxlsx_4.2.5 Biostrings_2.62.0 [19] matrixStats_0.61.0 R.utils_2.11.0 sysfonts_0.8.5 [22] prettyunits_1.1.1 colorspace_2.0-3 blob_1.2.2 [25] rappdirs_0.3.3 textshaping_0.3.6 xfun_0.30 [28] crayon_1.5.0 RCurl_1.98-1.6 echodata_0.99.6 [31] jsonlite_1.8.0 hexbin_1.28.2 graph_1.72.0 [34] VariantAnnotation_1.40.0 glue_1.6.2 gtable_0.3.0 [37] zlibbioc_1.40.0 XVector_0.34.0 DelayedArray_0.20.0 [40] Rgraphviz_2.38.0 BiocGenerics_0.40.0 scales_1.1.1 [43] DBI_1.1.2 Rcpp_1.0.8.2 showtextdb_3.0 [46] xtable_1.8-4 progress_1.2.2 gridGraphics_0.5-1 [49] bit_4.0.4 clisymbols_1.2.0 stats4_4.1.0 [52] DT_0.21 htmlwidgets_1.5.4 httr_1.4.2 [55] ontologyIndex_2.7 ellipsis_0.3.2 pkgconfig_2.0.3 [58] XML_3.99-0.9 R.methodsS3_1.8.1 farver_2.1.0 [61] seqminer_8.4 dbplyr_2.1.1 utf8_1.2.2 [64] here_1.0.1 ggplotify_0.1.0 tidyselect_1.1.2 [67] labeling_0.4.2 rlang_1.0.2 later_1.3.0 [70] AnnotationDbi_1.56.2 BiocVersion_3.14.0 munsell_0.5.0 [73] tools_4.1.0 cachem_1.0.6 cli_3.2.0 [76] generics_0.1.2 RSQLite_2.2.10 evaluate_0.15 [79] stringr_1.4.0 fastmap_1.1.0 yaml_2.3.5 [82] ragg_1.2.2 knitr_1.37 bit64_4.0.5 [85] fs_1.5.2 zip_2.2.0 purrr_0.3.4 [88] KEGGREST_1.34.0 gh_1.3.0 showtext_0.9-5 [91] mime_0.12 R.oo_1.24.0 xml2_1.3.3 [94] biomaRt_2.50.3 brio_1.1.3 compiler_4.1.0 [97] rstudioapi_0.13 interactiveDisplayBase_1.32.0 filelock_1.0.2 [100] curl_4.3.2 png_0.1-7 testthat_3.1.2 [103] paintmap_1.0 tibble_3.1.6 stringi_1.7.6 [106] GenomicFeatures_1.46.5 desc_1.4.1 lattice_0.20-45 [109] Matrix_1.4-0 vctrs_0.3.8 pillar_1.7.0 [112] lifecycle_1.0.1 BiocManager_1.30.16 data.table_1.14.2 [115] bitops_1.0-7 httpuv_1.6.5 rtracklayer_1.54.0 [118] GenomicRanges_1.46.1 R6_2.5.1 BiocIO_1.4.0 [121] promises_1.2.0.1 IRanges_2.28.0 ontoProc_1.16.0 [124] assertthat_0.2.1 pkgload_1.2.4 SummarizedExperiment_1.24.0 [127] rprojroot_2.0.2 rjson_0.2.21 withr_2.5.0 [130] GenomicAlignments_1.30.0 Rsamtools_2.10.0 S4Vectors_0.32.3 [133] GenomeInfoDbData_1.2.7 parallel_4.1.0 hms_1.1.1 [136] grid_4.1.0 ggfun_0.0.5 tidyr_1.2.0 [139] rmarkdown_2.13 MatrixGenerics_1.6.0 piggyback_0.1.1 [142] Biobase_2.54.0 shiny_1.7.1 lubridate_1.8.0 [145] restfulr_0.0.13 ```

bschilder commented 2 years ago

One place to start might be adding ".vcf.tsv", ".vcf.tsv.gz", and ".vcf.tsv.bgz" as an additional possibilities in MungeSumstats:::supported_suffixes():

Screenshot 2022-03-16 at 22 57 53

Al-Murphy commented 2 years ago

Seems to be an issue reading the header of this VCF (probably to do with the format?) with variant annotation:

VariantAnnotation::scanVcfHeader(path)
Error in scanBcfHeader(bf) : [internal] _hts_rewind() failed

Al-Murphy commented 2 years ago

Other than that it will run once I add #CHROM to the mapping file. So my solution is to add an error catch to VariantAnnotation::scanVcfHeader(path) in read_header() and use your other approach:

header <- readLines(path, n = 100)
            i <- which(startsWith(header, "#CHR"))
            header <- data.table::fread(text = header[seq(i, i + n)],
                                        nThread = 1)

For vcfs if this fails. Let me know if there is any downstream issue of not using variantAnnotation approach? I'll add these fixes into the solution I have for Indels to be pushed to the master branch (not current) so use the github version for this fix (until late April when it's released)

bschilder commented 2 years ago

Excellent, I think this sounds like a solid solution to me. Thanks, Alan!

I'll keep you posted about any downstream issues. Currently dealing with one from my more manual solution above, with some weird errors about the rows being out of order (when they don't seem to be) during tabix indexing:

Al-Murphy commented 2 years ago

Added to master branch

Al-Murphy / MungeSumstats