Error in opening specified tabix file and creating tabix index

PhoebeGuo97 commented 2 years ago

1. Bug description

I have munged my summary statistics. I ran fine map_loci()and got errors.

Console output

[1] "+ CONDA:: Activating conda env 'echoR'"
[1] "Checking for tabix installation..."
[1] "Checking for bcftools installation..."
[1] "+ Assuming sumstats have already been processed with MungeSumstats"

) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( (
1 (1 / 29)
) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( (
[1] "+ Extracting relevant variants from fullSS..."
[1] "+ Query Method: tabix"
[1] "+ QUERY: Chromosome = 1 ; Min position = 161145392 ; Max position = 161165392"
[1] "TABIX:: Converting full summary stats file to tabix format for fast querying..."
[1] "TABIX:: Ensuring file is bgzipped."
[1] "TABIX:: Tabix-indexing file."
[ti_index_core] the file out of order at line 1253647
Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz ]!
[1] "Determining chrom type from file header"
[1] "Chromosome format = 1"
[1] "TABIX:: Extracting subset of sum stats"
Cannot open specified tabix file: /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz
Cannot open specified tabix file: /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz
Error in strsplit(body, "\t") : non-character argument
Fine-mapping complete in:
Time difference of 1.9 mins

) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( (
10 (2 / 29)
) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( (
[1] "+ Extracting relevant variants from fullSS..."
[1] "+ Query Method: tabix"
[1] "+ QUERY: Chromosome = 7 ; Min position = 99961834 ; Max position = 99981834"
[1] "TABIX:: Converting full summary stats file to tabix format for fast querying..."
[1] "TABIX:: Ensuring file is bgzipped."
[1] "TABIX:: Tabix-indexing file."
[ti_index_core] the file out of order at line 1253647
Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz ]!

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
9 (29 / 29)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "+ Extracting relevant variants from fullSS..."
[1] "+ Query Method: tabix"
[1] "+ QUERY: Chromosome = 6 ; Min position = 47422637 ; Max position = 47442637"
[1] "TABIX:: Converting full summary stats file to tabix format for fast querying..."
[1] "TABIX:: Ensuring file is bgzipped."
[1] "TABIX:: Tabix-indexing file."
[ti_index_core] the file out of order at line 1253647
Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz ]!
[1] "Determining chrom type from file header"
[1] "Chromosome format = 1"
[1] "TABIX:: Extracting subset of sum stats"
Cannot open specified tabix file: /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz
Cannot open specified tabix file: /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a364f2c78f.tsv.bgz
Error in strsplit(body, "\t") : non-character argument
Fine-mapping complete in:
Time difference of 1.9 mins
[1] "+ Identifying Consensus SNPs..."
[1] "++ support_thresh = 2"
[1] "++ top_CS_only=FALSE"
[1] "+ Calculating mean Posterior Probability (mean.PP)..."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: SNP

Expected behaviour

(A clear and concise description of what you expected to happen.)

2. Reproducible example

Code

This is how I munged my summary statistics: fullSS_path <- "/Users/fguo/Desktop/GWAS/Jansen2019/AD_sumstats_Jansenetal_2019sept.txt.gz" fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, ref_genome = "GRCH37")

Jansen_2019.results <- finemap_loci(
  # GENERAL ARGUMENTS
  top_SNPs = top_SNPs,
  #  It's best to give absolute paths
  results_dir = "/Users/fguo/Desktop/GWAS/echolocatoR",
  loci = top_SNPs$Locus,
  dataset_name = "Jansen_2019",
  dataset_type = "GWAS",
  force_new_subset = FALSE,
  force_new_LD = FALSE,
  force_new_finemap = TRUE,
  remove_tmps = FALSE,

  # Munge full sumstats first
  munged = TRUE,

  # SUMMARY STATS ARGUMENTS
  fullSS_path = fullSS_path,
  fullSS_genome_build = "hg19",
  query_by ="tabix",
  MAF_col = "EAF",
  effect_col = "BETA",
  stderr_col = "SE",
  sample_size = "Nsum",
  N_cases_col = "Neff",
  A1_col = "A1",
  A2_col = "A2",

  # FILTERING ARGUMENTS
  ## It's often desirable to use a larger window size
  ## (e.g. 2Mb which is bp_distance=500000*2),
  ## but we use a small window here to speed up the process.
  bp_distance = 10000,#500000*2,
  min_MAF = 0.001,
  trim_gene_limits = FALSE,

  # FINE-MAPPING ARGUMENTS
  ## General
  finemap_methods = c("ABF","FINEMAP","SUSIE","POLYFUN_SUSIE"),
  n_causal = 5,
  PP_threshold = .95,

  # LD ARGUMENTS
  LD_reference = "UKB",#"UKB",
  #superpopulation = "EUR",
  download_method = "axel",

  # PLOT ARGUMENTS
  ## general
  plot.types = c("simple"),
  plot.zoom = c("all","4x","10x"),
  ## XGR
  # plot.XGR_libnames=c("ENCODE_TFBS_ClusteredV3_CellTypes"),
  ## Roadmap
  plot.Roadmap = FALSE,
  plot.Roadmap_query = NULL,
  # Nott et al. (2019)
  plot.Nott_epigenome = F,
  plot.Nott_show_placseq = F,

  verbose = TRUE
)

Data

This is the output of the step of MungeSumstats Successfully finished preparing sumstats file, preview: Reading header. SNP CHR BP A1 A2 UNIQID.A1A2 Z P NSUM N DIRECTION FRQ BETA SE 1: rs12184267 1 715265 C T 1:715265_T_C -2.121973 0.03384 359856 359856 ??+? 0.9591931 -0.01264265 0.005957967 2: rs12184277 1 715367 A G 1:715367_G_A -1.957915 0.05024 360230 360230 ??+? 0.9589313 -0.01162351 0.005936678 3: rs12184279 1 717485 C A 1:717485_A_C -1.912438 0.05582 360257 360257 ??+? 0.9594241 -0.01141891 0.005970864 4: rs116801199 1 720381 G T 1:720381_T_G -2.295404 0.02171 360980 360980 ??+? 0.9578380 -0.01344289 0.005856439 Returning path to saved data.

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] pbmcapply_1.5.1 patchwork_1.1.1 plotly_4.10.0 ggplot2_3.3.6 dplyr_1.0.9 data.table_1.14.2 [7] MungeSumstats_1.4.5 echolocatoR_0.2.3 loaded via a namespace (and not attached): [1] bitops_1.0-7 matrixStats_0.62.0 [3] fs_1.5.2 bit64_4.0.5 [5] filelock_1.0.2 progress_1.2.2 [7] httr_1.4.3 GenomeInfoDb_1.32.2 [9] googleAuthR_2.0.0 bslib_0.4.0 [11] tools_4.2.1 DT_0.23 [13] utf8_1.2.2 R6_2.5.1 [15] lazyeval_0.2.2 DBI_1.1.3 [17] BiocGenerics_0.42.0 colorspace_2.0-3 [19] withr_2.5.0 tidyselect_1.1.2 [21] prettyunits_1.1.1 bit_4.0.4 [23] curl_4.3.2 compiler_4.2.1 [25] cli_3.3.0 Biobase_2.56.0 [27] xml2_1.3.3 DelayedArray_0.22.0 [29] sass_0.4.2 rtracklayer_1.56.1 [31] scales_1.2.0 rappdirs_0.3.3 [33] stringr_1.4.0 digest_0.6.29 [35] Rsamtools_2.12.0 rmarkdown_2.14 [37] R.utils_2.12.0 XVector_0.36.0 [39] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 pkgconfig_2.0.3 [41] htmltools_0.5.3 MatrixGenerics_1.8.1 [43] dbplyr_2.2.1 fastmap_1.1.0 [45] BSgenome_1.64.0 htmlwidgets_1.5.4 [47] rlang_1.0.4 rstudioapi_0.13 [49] RSQLite_2.2.15 jquerylib_0.1.4 [51] BiocIO_1.6.0 generics_0.1.3 [53] jsonlite_1.8.0 crosstalk_1.2.0 [55] BiocParallel_1.30.3 zip_2.2.0 [57] R.oo_1.25.0 VariantAnnotation_1.42.1 [59] RCurl_1.98-1.8 magrittr_2.0.3 [61] GenomeInfoDbData_1.2.8 Matrix_1.4-1 [63] Rcpp_1.0.9 munsell_0.5.0 [65] S4Vectors_0.34.0 fansi_1.0.3 [67] reticulate_1.25 lifecycle_1.0.1 [69] R.methodsS3_1.8.2 stringi_1.7.8 [71] yaml_2.3.5 SummarizedExperiment_1.26.1 [73] zlibbioc_1.42.0 BiocFileCache_2.4.0 [75] grid_4.2.1 blob_1.2.3 [77] crayon_1.5.1 lattice_0.20-45 [79] Biostrings_2.64.0 GenomicFeatures_1.48.3 [81] hms_1.1.1 KEGGREST_1.36.3 [83] seqminer_8.4 knitr_1.39 [85] pillar_1.8.0 GenomicRanges_1.48.0 [87] rjson_0.2.21 codetools_0.2-18 [89] biomaRt_2.52.0 stats4_4.2.1 [91] XML_3.99-0.10 glue_1.6.2 [93] evaluate_0.15 SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 [95] png_0.1-7 vctrs_0.4.1 [97] tidyr_1.2.0 gtable_0.3.0 [99] purrr_0.3.4 assertthat_0.2.1 [101] cachem_1.0.6 openxlsx_4.2.5 [103] xfun_0.31 restfulr_0.0.15 [105] viridisLite_0.4.0 gargle_1.2.0 [107] tibble_3.1.8 GenomicAlignments_1.32.1 [109] AnnotationDbi_1.58.0 memoise_2.0.1 [111] IRanges_2.30.0 ellipsis_0.3.2 ```

bschilder commented 2 years ago

Thanks for the information @PhoebeGuo97

The error stems from the fact that your sumstats are not sorted by genomic coordinates, which is a requirement for tabix

[ti_index_core] the file out of order at line 1253647

Option 1

Use the argument tabix_index=TRUE in MungeSumstats::format_sumstats when processing your GWAS. This will automatically ensure that the file is sorted first before indexing.

Option 2

Alternatively, you can sort the file yourself and resave it. Using the sort_coordinates() function in the echotabix is one way to do this (which is under development but should still work). https://github.com/RajLabMSSM/echotabix

PhoebeGuo97 commented 2 years ago

Hello, I tried the option 1, i.e. fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, ref_genome = "GRCH37",tabix_index=TRUE) but it did not solve the issue.

Here is what I saw in Console after re-running MungeSumstats: Tabix-indexing file. [ti_index_core] the file out of order at line 1253647 Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a3e5725dc.tsv.bgz ]! Summary statistics report:

11,773,806 rows (88.1% of original 13,367,299 rows)
- 11,773,806 unique variants
- 1,856 genome-wide significant variants (P<5e-8)
- 22 chromosomes Successfully finished preparing sumstats file, preview: Reading header. SNP CHR BP A1 A2 UNIQID.A1A2 Z P NSUM N DIRECTION FRQ BETA SE 1: rs12184267 1 715265 C T 1:715265_T_C -2.121973 0.03384 359856 359856 ??+? 0.9591931 -0.01264265 0.005957967 2: rs12184277 1 715367 A G 1:715367_G_A -1.957915 0.05024 360230 360230 ??+? 0.9589313 -0.01162351 0.005936678 3: rs12184279 1 717485 C A 1:717485_A_C -1.912438 0.05582 360257 360257 ??+? 0.9594241 -0.01141891 0.005970864 4: rs116801199 1 720381 G T 1:720381_T_G -2.295404 0.02171 360980 360980 ??+? 0.9578380 -0.01344289 0.005856439 5: rs12565286 1 721290 G C 1:721290_C_G -2.315602 0.02058 360823 360823 ??+? 0.9576224 -0.01353111 0.005843450 Returning path to saved data. Warning message: In [.data.table(sumstats_dt, , :=((names(empty_cols)), NULL)) : Column 'NA' does not exist to remove

bschilder commented 2 years ago

Tabix sorting/indexing is now generally more stable in the echoverse branch. Could you try using this updated version? (it will be pushed to master soon as well):

~~remotes::install_github("RajLabMSSM/echolocatoR@echoverse")~~

remotes::install_github("RajLabMSSM/echolocatoR")

PhoebeGuo97 commented 2 years ago

I wonder if I need to change any scripts to use the updated version? Thanks!

bschilder commented 2 years ago

Hi @PhoebeGuo97, yes quite a bit has changed. I've just uploaded some details about these changes here: https://github.com/RajLabMSSM/echolocatoR#echolocator-v10-vsv20

RajLabMSSM / echolocatoR