RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
34 stars 11 forks source link

Error in standardize_subset #109

Closed PhoebeGuo97 closed 1 year ago

PhoebeGuo97 commented 2 years ago

1. Bug description

Due to an issue as described in https://github.com/neurogenomics/MungeSumstats/issues/117, I ran MSS with tabix_index=FALSE and then ran tabix.createIndex(fullSS_path). After that, running finemap_loci got an error: Error in standardize_subset(locus = locus, top_SNPs = top_SNPs, fullSS_genome_build = fullSS_genome_build, : Could not find any rows in full data that matched query :(

Any help will be appreciated!

Expected behaviour

(A clear and concise description of what you expected to happen.)

2. Reproducible example

Code

fullSS_path <- "/mnt/belinda_local/fangfei/data/Jansen2019/AD_sumstats_Jansenetal_2019sept.txt.gz"
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, ref_genome = "GRCH37",tabix_index=FALSE,force_new=TRUE, save_path='/mnt/belinda_local/fangfei/data/Jansen2019/formatted_Jansen2019.tsv.gz')

tabix.createIndex(fullSS_path)

Jansen_2019.results <- finemap_loci(
  # GENERAL ARGUMENTS
  top_SNPs = top_SNPs,
  #  It's best to give absolute paths
  results_dir = "/mnt/belinda_local/fangfei/data/echolocatoR",
  loci = top_SNPs$Locus,
  dataset_name = "Jansen_2019",
  dataset_type = "GWAS",
  force_new_subset = FALSE,
  force_new_LD = FALSE,
  force_new_finemap = TRUE,
  remove_tmps = FALSE,

  # Munge full sumstats first
  munged = TRUE,

  # SUMMARY STATS ARGUMENTS
  fullSS_path = fullSS_path,
  fullSS_genome_build = "hg19",
  query_by ="tabix",
  MAF_col = "EAF",
  effect_col = "BETA",
  stderr_col = "SE",
  sample_size = "Nsum",
  N_cases_col = "Neff",
  A1_col = "A1",
  A2_col = "A2",

  # FILTERING ARGUMENTS
  ## It's often desirable to use a larger window size
  ## (e.g. 2Mb which is bp_distance=500000*2),
  ## but we use a small window here to speed up the process.
  bp_distance = 10000,#500000*2,
  min_MAF = 0.001,
  trim_gene_limits = FALSE,

  # FINE-MAPPING ARGUMENTS
  ## General
  finemap_methods = c("ABF","FINEMAP","SUSIE","POLYFUN_SUSIE"),
  n_causal = 5,
  PP_threshold = .95,

  # LD ARGUMENTS
  LD_reference = "UKB",#"UKB",
  #superpopulation = "EUR",
  download_method = "axel",

  # PLOT ARGUMENTS
  ## general
  plot.types = c("simple"),
  ## Generate multiple plots of different window sizes;
  ### all SNPs, 4x zoomed-in, and a 50000bp window
  plot.zoom = c("all","4x","10x"),
  ## XGR
  # plot.XGR_libnames=c("ENCODE_TFBS_ClusteredV3_CellTypes"),
  ## Roadmap
  plot.Roadmap = FALSE,
  plot.Roadmap_query = NULL,
  # Nott et al. (2019)
  plot.Nott_epigenome = F,
  plot.Nott_show_placseq = F,

  verbose = TRUE
)

Console output

[1] "+ CONDA:: 'echoR' conda environment not found. Using default 'base' instead."
[1] "Checking for tabix installation..."
[1] "Checking for bcftools installation..."
[1] "+ Assuming sumstats have already been processed with MungeSumstats"

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
1 (1 / 29)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "+ Extracting relevant variants from fullSS..."
[1] "+ Query Method: tabix"
[1] "+ QUERY: Chromosome = 1 ; Min position = 161145392 ; Max position = 161165392"
[1] "TABIX:: Copying existing tabix file ==> /mnt/belinda_local/fangfei/data/Jansen2019/formatted_Jansen2019.tsv.gz"
[1] "Determining chrom type from file header"
[1] "Chromosome format = 1"
[1] "TABIX:: Extracting subset of sum stats"
[1] "+ TABIX:: Returning 0 x 0 data.table"
[1] "++ Saving query ==> /mnt/belinda_local/fangfei/data/echolocatoR/GWAS/Jansen_2019/1/1_Jansen_2019_subset.tsv.gz"
[1] "LD:: Standardizing summary statistics subset."
Error in standardize_subset(locus = locus, top_SNPs = top_SNPs, fullSS_genome_build = fullSS_genome_build,  : 
  Could not find any rows in full data that matched query :(
In addition: Warning messages:
1: In data.table::fwrite(dat, file = subset_path, nThread = nThread,  :
  Input has no columns; creating an empty file at '/mnt/belinda_local/fangfei/data/echolocatoR/GWAS/Jansen_2019/1/1_Jansen_2019_subset.tsv.gz' and exiting.
2: In data.table::fread(subset_path, nrows = 2) :
  File '/mnt/belinda_local/fangfei/data/echolocatoR/GWAS/Jansen_2019/1/1_Jansen_2019_subset.tsv.gz' has size 0. Returning a NULL data.table.

Data

https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

``` R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] seqminer_8.4 GenomeInfoDb_1.33.7 IRanges_2.31.2 S4Vectors_0.35.3 BiocGenerics_0.43.4 loaded via a namespace (and not attached): [1] utf8_1.2.2 reticulate_1.26 R.utils_2.12.0 tidyselect_1.1.2 [5] RSQLite_2.2.17 AnnotationDbi_1.59.1 htmlwidgets_1.5.4 grid_4.2.1 [9] BiocParallel_1.31.12 XGR_1.1.7 munsell_0.5.0 codetools_0.2-18 [13] interp_1.1-3 DT_0.25 colorspace_2.0-3 OrganismDbi_1.39.1 [17] Biobase_2.57.1 filelock_1.0.2 knitr_1.40 supraHex_1.35.0 [21] rstudioapi_0.14 DescTools_0.99.46 MatrixGenerics_1.9.1 GenomeInfoDbData_1.2.8 [25] mixsqp_0.3-43 bit64_4.0.5 echoconda_0.99.6 basilisk_1.9.10 [29] vctrs_0.4.1 generics_0.1.3 xfun_0.33 biovizBase_1.45.0 [33] BiocFileCache_2.5.0 R6_2.5.1 AnnotationFilter_1.21.0 bitops_1.0-7 [37] cachem_1.0.6 reshape_0.8.9 DelayedArray_0.23.2 assertthat_0.2.1 [41] BiocIO_1.7.1 scales_1.2.1 nnet_7.3-17 rootSolve_1.8.2.3 [45] gtable_0.3.1 lmom_2.9 ggbio_1.45.0 ensembldb_2.21.4 [49] rlang_1.0.5 MungeSumstats_1.5.13 echodata_0.99.12 splines_4.2.1 [53] lazyeval_0.2.2 rtracklayer_1.57.0 gargle_1.2.1 dichromat_2.0-0.1 [57] hexbin_1.28.2 checkmate_2.1.0 BiocManager_1.30.18 yaml_2.3.5 [61] reshape2_1.4.4 snpStats_1.47.1 GenomicFeatures_1.49.6 ggnetwork_0.5.10 [65] backports_1.4.1 Hmisc_4.7-1 RBGL_1.73.0 tools_4.2.1 [69] echoplot_0.99.3 ggplot2_3.3.6 ellipsis_0.3.2 RColorBrewer_1.1-3 [73] proxy_0.4-27 coloc_5.2.0 Rcpp_1.0.9 plyr_1.8.7 [77] base64enc_0.1-3 progress_1.2.2 zlibbioc_1.43.0 purrr_0.3.4 [81] RCurl_1.98-1.8 basilisk.utils_1.9.3 prettyunits_1.1.1 rpart_4.1.16 [85] deldir_1.0-6 viridis_0.6.2 SummarizedExperiment_1.27.3 ggrepel_0.9.1 [89] cluster_2.1.4 fs_1.5.2 magrittr_2.0.3 data.table_1.14.2 [93] echotabix_0.99.7 dnet_1.1.7 openxlsx_4.2.5 mvtnorm_1.1-3 [97] ProtGenerics_1.29.0 matrixStats_0.62.0 patchwork_1.1.2 hms_1.1.2 [101] XML_3.99-0.10 echolocatoR_2.0.0 jpeg_0.1-9 readxl_1.4.1 [105] gridExtra_2.3 compiler_4.2.1 biomaRt_2.53.2 tibble_3.1.8 [109] crayon_1.5.1 R.oo_1.25.0 htmltools_0.5.3 echoannot_0.99.7 [113] tzdb_0.3.0 Formula_1.2-4 tidyr_1.2.1 expm_0.999-6 [117] Exact_3.1 DBI_1.1.3 dbplyr_2.2.1 MASS_7.3-58.1 [121] rappdirs_0.3.3 boot_1.3-28 Matrix_1.5-1 readr_2.1.2 [125] piggyback_0.1.4 cli_3.4.0 R.methodsS3_1.8.2 echofinemap_0.99.2 [129] parallel_4.2.1 igraph_1.3.4 GenomicRanges_1.49.1 pkgconfig_2.0.3 [133] GenomicAlignments_1.33.1 dir.expiry_1.5.1 RCircos_1.2.2 foreign_0.8-82 [137] xml2_1.3.3 XVector_0.37.1 echoLD_0.99.6 stringr_1.4.1 [141] VariantAnnotation_1.43.3 digest_0.6.29 graph_1.75.0 Biostrings_2.65.6 [145] cellranger_1.1.0 htmlTable_2.4.1 gld_2.6.5 restfulr_0.0.15 [149] curl_4.3.2 Rsamtools_2.13.4 rjson_0.2.21 lifecycle_1.0.2 [153] nlme_3.1-159 jsonlite_1.8.0 viridisLite_0.4.1 BSgenome_1.65.2 [157] fansi_1.0.3 susieR_0.12.27 downloadR_0.99.3 pillar_1.8.1 [161] lattice_0.20-45 GGally_2.1.2 KEGGREST_1.37.3 fastmap_1.1.0 [165] httr_1.4.4 survival_3.4-0 googleAuthR_2.0.0 glue_1.6.2 [169] remotes_2.4.2 zip_2.2.1 png_0.1-7 bit_4.0.4 [173] Rgraphviz_2.41.1 class_7.3-20 stringi_1.7.8 blob_1.2.3 [177] latticeExtra_0.6-30 memoise_2.0.1 dplyr_1.0.10 irlba_2.3.5 [181] e1071_1.7-11 ape_5.6-2 ```
bschilder commented 2 years ago

Hi @PhoebeGuo97, it looks like you're following the documentation from echolocatoR 1.0 with echolocatoR 2.0. Try running the example at the bottom of the help page that appears when you use: ?echolocatoR::finemap_loci

I realize this is a bit confusing as the documentation website has not yet been updated (i have to get it passing all GitHub Actions checks first). So in the meantime, please follow the documentation via the ?<function_name> help pages instead.

PhoebeGuo97 commented 2 years ago

Hi @bschilder, I tried using echolocatoR 2.0 and running the example. I got a new error: Error: object 'find_executables_remote' is not exported by 'namespace:echoconda'.

bschilder commented 1 year ago

Hi @bschilder, I tried using echolocatoR 2.0 and running the example. I got a new error: Error: object 'find_executables_remote' is not exported by 'namespace:echoconda'.

Hi @PhoebeGuo97, sorry for the delay. This was due to a update in echoconda where the function name had changed. Can you try updating all the echoverse packages and try again?

devtools::install_github("RajLabMSSM/echolocatoR", dependencies = TRUE)

Best, Brian

bschilder commented 1 year ago

After reinstalling, can you try the following example @PhoebeGuo97 ? This seems to work for me.

#### Munge sumstats ####
fullSS_path <- MungeSumstats::format_sumstats(path = "https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz",
                                              ref_genome = "GRCH37",
                                              tabix_index=FALSE,
                                              force_new=TRUE)
#### Prepare topSNPs dataframe ####
topSNPs <- echodata::import_topSNPs(topSS = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-018-0311-9/MediaObjects/41588_2018_311_MOESM3_ESM.xlsx",
                                    sheet = "Table S2",
                                    startRow=5,
                                    cols=1:13,
                                    munge = TRUE)
#### Run echolocatoR #### 
res <- echolocatoR::finemap_loci(
  fullSS_path = fullSS_path,
  topSNPs = topSNPs[1:2,], # test just the first 2 loci
  finemap_methods = c("ABF","FINEMAP","SUSIE"),
  dataset_name = "Jansen2019",
  fullSS_genome_build = "hg19",
  bp_distance = 1000,
  munged = TRUE) 
PhoebeGuo97 commented 1 year ago

After reinstalling, can you try the following example @PhoebeGuo97 ? This seems to work for me.

#### Munge sumstats ####
fullSS_path <- MungeSumstats::format_sumstats(path = "https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz",
                                              ref_genome = "GRCH37",
                                              tabix_index=FALSE,
                                              force_new=TRUE)
#### Prepare topSNPs dataframe ####
topSNPs <- echodata::import_topSNPs(topSS = "https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-018-0311-9/MediaObjects/41588_2018_311_MOESM3_ESM.xlsx",
                                    sheet = "Table S2",
                                    startRow=5,
                                    cols=1:13,
                                    munge = TRUE)
#### Run echolocatoR #### 
res <- echolocatoR::finemap_loci(
  fullSS_path = fullSS_path,
  topSNPs = topSNPs[1:2,], # test just the first 2 loci
  finemap_methods = c("ABF","FINEMAP","SUSIE"),
  dataset_name = "Jansen2019",
  fullSS_genome_build = "hg19",
  bp_distance = 1000,
  munged = TRUE) 

Hi @bschilder, thanks for the update. I can successfully run the example. However, if I add "LD_reference = "UKB" in fine map_loci(), I will get ModuleNotFoundError: No module named 'scipy' at Step 2 ▶▶▶ Extract Linkage Disequilibrium and other steps won't be finished. Could you please let me know how to fix it? Thanks!