RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
30 stars 11 forks source link

Column names in topSNPs do not agree with co #121

Closed AMCalejandro closed 1 year ago

AMCalejandro commented 1 year ago

1. Bug description

The column names from topSNPs is not inheriting what we specify through echodata::construct_colmap

When I pass echodata::construct_colmap, I would expect the column names specify in the list are used to figure out column names in topSNPs.

NOTE When I change the BP name in topSNPs BP -> POS, then finemap_loci() runs. In this case the input GWAS has the bp col name set as BP, and the topSNPs BP col name is set as POS

2. Reproducible example

Code


columnsnames = echodata::construct_colmap(munged= FALSE,
                                          CHR = "CHR", POS = "BP",
                                          SNP = "SNP", P = "P",
                                          Effect = "BETA", StdErr = "SE", 
                                          A1 = "A1", A2 = "A2",
                                          N = "N", MAF = "MAF")

finemap_loci(# GENERAL ARGUMENTS 
  topSNPs = topSNPs,
  results_dir = fullRS_path,
  loci = topSNPs$Locus,
  dataset_name = "LID_COX",
  dataset_type = "GWAS",  

  force_new_subset = TRUE,
  force_new_LD = TRUE,
  force_new_finemap = TRUE,
  remove_tmps = FALSE,

  finemap_methods = c("ABF","FINEMAP","SUSIE"),

  # Munge full sumstats first
  munged = FALSE,
  colmap = columnsnames,
  # SUMMARY STATS ARGUMENTS
  fullSS_path = newSS_name_colmap,
  fullSS_genome_build = "hg19",
  query_by ="tabix",

  bp_distance = 500000*2,
  min_MAF = 0.001, 
  trim_gene_limits = FALSE,
  case_control = FALSE,

  # FINE-MAPPING ARGUMENTS
  ## General
  n_causal = 5,
  credset_thresh = .95,
  consensus_thresh = 2,

  # LD ARGUMENTS 
  LD_reference = "1KGphase3",#"UKB",
  superpopulation = "EUR",
  download_method = "axel",
  LD_genome_build = "hg19",
  leadSNP_LD_block = FALSE,

  #### PLotting args ####
  plot_types = c("fancy"),
  show_plot = TRUE,
  zoom = c("1x", "10x", "20x"),
  tx_biotypes =  c("protein_coding"),
  nott_epigenome = FALSE,
  nott_show_placseq = FALSE,
  nott_binwidth = 200,
  nott_bigwig_dir = NULL,
  #xgr_libnames =c("ENCODE_TFBS_ClusteredV3_CellTypes", "TFBS_Conserved", "Uniform_TFBS"),

  #roadmap = TRUE,
  #roadmap_query = c("brain"),

  #### General args ####
  seed = 2022,
  nThread = 20,
  verbose = TRUE
)

Console output

┌─────────────────────────────────────────────────┐
│                                                 │
│   )))> 🦇 RP11-240A16.1 [locus 1 / 3] 🦇 <(((   │
│                                                 │
└─────────────────────────────────────────────────┘

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Step 1 ▶▶▶ Query 🔎 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ Query Method: tabix
Constructing GRanges query using min/max ranges within a single chromosome.
'start' or 'end' cannot contain NAsLocus RP11-240A16.1 complete in: 0 min

Data

> topSNPs
# A tibble: 3 × 7
  Locus         Gene            CHR       BP SNP                     P  BETA
  <chr>         <chr>         <dbl>    <dbl> <chr>               <dbl> <dbl>
1 RP11-240A16.1 RP11-240A16.1     4 32435284 rs189093213 0.00000000167  1.12

> str(data_colmaps)  # This is the input GWAS
'data.frame':   6530650 obs. of  11 variables:
 $ SNP : chr  "rs58276399" "rs142557973" "rs141242758" "rs2073813" ...
 $ CHR : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BP  : num  731718 731718 734349 753541 766007 ...
 $ A1  : chr  "t" "t" "t" "a" ...
 $ A2  : chr  "c" "c" "c" "g" ...
 $ FREQ: num  0.884 0.884 0.884 0.126 0.9 ...
 $ BETA: num  0.1775 0.1775 0.1577 0.0721 0.2559 ...
 $ SE  : num  0.158 0.158 0.159 0.118 0.164 ...
 $ P   : num  0.262 0.262 0.322 0.54 0.119 ...
 $ N   : int  1297 1297 1297 2687 1297 2687 2687 2687 2687 1297 ...
 $ MAF : num  0.1163 0.1163 0.1157 0.1257 0.0995 ...

> columnsnames
$munged
[1] FALSE

$CHR
[1] "CHR"

$POS
[1] "BP"

$SNP
[1] "SNP"

$P
[1] "P"

$Effect
[1] "BETA"

$StdErr
[1] "SE"

$tstat
[1] "tstat"

$Locus
[1] "Locus"

$Freq
[1] "Freq"

$MAF
[1] "MAF"

$A1
[1] "A1"

$A2
[1] "A2"

$Gene
[1] "Gene"

$N_cases
[1] "N_cases"

$N_controls
[1] "N_controls"

$proportion_cases
[1] "calculate"

$N
[1] "N"

$verbose
[1] TRUE

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

``` > sessionInfo() R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22 SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 BSgenome_1.65.2 rtracklayer_1.57.0 [5] Biostrings_2.65.3 XVector_0.37.1 GenomicRanges_1.49.1 GenomeInfoDb_1.33.5 [9] IRanges_2.31.2 S4Vectors_0.35.3 BiocGenerics_0.43.1 forcats_0.5.2 [13] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4 readr_2.1.2 [17] tidyr_1.2.0 tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2 [21] data.table_1.14.2 echolocatoR_2.0.1 loaded via a namespace (and not attached): [1] rappdirs_0.3.3 GGally_2.1.2 R.methodsS3_1.8.2 echoLD_0.99.7 bit64_4.0.5 [6] knitr_1.40 irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0 rpart_4.1.16 [11] KEGGREST_1.37.3 RCurl_1.98-1.8 AnnotationFilter_1.21.0 generics_0.1.3 GenomicFeatures_1.49.6 [16] RSQLite_2.2.16 proxy_0.4-27 bit_4.0.4 tzdb_0.3.0 xml2_1.3.3 [21] lubridate_1.8.0 SummarizedExperiment_1.27.2 assertthat_0.2.1 viridis_0.6.2 gargle_1.2.0 [26] xfun_0.32 hms_1.1.2 evaluate_0.16 fansi_1.0.3 restfulr_0.0.15 [31] progress_1.2.2 dbplyr_2.2.1 readxl_1.4.1 Rgraphviz_2.41.1 igraph_1.3.4 [36] DBI_1.1.3 htmlwidgets_1.5.4 reshape_0.8.9 downloadR_0.99.4 googledrive_2.0.0 [41] ellipsis_0.3.2 backports_1.4.1 biomaRt_2.53.2 deldir_1.0-6 MatrixGenerics_1.9.1 [46] MungeSumstats_1.5.13 vctrs_0.4.1 Biobase_2.57.1 ensembldb_2.21.4 cachem_1.0.6 [51] withr_2.5.0 checkmate_2.1.0 GenomicAlignments_1.33.1 prettyunits_1.1.1 cluster_2.1.3 [56] ape_5.6-2 dir.expiry_1.5.0 lazyeval_0.2.2 crayon_1.5.1 basilisk.utils_1.9.2 [61] crul_1.2.0 pkgconfig_2.0.3 nlme_3.1-159 ProtGenerics_1.29.0 XGR_1.1.8 [66] nnet_7.3-17 rlang_1.0.5 lifecycle_1.0.1 filelock_1.0.2 httpcode_0.3.0 [71] BiocFileCache_2.5.0 modelr_0.1.9 echotabix_0.99.8 dichromat_2.0-0.1 cellranger_1.1.0 [76] coloc_5.1.0 matrixStats_0.62.0 graph_1.75.0 Matrix_1.4-1 osfr_0.2.8 [81] boot_1.3-28 reprex_2.0.2 base64enc_0.1-3 googlesheets4_1.0.1 png_0.1-7 [86] viridisLite_0.4.1 rjson_0.2.21 rootSolve_1.8.2.3 bitops_1.0-7 R.oo_1.25.0 [91] ggnetwork_0.5.10 blob_1.2.3 mixsqp_0.3-43 echoplot_0.99.5 dnet_1.1.7 [96] jpeg_0.1-9 echodata_0.99.14 scales_1.2.1 memoise_2.0.1 magrittr_2.0.3 [101] plyr_1.8.7 hexbin_1.28.2 zlibbioc_1.43.0 compiler_4.2.0 echoconda_0.99.7 [106] BiocIO_1.7.1 RColorBrewer_1.1-3 catalogueR_1.0.0 Rsamtools_2.13.4 cli_3.3.0 [111] echoannot_0.99.7 patchwork_1.1.2 htmlTable_2.4.1 Formula_1.2-4 MASS_7.3-58.1 [116] tidyselect_1.1.2 stringi_1.7.8 yaml_2.3.5 supraHex_1.35.0 latticeExtra_0.6-30 [121] ggrepel_0.9.1 grid_4.2.0 VariantAnnotation_1.43.3 tools_4.2.0 lmom_2.9 [126] parallel_4.2.0 rstudioapi_0.14 foreign_0.8-82 piggyback_0.1.3 gridExtra_2.3 [131] gld_2.6.5 digest_0.6.29 snpStats_1.47.1 BiocManager_1.30.18 Rcpp_1.0.9 [136] broom_1.0.1 OrganismDbi_1.39.1 httr_1.4.4 AnnotationDbi_1.59.1 RCircos_1.2.2 [141] ggbio_1.45.0 biovizBase_1.45.0 colorspace_2.0-3 rvest_1.0.3 XML_3.99-0.10 [146] fs_1.5.2 reticulate_1.26 splines_4.2.0 RBGL_1.73.0 expm_0.999-6 [151] echofinemap_0.99.3 basilisk_1.9.2 Exact_3.1 sessioninfo_1.2.2 jsonlite_1.8.0 [156] susieR_0.12.27 R6_2.5.1 Hmisc_4.7-1 pillar_1.8.1 htmltools_0.5.3 [161] glue_1.6.2 fastmap_1.1.0 DT_0.24 BiocParallel_1.31.12 class_7.3-20 [166] codetools_0.2-18 mvtnorm_1.1-3 utf8_1.2.2 lattice_0.20-45 curl_4.3.2 [171] DescTools_0.99.46 zip_2.2.0 openxlsx_4.2.5 interp_1.1-3 survival_3.3-1 [176] rmarkdown_2.16 googleAuthR_2.0.0 munsell_0.5.0 e1071_1.7-11 GenomeInfoDbData_1.2.8 [181] haven_2.5.1 reshape2_1.4.4 gtable_0.3.1 ```
bschilder commented 1 year ago

@AMCalejandro can you share your topSNPs object? I'm wondering if some columns there might be causing this.

bschilder commented 1 year ago

Actually, looking at your post again I think i understand it better.

The column names from topSNPs is not inheriting what we specify through echodata::construct_colmap

This is actually not the intended behaviour of colmap, for the reason that topSNPs (often a supplementary table somewhere in the publication) and fullSS_path (the full dataset) very frequently have different column naming schemes. So I think it's best to keep munging these as separate steps.

Thus, I recommend using the dedicated function to munge your topSNPs before passing it into finemap_loci:

topSNPs <- echodata::import_topSNPs(...)