RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
30 stars 11 forks source link

Problem with "N" using echodata::construct_colmaps #113

Closed AMCalejandro closed 1 year ago

AMCalejandro commented 1 year ago

1. Bug description

I am struggling with the N number on a quant GWAS e in echolocatoR.

2. Reproducible example

Code to run echolocator with construct_colmaps

Note Apart from the coe you see below, I also tried

Defining colmaps

columnsnames = echodata::construct_colmap(munged= FALSE,
                                          CHR = "CHR", POS = "POS",
                                          SNP = "SNP", P = "P",
                                          Effect = "BETA", StdErr = "SE", 
                                          A1 = "A1", A2 = "A2", Freq = "FREQ",
                                          sample_size = NULL,
                                          N_cases = NULL, N_controls = NULL,
                                          proportion_cases = NULL,
                                          MAF = "calculate",
                                          tstat = NULL)

Running echolocatoR

finemap_loci(# GENERAL ARGUMENTS 
                                          topSNPs = topSNPs,
                                          results_dir = fullRS_path,
                                          loci = topSNPs$Locus,
                                          dataset_name = "LID_COX",
                                          dataset_type = "GWAS",  
                                          force_new_subset = TRUE,
                                          force_new_LD = FALSE,
                                          force_new_finemap = TRUE,
                                          remove_tmps = FALSE,

                                          finemap_methods = c("ABF","FINEMAP","SUSIE", "POLYFUN_SUSIE"),

                                          # Munge full sumstats first
                                          munged = FALSE,
                                          colmap = columnsnames,
                                          # SUMMARY STATS ARGUMENTS
                                          fullSS_path = newSS_name_colmap,
                                          fullSS_genome_build = "hg19",
                                          query_by ="tabix",

                                          compute_n = 3500,

                                          bp_distance = 10000,#500000*2,
                                          min_MAF = 0.001, 
                                          trim_gene_limits = FALSE,

                                          case_control = FALSE,

                                          # FINE-MAPPING ARGUMENTS
                                          ## General
                                          n_causal = 5,
                                          credset_thresh = .95,
                                          consensus_thresh = 2,

                                          # LD ARGUMENTS 
                                          LD_reference = "1KGphase3",#"UKB",
                                          superpopulation = "EUR",
                                          download_method = "axel",
                                          LD_genome_build = "hg19",
                                          leadSNP_LD_block = FALSE,

                                          #### PLotting args ####
                                          plot_types = c("simple"),
                                          show_plot = TRUE,
                                          zoom = "1x",
                                          tx_biotypes = NULL,
                                          nott_epigenome = FALSE,
                                          nott_show_placseq = FALSE,
                                          nott_binwidth = 200,
                                          nott_bigwig_dir = NULL,
                                          xgr_libnames = NULL,
                                          roadmap = FALSE,
                                          roadmap_query = NULL,

                                          #### General args ####
                                          seed = 2022,
                                          nThread = 20,
                                          verbose = TRUE
                                          )

Console output

┌────────────────────────────────────────┐
│                                        │
│   )))> 🦇 LRP8 [locus 3 / 3] 🦇 <(((   │
│                                        │
└────────────────────────────────────────┘

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Step 1 ▶▶▶ Query 🔎 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ Query Method: tabix
Constructing GRanges query using min/max ranges within a single chromosome.
query_dat is already a GRanges object. Returning directly.
========= echotabix::convert =========
Converting full summary stats file to tabix format for fast querying.
Inferred format: 'table'
Explicit format: 'table'
Inferring comment_char from tabular header: 'SNP'
Determining chrom type from file header.
Chromosome format: 1
Detecting column delimiter.
Identified column separator: \t
Sorting rows by coordinates via bash.
Searching for header row with grep.
( grep ^'SNP' .../QC_SNPs_COLMAP.txt; grep
    -v ^'SNP' .../QC_SNPs_COLMAP.txt | sort
    -k2,2n
    -k3,3n ) > .../file3b8c6b6c50de_sorted.tsv
Constructing outputs
Using existing bgzipped file: /home/rstudio/echolocatoR/echolocatoR_LID/QC_SNPs_COLMAP.txt.bgz 
Set force_new=TRUE to override this.
Tabix-indexing file using: Rsamtools
Data successfully converted to bgzip-compressed, tabix-indexed format.
========= echotabix::query =========
query_dat is already a GRanges object. Returning directly.
Inferred format: 'table'
Querying tabular tabix file using: Rsamtools.
Checking query chromosome style is correct.
Chromosome format: 1
Retrieving data.
Converting query results to data.table.
Processing query: 1:53768300-53788300
Adding 'query' column to results.
Retrieved data with 52 rows
Saving query ==> /home/rstudio/echolocatoR/echolocatoR_LID/RESULTS/GWAS/LID_COX/LRP8/LRP8_LID_COX_subset.tsv.gz
+ Query: 52 SNPs x 10 columns.
Standardizing summary statistics subset.
Standardizing main column names.
++ Preparing A1,A1 cols
++ Preparing MAF,Freq cols.
++ Could not infer MAF.
++ Preparing N_cases,N_controls cols.
++ Preparing proportion_cases col.
argument is of length zeroLocus LRP8 complete in: 0.29 min

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Step 6 ▶▶▶ Postprocess data 🎁 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Returning results as nested list.
All loci done in: 0.87 min
$`RP11-240A16.1`
NULL

$XYLT1
NULL

$LRP8
NULL

$merged_dat

Data

head(data_colmaps)
> str(data_colmaps)
'data.frame':   6517180 obs. of  10 variables:
 $ SNP : chr  "rs58276399" "rs141242758" "rs2073813" "rs61768174" ...
 $ CHR : num  1 1 1 1 1 1 1 1 1 1 ...
 $ POS : num  731718 734349 753541 766007 769223 ...
 $ A1  : chr  "t" "t" "a" "a" ...
 $ A2  : chr  "c" "c" "g" "c" ...
 $ FREQ: num  0.884 0.884 0.126 0.9 0.875 ...
 $ BETA: num  -0.1775 -0.1577 0.0721 -0.2559 -0.0772 ...
 $ SE  : num  0.158 0.159 0.118 0.164 0.118 ...
 $ P   : num  0.262 0.322 0.54 0.119 0.512 ...
 $ N   : int  1297 1297 2687 1297 2687 2687 2687 2687 1297 2687 ...

NOTE Please note in the input that the MAF is not being inferred from Freq column

3. Session info

``` > sessionInfo() R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22 SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 BSgenome_1.65.2 rtracklayer_1.57.0 [5] Biostrings_2.65.3 XVector_0.37.1 GenomicRanges_1.49.1 GenomeInfoDb_1.33.5 [9] IRanges_2.31.2 S4Vectors_0.35.3 BiocGenerics_0.43.1 forcats_0.5.2 [13] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4 readr_2.1.2 [17] tidyr_1.2.0 tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2 [21] data.table_1.14.2 echolocatoR_2.0.1 loaded via a namespace (and not attached): [1] rappdirs_0.3.3 GGally_2.1.2 R.methodsS3_1.8.2 [4] echoLD_0.99.7 bit64_4.0.5 knitr_1.40 [7] irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0 [10] rpart_4.1.16 KEGGREST_1.37.3 RCurl_1.98-1.8 [13] AnnotationFilter_1.21.0 generics_0.1.3 GenomicFeatures_1.49.6 [16] RSQLite_2.2.16 proxy_0.4-27 bit_4.0.4 [19] tzdb_0.3.0 xml2_1.3.3 lubridate_1.8.0 [22] SummarizedExperiment_1.27.2 assertthat_0.2.1 viridis_0.6.2 [25] gargle_1.2.0 xfun_0.32 hms_1.1.2 [28] fansi_1.0.3 restfulr_0.0.15 progress_1.2.2 [31] dbplyr_2.2.1 readxl_1.4.1 Rgraphviz_2.41.1 [34] igraph_1.3.4 DBI_1.1.3 htmlwidgets_1.5.4 [37] reshape_0.8.9 downloadR_0.99.4 googledrive_2.0.0 [40] ellipsis_0.3.2 backports_1.4.1 biomaRt_2.53.2 [43] deldir_1.0-6 MatrixGenerics_1.9.1 MungeSumstats_1.5.13 [46] vctrs_0.4.1 Biobase_2.57.1 ensembldb_2.21.4 [49] cachem_1.0.6 withr_2.5.0 checkmate_2.1.0 [52] GenomicAlignments_1.33.1 prettyunits_1.1.1 cluster_2.1.3 [55] ape_5.6-2 dir.expiry_1.5.0 lazyeval_0.2.2 [58] crayon_1.5.1 basilisk.utils_1.9.2 crul_1.2.0 [61] pkgconfig_2.0.3 nlme_3.1-159 ProtGenerics_1.29.0 [64] XGR_1.1.8 nnet_7.3-17 rlang_1.0.5 [67] lifecycle_1.0.1 filelock_1.0.2 httpcode_0.3.0 [70] BiocFileCache_2.5.0 modelr_0.1.9 echotabix_0.99.8 [73] dichromat_2.0-0.1 cellranger_1.1.0 coloc_5.1.0 [76] matrixStats_0.62.0 graph_1.75.0 Matrix_1.4-1 [79] osfr_0.2.8 boot_1.3-28 reprex_2.0.2 [82] base64enc_0.1-3 googlesheets4_1.0.1 png_0.1-7 [85] viridisLite_0.4.1 rjson_0.2.21 rootSolve_1.8.2.3 [88] bitops_1.0-7 R.oo_1.25.0 ggnetwork_0.5.10 [91] blob_1.2.3 mixsqp_0.3-43 echoplot_0.99.5 [94] dnet_1.1.7 jpeg_0.1-9 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [97] echodata_0.99.12 scales_1.2.1 memoise_2.0.1 [100] magrittr_2.0.3 plyr_1.8.7 hexbin_1.28.2 [103] zlibbioc_1.43.0 compiler_4.2.0 echoconda_0.99.7 [106] BiocIO_1.7.1 RColorBrewer_1.1-3 catalogueR_1.0.0 [109] Rsamtools_2.13.4 cli_3.3.0 echoannot_0.99.7 [112] patchwork_1.1.2 htmlTable_2.4.1 Formula_1.2-4 [115] MASS_7.3-58.1 tidyselect_1.1.2 stringi_1.7.8 [118] yaml_2.3.5 supraHex_1.35.0 latticeExtra_0.6-30 [121] ggrepel_0.9.1 grid_4.2.0 VariantAnnotation_1.43.3 [124] tools_4.2.0 lmom_2.9 parallel_4.2.0 [127] rstudioapi_0.14 foreign_0.8-82 piggyback_0.1.3 [130] gridExtra_2.3 gld_2.6.5 digest_0.6.29 [133] snpStats_1.47.1 BiocManager_1.30.18 Rcpp_1.0.9 [136] broom_1.0.1 OrganismDbi_1.39.1 httr_1.4.4 [139] AnnotationDbi_1.59.1 RCircos_1.2.2 ggbio_1.45.0 [142] biovizBase_1.45.0 colorspace_2.0-3 rvest_1.0.3 [145] XML_3.99-0.10 fs_1.5.2 reticulate_1.26 [148] splines_4.2.0 RBGL_1.73.0 expm_0.999-6 [151] echofinemap_0.99.3 basilisk_1.9.2 Exact_3.1 [154] jsonlite_1.8.0 susieR_0.12.27 R6_2.5.1 [157] Hmisc_4.7-1 pillar_1.8.1 htmltools_0.5.3 [160] glue_1.6.2 fastmap_1.1.0 DT_0.24 [163] BiocParallel_1.31.12 class_7.3-20 codetools_0.2-18 [166] mvtnorm_1.1-3 utf8_1.2.2 lattice_0.20-45 [169] curl_4.3.2 DescTools_0.99.46 zip_2.2.0 [172] openxlsx_4.2.5 interp_1.1-3 survival_3.3-1 [175] googleAuthR_2.0.0 munsell_0.5.0 e1071_1.7-11 [178] GenomeInfoDbData_1.2.8 haven_2.5.1 reshape2_1.4.4 [181] gtable_0.3.1 ```
bschilder commented 1 year ago

This relates to #114 I've now updated construct_colmap to take the arg N=, which is the column name mapping tho the total sample size column.

bschilder commented 1 year ago

@AMCalejandro since your N is stored within each row of your full summary stats, I think all you need to do is supply "N" to the N arg in construct_colmap (which also happens to be the default):

columnsnames = echodata::construct_colmap(munged= FALSE,
                                          CHR = "CHR", POS = "POS",
                                          SNP = "SNP", P = "P",
                                          Effect = "BETA", StdErr = "SE", 
                                          A1 = "A1", A2 = "A2", Freq = "FREQ",
                                          N = "N",
                                          N_cases = NULL, N_controls = NULL,
                                          proportion_cases = NULL,
                                          MAF = "calculate",
                                          tstat = NULL)

Then you can just leave the default arg for compute_n:

finemap_loci(..., compute_n="ldsc", ...)

Because "N" is already a column in your full sumstats, echolocatoR won't try to impute N using MungeSumstats. Thus the compute_n= arg will be ignored anyway.