RajLabMSSM / echofinemap

echoverse module: Statistical and functional fine-mapping functions.
2 stars 1 forks source link

FINEMAP credible sets not included in the results #7

Open jdblischak opened 3 years ago

jdblischak commented 3 years ago

While exploring the results in the echolocatoR Shiny app, I noticed that FINEMAP never returns more than 1 credible set per locus. This was confusing since SuSiE often returns more than 1 credible set.

I started diving into the code, and I think the issue is that it only looks for data.cred, when FINEMAP returns data.cred# where # is the number of credible sets. PolyFun obtains the credible sets by searching backwards from the maximum number of causal SNPs until it finds a .cred file (source):

        #add causal set info
        df_finemap['CREDIBLE_SET'] = 0
        cred_file = None
        for m in range(num_causal_snps, 0, -1):
            if os.path.exists(cred_filename+str(m)):
                cred_file = cred_filename+str(m)
                break
        if cred_file is None:
            raise IOError('cred file not found')
        df_cred = pd.read_table(cred_file, sep=' ', usecols=(lambda c: c.startswith('cred')), comment='#')
        df_finemap.set_index('SNP', inplace=True, drop=False)
        for c_i, c in enumerate(df_cred.columns):
            df_finemap.loc[df_cred[c].dropna(), 'CREDIBLE_SET'] = c_i+1
        df_finemap.reset_index(inplace=True, drop=True)

echolocatoR only checks for data.cred:

https://github.com/RajLabMSSM/echolocatoR/blob/b055ac0fb74c914d7600e3afc650a2ffd7149396/R/FINEMAP.R#L336

And then when it is extracting the PIPs from the .snp file, it assigns any SNP that meets the PIP threshold to CS 1.

https://github.com/RajLabMSSM/echolocatoR/blob/b055ac0fb74c914d7600e3afc650a2ffd7149396/R/FINEMAP.R#L363-L364

I attempted to put together a minimal, reproducible example to demonstrate this behavior. However, using the latest version on master, FINEMAP is currently returning NA for both the .CS and the .PP columns.

See here for minimal, reproducible example ```R library(echolocatoR) stopifnot(packageVersion("echolocatoR") == "0.1.2") data("Nalls_top_SNPs") top_SNPs <- import_topSNPs( topSS = Nalls_top_SNPs, position_col = "BP", pval_col = "P, all studies", effect_col = "Beta, all studies", gene_col = "Nearest Gene", locus_col = "Nearest Gene", remove_variants = "rs34637584" ) fullSS_path <- example_fullSS() Nalls23andMe_2019.results <- finemap_loci( top_SNPs = top_SNPs, results_dir = file.path(getwd(), "results"), loci = "BST1", dataset_name = "Nalls23andMe_2019", remove_tmps = FALSE, fullSS_path = fullSS_path, query_by = "tabix", snp_col = "RSID", pval_col = "p", effect_col = "beta", stderr_col = "se", freq_col = "freq", MAF_col = "calculate", bp_distance = 10000, min_MAF = 0.001, finemap_methods = c("FINEMAP", "SUSIE"), LD_reference = "UKB", download_method = "axel", plot.types = c() ) Nalls23andMe_2019.results[SUSIE.CS > 0, list(SNP, FINEMAP.CS, FINEMAP.PP, SUSIE.CS, SUSIE.PP)] readLines("results/GWAS/Nalls23andMe_2019/BST1/FINEMAP/data.snp", n = 3) readLines("results/GWAS/Nalls23andMe_2019/BST1/FINEMAP/data.cred5") sessionInfo() ```
See here for the results ```R > source("reprex.R", echo = TRUE) > library(echolocatoR) Registered S3 method overwritten by 'GGally': method from +.gg ggplot2 Possible Ensembl SSL connectivity problems detected. Please see the 'Connection Troubleshooting' section of the biomaRt vignette vignette('accessing_ensembl', package = 'biomaRt')Error in curl::curl_fetch_memory(url, handle = handle) : SSL certificate problem: self signed certificate in certificate chain Bioconductor version 3.12 (BiocManager 1.30.12), ?BiocManager::install for help > stopifnot(packageVersion("echolocatoR") == "0.1.2") > data("Nalls_top_SNPs") > top_SNPs <- import_topSNPs( + topSS = Nalls_top_SNPs, + position_col = "BP", + pval_col = "P, all studies", + effect_col = "Beta, all studie ..." ... [TRUNCATED] [1] "+ Assigning gene_col and locus_col independently" > fullSS_path <- example_fullSS() > Nalls23andMe_2019.results <- finemap_loci( + top_SNPs = top_SNPs, + results_dir = file.path(getwd(), "results"), + loci = "BST1", + dataset_ .... [TRUNCATED] [1] "+ CONDA:: Activating conda env 'echoR'" [1] "Checking for tabix installation..." [1] "Checking for bcftools installation..." ) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( ( BST1 (1 / 1) ) ) ) ))))))}}}}}}}} {{{{{{{{{(((((( ( ( ( [1] "+ Extracting relevant variants from fullSS..." [1] "+ Query Method: tabix" [1] "+ QUERY: Chromosome = 4 ; Min position = 15727348 ; Max position = 15747348" [1] "TABIX:: Converting full summary stats file to tabix format for fast querying..." [1] "+ CONDA:: Identified bgzip executable in echoR env." [1] "( grep 'CHR' ./Nalls23andMe_2019.fullSS_subset.tsv; grep -v ^'CHR' ./Nalls23andMe_2019.fullSS_subset.tsv | sort -k1,1 -k2,2n ) | ~/mambaforge/envs/echoR/bin/bgzip -f > ~/echolocatoR/results/GWAS/Nalls23andMe_2019/Nalls23andMe_2019.fullSS_subset.tsv.gz" [1] "+ CONDA:: Identified tabix executable in echoR env." [1] "TABIX:: Indexing" [1] "~/mambaforge/envs/echoR/bin/tabix -f -S 1 -s 1 -b 2 -e 2 ~/echolocatoR/results/GWAS/Nalls23andMe_2019/Nalls23andMe_2019.fullSS_subset.tsv.gz" [1] "Determining chrom type from file header" [1] "Chromosome format = 1" [1] "+ CONDA:: Identified tabix executable in echoR env." [1] "TABIX:: Extracting subset of sum stats" [1] "+ TABIX:: ~/mambaforge/envs/echoR/bin/tabix -h ~/echolocatoR/results/GWAS/Nalls23andMe_2019/Nalls23andMe_2019.fullSS_subset.tsv.gz 4:15727348-15747348" [1] "+ TABIX:: Returning 115 x 11 data.table" [1] "++ Saving query ==> ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/BST1_Nalls23andMe_2019_subset.tsv.gz" [1] "LD:: Standardizing summary statistics subset." [1] "++ Preparing Gene col" [1] "++ Preparing A1,A1 cols" [1] "++ Preparing MAF,Freq cols" [1] "++ Inferring MAF from frequency column..." [1] "++ Preparing N_cases,N_controls cols" [1] "++ Preparing `proportion_cases` col" [1] "++ Calculating `proportion_cases`." [1] "++ Preparing N col" [1] "+ Preparing sample_size (N) column" [1] "++ Computing effective sample size." [1] "++ Preparing t-stat col" [1] "+ Calculating t-statistic from Effect and StdErr..." [1] "++ Assigning lead SNP" [1] "++ Ensuring Effect, StdErr, P are numeric" [1] "++ Ensuring 1 SNP per row" [1] "++ Removing extra whitespace" [1] "++ Saving subset ==> ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/BST1_Nalls23andMe_2019_subset.tsv.gz" [1] "+ Extraction completed in 6.72 seconds" [1] "+ 115 SNPs x 16 columns" [1] "LD:: Using UK Biobank LD reference panel." [1] "+ UKB LD file name: chr4_15000001_18000001" [1] "+ LD:: Downloading full .gz/.npz UKB files and saving to disk." [1] "+ CONDA:: Identified axel executable in echoR env." [1] "+ CONDA:: Identified axel executable in echoR env." [1] "+ LD:: load_ld() python function input: ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_15000001_18000001" [1] "+ LD:: Reading LD matrix into memory. This could take some time..." ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_15000001_18000001.gz ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_15000001_18000001.npz Processed URL: ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/LD/chr4_15000001_18000001 Some other message at the end [1] "+ Full UKB LD matrix: 20815 x 20815" [1] "+ Full UKB LD SNP data.table: 20815 x 5" [1] "+ LD:: Saving LD matrix ==> ~/echolocatoR/results/GWAS/Nalls23andMe_2019/BST1/LD/BST1.UKB_LD.RDS" [1] "115 x 115 LD_matrix (sparse)" [1] "+ FILTER:: Filtering by LD features." [1] "FILTER:: Filtering by SNP features." [1] "+ FILTER:: Removing SNPs with MAF < 0.001" [1] "+ FILTER:: Post-filtered data: 115 x 16" vvvvv-- FINEMAP --vvvvv ✅ All required columns present. ✅ All suggested columns present. vvvvv-- SUSIE --vvvvv ✅ All required columns present. ✅ All suggested columns present. [1] "++ Fine-mapping using multiple tools: FINEMAP, SUSIE" +++ Multi-finemap:: FINEMAP +++ [1] "++ FINEMAP:: Constructing master file." [1] "++ FINEMAP:: Constructing data.z file." [1] "++ FINEMAP:: Constructing data.ld file." [1] "+ Using FINEMAP v1.4" |--------------------------------------| | Welcome to FINEMAP v1.4 | | | | (c) 2015-2020 University of Helsinki | | | | Help : | | - ./finemap --help | | - www.finemap.me | | - www.christianbenner.com | | | | Contact : | | - christian.benner@helsinki.fi | | - matti.pirinen@helsinki.fi | |--------------------------------------| -------- SETTINGS -------- - dataset : all - corr-config : 0.95 - n-causal-snps : 5 - n-configs-top : 50000 - n-conv-sss : 100 - n-iter : 100000 - n-threads : 1 - prior-k0 : 0 - prior-std : 0.05 - prob-conv-sss-tol : 0.001 - prob-cred-set : 0.95 ------------ FINE-MAPPING (1/1) ------------ - GWAS summary stats : FINEMAP/data.z - SNP correlations : FINEMAP/data.ld - Causal SNP stats : FINEMAP/data.snp - Causal configurations : FINEMAP/data.config - Credible sets : FINEMAP/data.cred - Log file : FINEMAP/data.log_sss - Reading input : done! - Number of GWAS samples : 216621 - Number of SNPs : 115 - Prior-Pr(# of causal SNPs is k) : (0 -> 0) 1 -> 0.584 2 -> 0.292 3 -> 0.0964 4 -> 0.0237 5 -> 0.00461 - 1616 configurations evaluated (0.104/100%) : converged after 104 iterations - Computing causal SNP statistics : done! - Regional SNP heritability : 0.00763 (SD: 0.000422 ; 95% CI: [0.00679,0.0085]) - Log10-BF of >= one causal SNP : 2.28e+04 - Post-expected # of causal SNPs : 5 - Post-Pr(# of causal SNPs is k) : (0 -> 0) 1 -> 0 2 -> 0 3 -> 0 4 -> 2.87e-262 5 -> 1 - Computing credible sets : done! - Writing output : done! - Run time : 0 hours, 0 minutes, 0 seconds [1] ".cred not detected. Using .snp instead." [1] "+ FINEMAP:: Importing prob (.snp)..." Error in `[.data.table`(x, r, vars, with = FALSE) : column(s) not found: prob [1] "++ Credible Set SNPs identified = 0" [1] "++ Merging FINEMAP results with multi-finemap data." +++ Multi-finemap:: SUSIE +++ [1] "+ Preparing sample_size (N) column" [1] "+ `N` column already present in data." [1] "+ SUSIE:: sample_size= 216621" [1] "+ SUSIE:: max_causal = 5" [1] "+ Subsetting LD matrix and finemap_dat to common SNPs..." [1] "+ LD:: Removing unnamed rows/cols" [1] "+ LD:: Replacing NAs with 0" [1] "+ LD_matrix = 115 SNPs." [1] "+ finemap_dat = 115 SNPs." [1] "+ 115 SNPs in common." [1] "+ SUSIE:: Using `susie_suff_stat()` from susieR v0.10.1" [1] "+ SUSIE:: Extracting Credible Sets..." [1] "++ Credible Set SNPs identified = 3" [1] "++ Merging SUSIE results with multi-finemap data." [1] "+ Identifying Consensus SNPs..." [1] "++ support_thresh = 2" [1] "++ top_CS_only=FALSE" [1] "+ Calculating mean Posterior Probability (mean.PP)..." [1] "+ Replacing PP==NA with 0" [1] "++ 2 fine-mapping methods used." [1] "++ 3 Credible Set SNPs identified." [1] "++ 0 Consensus SNPs identified." [1] "+ Fine-mapping with ' FINEMAP, SUSIE ' completed:" Time difference of 3.1 secs Fine-mapping complete in: Time difference of 1.7 mins [1] "+ Identifying Consensus SNPs..." [1] "++ support_thresh = 2" [1] "++ top_CS_only=FALSE" [1] "+ Calculating mean Posterior Probability (mean.PP)..." [1] "+ Replacing PP==NA with 0" [1] "++ 2 fine-mapping methods used." [1] "++ 3 Credible Set SNPs identified." [1] "++ 0 Consensus SNPs identified." > Nalls23andMe_2019.results[SUSIE.CS > 0, list(SNP, FINEMAP.CS, FINEMAP.PP, SUSIE.CS, SUSIE.PP)] SNP FINEMAP.CS FINEMAP.PP SUSIE.CS SUSIE.PP 1: rs34559912 NA NA 3 1.0000000 2: rs4389574 NA NA 1 1.0000000 3: rs6852450 NA NA 2 0.9999992 > readLines("results/GWAS/Nalls23andMe_2019/BST1/FINEMAP/data.snp", n = 3) [1] "index rsid chromosome position allele1 allele2 maf beta se z prob log10bf mean sd mean_incl sd_incl" [2] "1 rs6828144 4 15727389 T C 0.0263 0.1765 0.0309 5.71197 1 11.8522 1.40649 0.0355444 1.40649 0.0355444" [3] "111 rs11947310 4 15744576 A C 0.1909 0.0823 0.012 6.85833 1 11.8522 1.40649 0.0355444 1.40649 0.0355444" > readLines("results/GWAS/Nalls23andMe_2019/BST1/FINEMAP/data.cred5") [1] "# Post-Pr(# of causal SNPs is 5) = 1" [2] "#log10bf 22448.9 NA 22432.1 NA 20843.6 NA 556.379 NA 264.033 NA" [3] "#min(|ld|) 1 NA 1 NA 1 NA 1 NA 1 NA" [4] "#mean(|ld|) 1 NA 1 NA 1 NA 1 NA 1 NA" [5] "#median(|ld|) 1 NA 1 NA 1 NA 1 NA 1 NA" [6] "index cred1 prob1 cred2 prob2 cred3 prob3 cred4 prob4 cred5 prob5" [7] "1 rs11947310 1 rs11933202 1 rs1807250 1 rs6828144 1 rs73123615 0.999999" > sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: ~/mambaforge/envs/echoR/lib/libopenblasp-r0.3.12.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] echolocatoR_0.1.2 loaded via a namespace (and not attached): [1] colorspace_2.0-0 ellipsis_0.3.1 [3] class_7.3-18 biovizBase_1.38.0 [5] htmlTable_2.1.0 XVector_0.30.0 [7] GenomicRanges_1.42.0 base64enc_0.1-3 [9] gld_2.6.2 dichromat_2.0-0 [11] rstudioapi_0.13 proxy_0.4-25 [13] DT_0.17 bit64_4.0.5 [15] AnnotationDbi_1.52.0 fansi_0.4.2 [17] mvtnorm_1.1-1 xml2_1.3.2 [19] R.methodsS3_1.8.1 splines_4.0.3 [21] ggbio_1.38.0 cachem_1.0.4 [23] rootSolve_1.8.2.1 knitr_1.31 [25] jsonlite_1.7.2 Formula_1.2-4 [27] Rsamtools_2.6.0 dbplyr_2.1.1 [29] cluster_2.1.1 R.oo_1.24.0 [31] png_0.1-7 graph_1.68.0 [33] BiocManager_1.30.12 compiler_4.0.3 [35] httr_1.4.2 backports_1.2.1 [37] assertthat_0.2.1 Matrix_1.3-2 [39] fastmap_1.1.0 lazyeval_0.2.2 [41] htmltools_0.5.1.1 prettyunits_1.1.1 [43] tools_4.0.3 gtable_0.3.0 [45] glue_1.4.2 lmom_2.8 [47] GenomeInfoDbData_1.2.4 reshape2_1.4.4 [49] dplyr_1.0.5 rappdirs_0.3.3 [51] Rcpp_1.0.6 Biobase_2.50.0 [53] vctrs_0.3.7 Biostrings_2.58.0 [55] rtracklayer_1.50.0 crosstalk_1.1.1 [57] xfun_0.20 stringr_1.4.0 [59] lifecycle_1.0.0 ensembldb_2.14.0 [61] XML_3.99-0.6 zlibbioc_1.36.0 [63] MASS_7.3-53.1 scales_1.1.1 [65] BSgenome_1.58.0 VariantAnnotation_1.36.0 [67] ProtGenerics_1.22.0 hms_1.0.0 [69] MatrixGenerics_1.2.1 RBGL_1.66.0 [71] parallel_4.0.3 SummarizedExperiment_1.20.0 [73] expm_0.999-6 susieR_0.10.1 [75] AnnotationFilter_1.14.0 RColorBrewer_1.1-2 [77] curl_4.3 Exact_2.1 [79] reticulate_1.18 memoise_2.0.0 [81] gridExtra_2.3 ggplot2_3.3.3 [83] biomaRt_2.46.3 rpart_4.1-15 [85] reshape_0.8.8 latticeExtra_0.6-29 [87] stringi_1.5.3 RSQLite_2.2.5 [89] S4Vectors_0.28.1 e1071_1.7-6 [91] checkmate_2.0.0 GenomicFeatures_1.42.2 [93] BiocGenerics_0.36.0 boot_1.3-27 [95] BiocParallel_1.24.1 GenomeInfoDb_1.26.4 [97] rlang_0.4.10 pkgconfig_2.0.3 [99] matrixStats_0.58.0 bitops_1.0-6 [101] lattice_0.20-41 purrr_0.3.4 [103] GenomicAlignments_1.26.0 htmlwidgets_1.5.3 [105] bit_4.0.4 tidyselect_1.1.0 [107] GGally_2.1.1 plyr_1.8.6 [109] magrittr_2.0.1 R6_2.5.0 [111] IRanges_2.24.1 DescTools_0.99.40 [113] generics_0.1.0 Hmisc_4.5-0 [115] DelayedArray_0.16.3 DBI_1.1.1 [117] pillar_1.5.1 foreign_0.8-81 [119] survival_3.2-10 RCurl_1.98-1.3 [121] nnet_7.3-15 tibble_3.1.0 [123] crayon_1.4.1 utf8_1.2.1 [125] OrganismDbi_1.32.0 BiocFileCache_1.14.0 [127] jpeg_0.1-8.1 progress_1.2.2 [129] grid_4.0.3 data.table_1.14.0 [131] blob_1.2.1 digest_0.6.27 [133] R.utils_2.10.1 openssl_1.4.3 [135] stats4_4.0.3 munsell_0.5.0 [137] askpass_1.1 Warning message: In susie_func(bhat = subset_DT$Effect, shat = subset_DT$StdErr, : IBSS algorithm did not converge in 100 iterations! ```
bschilder commented 3 years ago

Hi @jdblischak, thanks for reporting this in such detail!

Here are my thoughts currently, any suggestions are welcome.

Credible Sets

Currently, I only record the top credible set in the multi-finemap file bc I was trying to avoid assigning multiple PPs to each SNP to keep FINEMAP.PP as a single column (all my other echolocatoR code is built around this structure). So basically, if SNPs are in the top CS they're assigned FINEMAP.CS=1 else 0. Perhaps there's a better way to do this that preserves this information without the user having to read in multiple .cred files themselves after running finemap_loci().

I can think of two strategies currently:

  1. Assign each FINEMAP CS to its own column (e.g. FINEMAP1.PP, FINEMAP1.CS, FINEMAP2.PP, FINEMAP2.CS, FINEMAP3.PP, FINEMAP3.CS, etc.) and revise the rest of echolocatoR's code to account for this change.
  2. Continue using only one col for each: FINEMAP.PP and FINEMAP.PP. When a SNP occurs in multiple CS, only record its FINEMAP.PP within the first CS it appears and assign its FINEMAP.CS to that first CS. This will work with the rest of the code but isn't exactly a faithful representation of the FINEMAP results.

FINEMAP version differences

zstandard

jdblischak commented 3 years ago

@bschilder Thanks for the detailed follow-up. My thoughts below:

So basically, if SNPs are in the top CS they're assigned FINEMAP.CS=1 else 0.

I didn't realize that FINEMAP.CS was only supposed to include the top CS. This is confusing when comparing to, e.g. the results from SuSiE. When the echolocatoR results show multiple CS from SuSiE but a single one from FINEMAP, this indicates a difference in behavior. But I think that is just due to echolocatoR, not a difference in the methods.

Also, I don't think that is what the code does, at least not for FINEMAP 1.4. As far as I can tell, the CS info from FINEAMP 1.4 is not consulted at all. Instead the algorithm appears to be if PIP > threshold, assign FINEMAP.CS = 1 else NA. Thus every SNP with a large PIP gets assigned to CS=1, even if they were assigned to separate CS.

Assign each FINEMAP CS to its own column (e.g. FINEMAP1.PP, FINEMAP1.CS, FINEMAP2.PP, FINEMAP2.CS, FINEMAP3.PP, FINEMAP3.CS, etc.) and revise the rest of echolocatoR's code to account for this change.

I don't understand this. The PIP for a SNP is unaffected by the CS it is assigned. My recommendation would be to obtain the information the same as the PolyFun authors do, unless you have a compelling reason that is incorrect:

  1. Obtain FINEMAP.PP from the column "prob" in the.snp output file
  2. Obtain FINEMAP.CS from the .cred# output file, where # is the largest one available

I'm happy to help with this. And my apologies if I am missing something obvious.

I may have to to email the original author since there doesn't seem to be any GitHub or issues forum for FINEMAP

You should absolutely email the author for help troubleshooting the installation on macOS. There is no public forum for discussing these issues, and the code is closed source, so we can't build a conda package to ease installation.

bschilder commented 3 years ago

Thanks for the feedback @jdblischak!

This does get rather confusing so I may have not have explained some things very well. I've tried a number of different ways about this so it's entirely possible I could be mixing up strategies from different version of echolocatoR and FINEMAP (apologies if this is the case).

Credible Sets and Posterior Probabilities

One of the main decision points I've gone back and forth on is whether to use .snp (marginal probabilities across all SNPs) vs .cred (conditional probabilities only for SNPs within a specific credible set) when constructing the final multi-finemap results file.

This is all a bit tricky to explain in words so I'm going to push some of these proposed changes to the dev branch. If you don't mind, perhaps you could then look these over and see if they seem sensible to you?

FINEMAP support

I tried reaching out to the FINEMAP author a couple weeks ago via email but haven't heard back yet (christian.benner@helsinki.fi and cc'ed finemap@christianbenner.com). Not sure if he still uses these emails but I can't seem to find any others. Also tried reaching out via LinkedIn. I'll keep you posted if I have any luck.

Thanks again for your help! -Brian

jdblischak commented 3 years ago

I think your suggestion makes quite a bit of sense to me. But I've also received some feedback from an anonymous reviewer that seems to have pretty intimate knowledge of FINEMAP who was adamant about using the conditional probabilities from .cred (I was previously using marg probs from .snp).

This doesn't make sense to me. If the goal is to produce results that can be compared across the different fine-mapping methods, I don't see why the conditional probabilities should be reported. For susieR you are reporting the PIPs in SUSIE.PP.

https://github.com/RajLabMSSM/echolocatoR/blob/b055ac0fb74c914d7600e3afc650a2ffd7149396/R/SUSIE.R#L194-L195

From your documentation:

https://github.com/RajLabMSSM/echolocatoR/blob/6b9086d6755be0cc802027b76fb8314058fa0bf1/README.Rmd#L311

From the FINEMAP documentation for data.snp (emphasis mine):

prob column contains the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l th SNP is the posterior probability that this SNP is causal.

From the FINEMAP documentation for data.cred:

The l th credible set shows the best candidate SNPs and their posterior probability of being in a k-SNP causal configuration that additionally contains k - 1 SNPs.

If you want to report the posterior probability that a given SNP is causal, I think you should report prob from data.snp.

In cases where there are only 1 SNP per CS, I don't think it matters whether you report the PIP from data.snp or the PP from data.cred#. The PP will be 1, and the PIP will be 1 or very close to it. However, when there are multiple SNPs in a CS, the PIP can be different from the PP.

As an example, here's code to run the example shipped with FINEMAP:

mkdir /tmp/fmexample
cd /tmp/fmexample
wget http://christianbenner.com/finemap_v1.4_x86_64.tgz
tar xf finemap_v1.4_x86_64.tgz
cd finemap_v1.4_x86_64/
./finemap_v1.4_x86_64 --sss --in-files example/master -dataset 1

Taking a quick look at the results in R, PIP and PP are not identical (though they are correlated). The most dramatic difference is rs6. It's PIP is only 0.04. However, it's PP for being included in cred3 is 0.17.

library(data.table)
snp <- fread("example/data.snp")

snp[prob > 0.009, .(rsid, prob)]
#    rsid      prob
# 1:  rs7 1.0000000
# 2: rs30 0.9999690
# 3:  rs6 0.0417854
# 4: rs31 0.0107216
# 5:  rs8 0.0105046
# 6: rs48 0.0100717

cred2 <- fread("example/data.cred2", skip = 5)
cred2
#    index cred1 prob1 cred2 prob2
# 1:     1   rs7     1  rs30     1

cred3 <- fread("example/data.cred3", skip = 5)
cred3[prob3 > 0.02, ]
#     index cred1 prob1 cred2 prob2 cred3     prob3
#  1:     1   rs7     1  rs30     1   rs6 0.1733260
#  2:     2  <NA>    NA  <NA>    NA  rs31 0.0433775
#  3:     3  <NA>    NA  <NA>    NA   rs8 0.0424081
#  4:     4  <NA>    NA  <NA>    NA  rs48 0.0405975
#  5:     5  <NA>    NA  <NA>    NA  rs46 0.0361650
#  6:     6  <NA>    NA  <NA>    NA  rs14 0.0281224
#  7:     7  <NA>    NA  <NA>    NA  rs10 0.0253583
#  8:     8  <NA>    NA  <NA>    NA  rs11 0.0248557
#  9:     9  <NA>    NA  <NA>    NA  rs19 0.0246048
# 10:    10  <NA>    NA  <NA>    NA  rs32 0.0242520
# 11:    11  <NA>    NA  <NA>    NA  rs20 0.0241354
# 12:    12  <NA>    NA  <NA>    NA  rs17 0.0237760
# 13:    13  <NA>    NA  <NA>    NA  rs13 0.0233587

For FINEMAP v1.4, The reviewer suggested that I use the .cred# file that shows largest posterior probability for the number of causal variants (line 1 of the file) and extract credible sets and SNP-wise conditional probabilities from that file.

This is a reasonable suggestion, but may not be worth the effort. If you report the highest number of credible sets, the user can always apply a PIP threshold to remove the SNPs that are less convincing (and thus remove the less convincing credible sets as well). The earlier credible sets should be unchanged.

The FINEMAP example above illustrates this. The PP that there are only 2 causal SNPs is much higher than the PP for 3 causal SNPs. However, if you were to report the results from data.cred3, and the user applied even a minimal PIP cutoff per SNP (e.g. >= 0.05), the final results would be identical.

% head -n 1 example/data.cred2
# Post-Pr(# of causal SNPs is 2) = 0.773769

% head -n 1 example/data.cred3
# Post-Pr(# of causal SNPs is 3) = 0.202562

This is all a bit tricky to explain in words so I'm going to push some of these proposed changes to the dev branch. If you don't mind, perhaps you could then look these over and see if they seem sensible to you?

Looking at b351c10b4cb25c21446abb3791a8cb0bb4084eda, I think it's good you now detect cred files other than data.cred.

https://github.com/RajLabMSSM/echolocatoR/blob/6b9086d6755be0cc802027b76fb8314058fa0bf1/R/FINEMAP.R#L337-L340

Though for the reason I stated above (and for consistency with PolyFun), I think you should select the cred file with the largest number of configurations, not the fewest, i.e. cred_path <- cred_path[length(cred_path)]

https://github.com/RajLabMSSM/echolocatoR/blob/6b9086d6755be0cc802027b76fb8314058fa0bf1/R/FINEMAP.R#L380-L382

bschilder commented 2 years ago

echolocatoR (via its subpackage echofinemap) now handles FINEMAP outputs differently, by including multiple columns which are now described in ?echofinemap::FINEMAP. Do let me know if you think any of these descriptions are lacking or not quite accurate:

Screenshot 2022-08-27 at 19 25 38

Note:

When performing fine-mapping via echofinemap::multifinemap or echlocatoR::finemap_loci, these columns will have a "FINEMAP." prefix to distinguish them from the results of other fine-mapping methods.

reprex

locus_dir <- file.path(tempdir(),echodata::locus_dir)
dat <- echofinemap::drop_finemap_cols(echodata::BST1)
LD_matrix <- echodata::BST1_LD_matrix
out <- echoLD::subset_common_snps(LD_matrix, dat)
LD_matrix <- out$LD
dat <- out$DT

dat2 <- echofinemap::FINEMAP(dat=dat,
                             locus_dir=locus_dir,
                             LD_matrix=LD_matrix)
Loading required namespace: MungeSumstats
Preparing sample size column (N).
Computing effective sample size using the LDSC method:
 Neff = (N_CAS+N_CON) * (N_CAS/(N_CAS+N_CON)) / mean((N_CAS/(N_CAS+N_CON))[(N_CAS+N_CON)==max(N_CAS+N_CON)]))
+ Mapping colnames from MungeSumstats ==> echolocatoR
+ Subsetting LD matrix and dat to common SNPs...
Removing unnamed rows/cols
Replacing NAs with 0
+ LD_matrix = 95 SNPs.
+ dat = 95 SNPs.
+ 95 SNPs in common.
Converting obj to sparseMatrix.
Constructing master file.
Constructing data.z file.
Constructing data.ld file.
FINEMAP path: /Users/schilder/Library/Caches/org.R-project.R/R/echofinemap/FINEMAP/finemap_v1.4.1_MacOSX/finemap_v1.4.1_MacOSX
dyld[20528]: Library not loaded: /usr/local/opt/gcc/lib/gcc/10/libgfortran.5.dylib
  Referenced from: /Users/schilder/Library/Caches/org.R-project.R/R/echofinemap/FINEMAP/finemap_v1.4.1_MacOSX/finemap_v1.4.1_MacOSX
  Reason: tried: '/usr/local/opt/gcc/lib/gcc/10/libgfortran.5.dylib' (no such file), '/usr/local/lib/libgfortran.5.dylib' (no such file), '/usr/lib/libgfortran.5.dylib' (no such file)
Inferred FINEMAP version: 
Running FINEMAP.
cd .../BST1 &&
    .../finemap_v1.3.1_MacOSX

    --sss

    --in-files .../master

    --log

    --n-causal-snps 5

|--------------------------------------|
| Welcome to FINEMAP v1.3.1            |
|                                      |
| (c) 2015-2018 University of Helsinki |
|                                      |
| Help :                               |
| - ./finemap --help                   |
| - www.finemap.me                     |
| - www.christianbenner.com            |
|                                      |
| Contact :                            |
| - christian.benner@helsinki.fi       |
| - matti.pirinen@helsinki.fi          |
|--------------------------------------|

--------
SETTINGS
--------
- dataset         : all
- corr-config     : 0.95
- n-causal-snps   : 5
- n-configs-top   : 50000
- n-convergence   : 1000
- n-iterations    : 100000
- prior-k0        : 0
- prior-std       : 0.05 
- prob-tol        : 0.001

------------
FINE-MAPPING (1/1)
------------
- GWAS summary stats               : FINEMAP/data.z
- SNP correlations                 : FINEMAP/data.ld
- Causal SNP stats                 : FINEMAP/data.snp
- Causal configurations            : FINEMAP/data.config
- Credible sets                    : FINEMAP/data.cred
- Log file                         : FINEMAP/data.log_sss
- Reading input                    : done!   

- Number of GWAS samples           : 1474097
- Number of SNPs                   : 95
- Prior-Pr(# of causal SNPs is k)  : 
  (0 -> 0)
   1 -> 0.584
   2 -> 0.292
   3 -> 0.0963
   4 -> 0.0236
   5 -> 0.00456
- 2994 configurations evaluated (1.008/100%) : converged after 1008 iterations
- Computing causal SNP statistics  : done!   
- Regional SNP heritability        : 0.0033 (SD: 9.5e-05 ; 95% CI: [0.00311,0.00349])
- Log10-BF of >= one causal SNP    : 1.04e+03
- Post-Pr(# of causal SNPs is k)   : 
  (0 -> 0)
   1 -> 0
   2 -> 0
   3 -> 0
   4 -> 9.67e-160
   5 -> 1
- Writing output                   : done    
- Run time                         : 0 hours, 0 minutes, 0 seconds
1 data.cred* file(s) found in the same subfolder.
Selected file based on postPr_k: data.cred
Importing conditional probabilities (.cred file).
Importing marginal probabilities (.snp file).
Importing configuration probabilities (.config file).
Warning message:
In echofinemap::FINEMAP(dat = dat, locus_dir = locus_dir, LD_matrix = LD_matrix) :

*********
 Error detected: 'dyld: Library not loaded: /usr/local/lib/libzstd.1.dylib'
 If you are using Mac OS, please install Zstandard
 (https://facebook.github.io/zstd/).
 e.g. `brew install zstd`

 Also, ensure that you have gcc v8 installed,
 as FINEMAP v1.4 is only compatible with this version.

 If Zstandard is already installed and this error persists,
 please see the main FINEMAP website for additional support
 (http://www.christianbenner.com). 
*********

results

head(dat2)

Image

FINEMAP version differences

I've spent quite a bit of time examining the outputs of each FINEMAP's versions outputs in different conditions, and I think I've accounted for these inechofinemap's internal processing functions. But do let me know if I've missed something or a new difference comes up in future versions of FINEMAP.

jdblischak commented 2 years ago

@bschilder Thanks for following up and for putting all this work into new functionality and documentation.

Do let me know if you think any of these descriptions are lacking or not quite accurate

In general I think returning more results to the end user is the right thing to do.

With your new setup, how does the number of credible sets compare between SuSiE and FINEMAP? That was the main issue I had with the previous iteration (ie FINEMAP always showed 1 CS even if it actually identified more). I attempted to run your reprex, but I couldn't get echoconda installed:

> remotes::install_github("RajLabMSSM/echoconda", upgrade = FALSE)
─  building ‘echoconda_0.99.6.tar.gz’

* installing *source* package ‘echoconda’ ...
** using non-staged installation via StagedInstall field
Error in data.table::fread(system.file(package = "echoconda", "conda/echoR_versions.tsv.gz")) :
  Input is empty or only contains BOM or terminal control characters
Calls: <Anonymous> -> eval -> eval -> get_echoR -> <Anonymous>
Execution halted
ERROR: configuration failed for package ‘echoconda’
bschilder commented 2 years ago

Regarding the conda installation issue, could you include your OS specs? Haven't encountered this issue on MacOS or Linux docker containers, and can't seem to replicate it now.

jdblischak commented 1 year ago

Regarding the conda installation issue, could you include your OS specs? Haven't encountered this issue on MacOS or Linux docker containers, and can't seem to replicate it now.

That error happened in a peculiar setup. R is running inside of a singularity container on an HPC environment. In case it's helpful, below I reproduced the installation error and appended the session information:

> remotes::install_github("RajLabMSSM/echoconda", upgrade = FALSE)
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo RajLabMSSM/echoconda@HEAD
✔  checking for file ‘/local/tmp/RtmphmtsG7/remotes561138236260/RajLabMSSM-echoconda-49cac2a/DESCRIPTION’ (351ms)
─  preparing ‘echoconda’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Omitted ‘LazyData’ from DESCRIPTION
─  building ‘echoconda_0.99.7.tar.gz’

Installing package into ‘/mnt/user-library/r420-bioc315-20220506’
(as ‘lib’ is unspecified)
* installing *source* package ‘echoconda’ ...
** using non-staged installation via StagedInstall field
Error in data.table::fread(system.file(package = "echoconda", "conda/echoR_versions.tsv.gz")) :
  Input is empty or only contains BOM or terminal control characters
Calls: <Anonymous> -> eval -> eval -> get_echoR -> <Anonymous>
Execution halted
ERROR: configuration failed for package ‘echoconda’
* removing ‘/mnt/user-library/r420-bioc315-20220506/echoconda’
Warning message:
In i.p(...) :
  installation of package ‘/local/tmp/RtmphmtsG7/file5611322a38a1/echoconda_0.99.7.tar.gz’ had non-zero exit status

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] processx_3.7.0    compiler_4.2.0    R6_2.5.1          rprojroot_2.0.3
 [5] cli_3.3.0         prettyunits_1.1.1 tools_4.2.0       withr_2.5.0
 [9] curl_4.3.2        crayon_1.5.1      remotes_2.4.2     callr_3.7.0
[13] pkgbuild_1.3.1    ps_1.7.1
jdblischak commented 1 year ago

Also, I was able to install echoconda and echofinemap in a more traditional setup with R installed in a conda environment. I just had to install echodata prior to installing echoconda (see PR https://github.com/RajLabMSSM/echoconda/pull/6)


> remotes::install_github("RajLabMSSM/echodata", upgrade = FALSE)
> remotes::install_github("RajLabMSSM/echoconda", upgrade = FALSE)
> library(echoconda)
> remotes::install_github("RajLabMSSM/echofinemap", upgrade = FALSE)
> library(echofinemap)
> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: ~/.conda/envs/test-echo/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] echofinemap_0.99.3 echoconda_0.99.7

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3            rjson_0.2.21
  [3] ellipsis_0.3.2              downloadR_0.99.4
  [5] XVector_0.34.0              GenomicRanges_1.46.1
  [7] remotes_2.4.2               DT_0.25
  [9] bit64_4.0.5                 AnnotationDbi_1.56.1
 [11] fansi_1.0.3                 xml2_1.3.3
 [13] splines_4.1.3               snpStats_1.44.0
 [15] R.methodsS3_1.8.2           cachem_1.0.6
 [17] echoLD_0.99.7               jsonlite_1.8.0
 [19] Rsamtools_2.10.0            dbplyr_2.2.1
 [21] png_0.1-7                   R.oo_1.25.0
 [23] echodata_0.99.12            BiocManager_1.30.18
 [25] readr_2.1.2                 compiler_4.1.3
 [27] httr_1.4.4                  basilisk_1.6.0
 [29] assertthat_0.2.1            Matrix_1.4-1
 [31] fastmap_1.1.0               cli_3.4.0
 [33] htmltools_0.5.3             prettyunits_1.1.1
 [35] tools_4.1.3                 gtable_0.3.1
 [37] glue_1.6.2                  GenomeInfoDbData_1.2.7
 [39] dplyr_1.0.10                rappdirs_0.3.3
 [41] Rcpp_1.0.9                  Biobase_2.54.0
 [43] vctrs_0.4.1                 Biostrings_2.62.0
 [45] echotabix_0.99.8            rtracklayer_1.54.0
 [47] stringr_1.4.1               openxlsx_4.2.5
 [49] lifecycle_1.0.2             irlba_2.3.5
 [51] restfulr_0.0.15             XML_3.99-0.10
 [53] zlibbioc_1.40.0             scales_1.2.1
 [55] basilisk.utils_1.6.0        BSgenome_1.62.0
 [57] VariantAnnotation_1.40.0    coloc_5.1.0.1
 [59] hms_1.1.2                   MatrixGenerics_1.6.0
 [61] parallel_4.1.3              SummarizedExperiment_1.24.0
 [63] susieR_0.12.27              yaml_2.3.5
 [65] curl_4.3.2                  memoise_2.0.1
 [67] reticulate_1.26             gridExtra_2.3
 [69] ggplot2_3.3.6               biomaRt_2.50.0
 [71] reshape_0.8.9               stringi_1.7.8
 [73] RSQLite_2.2.8               S4Vectors_0.32.4
 [75] BiocIO_1.4.0                GenomicFeatures_1.46.1
 [77] BiocGenerics_0.40.0         filelock_1.0.2
 [79] zip_2.2.1                   BiocParallel_1.28.3
 [81] GenomeInfoDb_1.30.0         rlang_1.0.5
 [83] pkgconfig_2.0.3             matrixStats_0.62.0
 [85] bitops_1.0-7                lattice_0.20-45
 [87] purrr_0.3.4                 GenomicAlignments_1.30.0
 [89] htmlwidgets_1.5.4           bit_4.0.4
 [91] tidyselect_1.1.2            plyr_1.8.7
 [93] magrittr_2.0.3              R6_2.5.1
 [95] IRanges_2.28.0              generics_0.1.3
 [97] piggyback_0.1.4             DelayedArray_0.20.0
 [99] DBI_1.1.3                   pillar_1.8.1
[101] survival_3.4-0              KEGGREST_1.34.0
[103] RCurl_1.98-1.8              mixsqp_0.3-43
[105] tibble_3.1.8                dir.expiry_1.2.0
[107] crayon_1.5.1                utf8_1.2.2
[109] BiocFileCache_2.2.0         tzdb_0.3.0
[111] viridis_0.6.2               progress_1.2.2
[113] grid_4.1.3                  data.table_1.14.2
[115] blob_1.2.3                  digest_0.6.29
[117] tidyr_1.2.1                 R.utils_2.12.0
[119] stats4_4.1.3                munsell_0.5.0
[121] viridisLite_0.4.1
jdblischak commented 1 year ago

Now that I have echofinemap installed, I tried to run the reprex. I got an error from MungeSumstats. I probably need to update all the packages to their latest versions, but I don't have the time to continue troubleshooting this at the moment

locus_dir <- file.path(tempdir(),echodata::locus_dir)
dat <- echofinemap::drop_finemap_cols(echodata::BST1)
LD_matrix <- echodata::BST1_LD_matrix
out <- echoLD::subset_common_snps(LD_matrix, dat)
LD_matrix <- out$LD
dat <- out$DT

dat2 <- echofinemap::FINEMAP(dat=dat,
                             locus_dir=locus_dir,
                             LD_matrix=LD_matrix)
Loading required namespace: MungeSumstats
Preparing sample size column (N).
Error: 'compute_nsize' is not an exported object from 'namespace:MungeSumstats'

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /gstore/home/blischaj/.conda/envs/test-echo/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] echofinemap_0.99.3 echoconda_0.99.7

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3            rjson_0.2.21
  [3] ellipsis_0.3.2              downloadR_0.99.4
  [5] XVector_0.34.0              fs_1.5.2
  [7] GenomicRanges_1.46.1        remotes_2.4.2
  [9] DT_0.25                     bit64_4.0.5
 [11] AnnotationDbi_1.56.1        fansi_1.0.3
 [13] xml2_1.3.3                  splines_4.1.3
 [15] snpStats_1.44.0             R.methodsS3_1.8.2
 [17] cachem_1.0.6                echoLD_0.99.7
 [19] jsonlite_1.8.0              Rsamtools_2.10.0
 [21] dbplyr_2.2.1                png_0.1-7
 [23] R.oo_1.25.0                 echodata_0.99.12
 [25] BiocManager_1.30.18         readr_2.1.2
 [27] compiler_4.1.3              httr_1.4.4
 [29] basilisk_1.6.0              assertthat_0.2.1
 [31] Matrix_1.4-1                fastmap_1.1.0
 [33] gargle_1.2.1                cli_3.4.0
 [35] htmltools_0.5.3             prettyunits_1.1.1
 [37] tools_4.1.3                 gtable_0.3.1
 [39] glue_1.6.2                  GenomeInfoDbData_1.2.7
 [41] dplyr_1.0.10                rappdirs_0.3.3
 [43] Rcpp_1.0.9                  Biobase_2.54.0
 [45] vctrs_0.4.1                 Biostrings_2.62.0
 [47] echotabix_0.99.8            rtracklayer_1.54.0
 [49] stringr_1.4.1               MungeSumstats_1.2.0
 [51] openxlsx_4.2.5              lifecycle_1.0.2
 [53] irlba_2.3.5                 restfulr_0.0.15
 [55] XML_3.99-0.10               googleAuthR_2.0.0
 [57] zlibbioc_1.40.0             scales_1.2.1
 [59] basilisk.utils_1.6.0        BSgenome_1.62.0
 [61] VariantAnnotation_1.40.0    coloc_5.1.0.1
 [63] hms_1.1.2                   MatrixGenerics_1.6.0
 [65] parallel_4.1.3              SummarizedExperiment_1.24.0
 [67] susieR_0.12.27              yaml_2.3.5
 [69] curl_4.3.2                  memoise_2.0.1
 [71] reticulate_1.26             gridExtra_2.3
 [73] ggplot2_3.3.6               biomaRt_2.50.0
 [75] reshape_0.8.9               stringi_1.7.8
 [77] RSQLite_2.2.8               S4Vectors_0.32.4
 [79] BiocIO_1.4.0                GenomicFeatures_1.46.1
 [81] BiocGenerics_0.40.0         filelock_1.0.2
 [83] zip_2.2.1                   BiocParallel_1.28.3
 [85] GenomeInfoDb_1.30.0         rlang_1.0.5
 [87] pkgconfig_2.0.3             matrixStats_0.62.0
 [89] bitops_1.0-7                lattice_0.20-45
 [91] purrr_0.3.4                 GenomicAlignments_1.30.0
 [93] htmlwidgets_1.5.4           bit_4.0.4
 [95] tidyselect_1.1.2            plyr_1.8.7
 [97] magrittr_2.0.3              R6_2.5.1
 [99] IRanges_2.28.0              generics_0.1.3
[101] piggyback_0.1.4             DelayedArray_0.20.0
[103] DBI_1.1.3                   pillar_1.8.1
[105] survival_3.4-0              KEGGREST_1.34.0
[107] RCurl_1.98-1.8              mixsqp_0.3-43
[109] tibble_3.1.8                dir.expiry_1.2.0
[111] crayon_1.5.1                utf8_1.2.2
[113] BiocFileCache_2.2.0         tzdb_0.3.0
[115] viridis_0.6.2               progress_1.2.2
[117] grid_4.1.3                  data.table_1.14.2
[119] blob_1.2.3                  digest_0.6.29
[121] tidyr_1.2.1                 R.utils_2.12.0
[123] stats4_4.1.3                munsell_0.5.0
[125] viridisLite_0.4.1
bschilder commented 1 year ago

1.

That error happened in a peculiar setup. R is running inside of a singularity container on an HPC environment. In case it's helpful, below I reproduced the installation error and appended the session information:

Error in data.table::fread(system.file(package = "echoconda", "conda/echoR_versions.tsv.gz")) :
  Input is empty or only contains BOM or terminal control characters

This happens bc data.table requires R.utils to be installed to read in gz-compressed files on certain OS. I've now made R.utils an Import of echodata to avoid this in the future.

2.

Also, I was able to install echoconda and echofinemap in a more traditional setup with R installed in a conda environment. I just had to install echodata prior to installing echoconda (see PR https://github.com/RajLabMSSM/echoconda/pull/6)

Great, thanks for the PR! I must have added echodata functions later and forgot to add it as a dep. Will merge this soon.

3.

Now that I have echofinemap installed, I tried to run the reprex. I got an error from MungeSumstats. I probably need to update all the packages to their latest versions, but I don't have the time to continue troubleshooting this at the moment

You're exactly right, MungeSumstats exported compute_nsize in v1.3.9.

Though it's odd this is coming up for you since I have a minimum version required for echodata in the DESCRIPTION file: MungeSumstats (>= 1.3.14). I've added an additional safeguard within echodata::get_sample_size to let users know about this version requirement.

 requireNamespace("MungeSumstats")
    if(packageVersion("MungeSumstats")<"1.3.14"){
        stp <- "echodata requires MungeSumstats >=1.3.14"
        stop(stp)
    }
jdblischak commented 1 year ago

This happens bc data.table requires R.utils to be installed to read in gz-compressed files on certain OS. I've now made R.utils an Import of echodata to avoid this in the future.

This isn't blocking me, so feel free to ignore. But I already had R.utils installed in that environment, so I am skeptical that was the problem. In case it could be useful for future troubleshooting, here are the versions of data.table and R.utils I had:

library("data.table")
## data.table 1.14.2 using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com
library("R.utils")
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.
## 
## Attaching package: ‘R.oo’
## 
## The following object is masked from ‘package:R.methodsS3’:
## 
##     throw
## 
## The following objects are masked from ‘package:methods’:
## 
##     getClasses, getMethods
## 
## The following objects are masked from ‘package:base’:
## 
##     attach, detach, load, save
## 
## R.utils v2.12.0 (2022-06-28 03:20:05 UTC) successfully loaded. See ?R.utils for help.
## 
## Attaching package: ‘R.utils’
## 
## The following object is masked from ‘package:utils’:
## 
##     timestamp
## 
## The following objects are masked from ‘package:base’:
## 
##     cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
## 
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
## 
## other attached packages:
## [1] R.utils_2.12.0    R.oo_1.25.0       R.methodsS3_1.8.2 data.table_1.14.2
## 
## loaded via a namespace (and not attached):
## [1] compiler_4.2.0 tools_4.2.0
jdblischak commented 1 year ago

Also, would you be able to export the dat2 object from your reprex as a plain text file and attach it to this Issue? I'd like to be able to look at the new results without having to run the reprex myself locally.