RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
34 stars 11 forks source link

New version of echolocatoR fails to read input files #86

Closed AMCalejandro closed 1 year ago

AMCalejandro commented 2 years ago

Hi,

This is related to #85 Apparently the new coloc version is messing up trying to read the input

(echoR) acarrasco@kronos:/mnt/rreal/RDS/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/SCRIPTS$ cat MAD1L1_LIN50051.SCRIP_V2.OUT 
nohup: ignoring input
Warning messages:
1: replacing previous import ‘data.table::last’ by ‘dplyr::last’ when loading ‘echolocatoR’ 
2: replacing previous import ‘data.table::first’ by ‘dplyr::first’ when loading ‘echolocatoR’ 
3: replacing previous import ‘data.table::between’ by ‘dplyr::between’ when loading ‘echolocatoR’ 
[1] "+ CONDA:: Activating conda env 'echoR'"
[1] "Checking for tabix installation..."
[1] "Checking for bcftools installation..."
WARNING:: fullSS_genome_build not provided. Assuming hg19.

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
LINC00511 (1 / 2)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "TABIX:: Tabix-indexing file."
[1] "Determining chrom type from file header"
[1] "LD:: Standardizing summary statistics subset."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: #MarkerName
Fine-mapping complete in:
Time difference of 35.5 secs

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
MAD1L1 (2 / 2)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "TABIX:: Tabix-indexing file."
[1] "Determining chrom type from file header"
[1] "LD:: Standardizing summary statistics subset."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: #MarkerName
Fine-mapping complete in:
Time difference of 29.8 secs
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: SNP
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /home/acarrasco/.conda/envs/echoR/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] echolocatoR_0.2.3 data.table_1.14.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           XVector_0.30.0         GenomeInfoDb_1.26.7   
 [4] pillar_1.7.0           compiler_4.0.3         R.methodsS3_1.8.1     
 [7] R.utils_2.11.0         tools_4.0.3            bitops_1.0-7          
[10] zlibbioc_1.36.0        jsonlite_1.8.0         lifecycle_1.0.1       
[13] tibble_3.1.6           gtable_0.3.0           lattice_0.20-41       
[16] pkgconfig_2.0.3        png_0.1-7              rlang_1.0.2           
[19] Matrix_1.3-2           DBI_1.1.2              cli_3.2.0             
[22] parallel_4.0.3         GenomeInfoDbData_1.2.4 dplyr_1.0.8           
[25] Biostrings_2.58.0      IRanges_2.24.1         S4Vectors_0.28.1      
[28] generics_0.1.2         vctrs_0.3.8            rappdirs_0.3.3        
[31] stats4_4.0.3           grid_4.0.3             tidyselect_1.1.2      
[34] reticulate_1.24        glue_1.6.2             R6_2.5.1              
[37] fansi_1.0.3            BiocParallel_1.24.1    seqminer_8.4          
[40] ggplot2_3.3.5          purrr_0.3.4            magrittr_2.0.2        
[43] GenomicRanges_1.42.0   scales_1.1.1           Rsamtools_2.6.0       
[46] ellipsis_0.3.2         BiocGenerics_0.36.1    assertthat_0.2.1      
[49] colorspace_2.0-3       utf8_1.2.2             RCurl_1.98-1.6        
[52] munsell_0.5.0          crayon_1.5.0           R.oo_1.24.0           
(echoR) acarrasco@kronos:/mnt/rreal/RDS/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/SCRIPTS$ 

Something interesting is that, if I do not activate the conda environment and then I run finemap_loci() from a script, echolocatoR does not fail in reading the input files....


vvvvv-- ABF --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- FINEMAP --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- SUSIE --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- POLYFUN_SUSIE --vvvvv
✅ All required columns present.
✅ All suggested columns present.

+++ Multi-finemap:: ABF +++

+++ Multi-finemap:: FINEMAP +++
[1] "++ FINEMAP:: Constructing master file."
[1] "++ FINEMAP:: Constructing data.z file."
[1] "++ FINEMAP:: Constructing data.ld file."
[1] "+ Using FINEMAP v1.4"

|--------------------------------------|
| Welcome to FINEMAP v1.4              |
|                                      |
| (c) 2015-2020 University of Helsinki |
|                                      |
| Help :                               |
| - ./finemap --help                   |
| - www.finemap.me                     |
| - www.christianbenner.com            |
|                                      |
| Contact :                            |
| - christian.benner@helsinki.fi       |
| - matti.pirinen@helsinki.fi          |
|--------------------------------------|

--------
SETTINGS
--------
- dataset            : all
- corr-config        : 0.95
- n-causal-snps      : 5
- n-configs-top      : 50000
- n-conv-sss         : 100
- n-iter             : 100000
- n-threads          : 1
- prior-k0           : 0
- prior-std          : 0.05 
- prob-conv-sss-tol  : 0.001
- prob-cred-set      : 0.95

------------
FINE-MAPPING (1/1)
------------
- GWAS summary stats               : FINEMAP/data.z
- SNP correlations                 : FINEMAP/data.ld
- Causal SNP stats                 : FINEMAP/data.snp
- Causal configurations            : FINEMAP/data.config
- Credible sets                    : FINEMAP/data.cred
- Log file                         : FINEMAP/data.log_sss
- Reading input                    : done!   

- Number of GWAS samples           : 3572
- Number of SNPs                   : 4057
- Prior-Pr(# of causal SNPs is k)  : 
  (0 -> 0)
   1 -> 0.583
   2 -> 0.291
   3 -> 0.0971
   4 -> 0.0243
   5 -> 0.00485
- 100430 configurations evaluated (0.113/100%) : converged after 113 iterations
- Computing causal SNP statistics  : done!   
- Regional SNP heritability        : 93.1 (SD: 0.374 ; 95% CI: [92.3,93.8])
- Log10-BF of >= one causal SNP    : 109
- Post-expected # of causal SNPs   : 5
- Post-Pr(# of causal SNPs is k)   : 
  (0 -> 0)
   1 -> 0
   2 -> 0
   3 -> 1.92e-44
   4 -> 3.85e-32
   5 -> 1
- Computing credible sets          : done!
- Writing output                   : done!   
- Run time                         : 0 hours, 0 minutes, 30 seconds
[1] ".cred not detected. Using .snp instead."
[1] "+ FINEMAP:: Importing prob (.snp)..."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: prob
In addition: Warning message:
In sdY.est(d$varbeta, d$MAF, d$N) :
  estimating sdY from maf and varbeta, please directly supply sdY if known

r$> sessionInfo()                                                                                                                                                                                          
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /home/acarrasco/.conda/envs/echoR/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] echolocatoR_0.2.3

loaded via a namespace (and not attached):
 [1] magrittr_2.0.2    tidyselect_1.1.2  munsell_0.5.0     colorspace_2.0-3  R6_2.5.1          rlang_1.0.2       fansi_1.0.3       dplyr_1.0.8       grid_4.0.3        data.table_1.14.2 gtable_0.3.0     
[12] utf8_1.2.2        cli_3.2.0         DBI_1.1.2         ellipsis_0.3.2    assertthat_0.2.1  tibble_3.1.6      lifecycle_1.0.1   crayon_1.5.0      purrr_0.3.4       ggplot2_3.3.5     vctrs_0.3.8      
[23] glue_1.6.2        compiler_4.0.3    pillar_1.7.0      generics_0.1.2    scales_1.1.1      pkgconfig_2.0.3 

The code from these two outputs is the same.

#!usr/bin/Rscript
library(data.table)
library(echolocatoR)  
#library(tidyverse)

fullSS_path <- "/mnt/rreal/RDS/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/for_echolocatoR_axialOutcome_3.tsv"                                                                    
fullRS_path <- "/mnt/rreal/RDS/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS_25.3.2022"

top_SNPs = fread("../topSNPs_axialMotorSymptom.txt") 
top_SNPs = top_SNPs[c(5,6), ]

earlymotor_PD = finemap_loci(top_SNPs = top_SNPs,
                                              loci = top_SNPs$Locus, 
                                              dataset_name = "earlymotorPD_axial", 
                                              dataset_type = "mixedmodels_GWAS",   
                                              force_new_subset = T, 
                                              force_new_LD = T, 
                                              force_new_finemap = T, 
                                              remove_tmps = F, 

                     # SUMMARY STATS ARGUMENTS 
                     fullSS_path = fullSS_path,
                     results_dir = fullRS_path,
                     query_by = "tabix", 
                     chrom_col = "CHR", position_col = "POS", snp_col = "#MarkerName", 
                     pval_col = "Pval", effect_col = "Effect", stderr_col = "StdErr", 
                     freq_col = "medianFreq", MAF_col = "calculate", 
                     A1_col = "Allele1", 
                     A2_col = "Allele2", 
                     #N_cases_col = "TotalSampleSize",
                     #N_controls = 0,

                     # FILTERING ARGUMENTS 
                     ## It's often desirable to use a larger window size  
                     ## (e.g. 2Mb which is bp_distance=500000*2),  
                     ## but we use a small window here to speed up the process.  
                     bp_distance = 500000*2, 
                     min_MAF = 0.001,   
                     trim_gene_limits = F, 

                     # FINE-MAPPING ARGUMENTS 
                     ## General 
                     finemap_methods = c("ABF", "SUSIE", "POLYFUN_SUSIE"), #"FINEMAP"  
                     n_causal = 5, 
                     PP_threshold = .95,  

                     # LD ARGUMENTS  
                     LD_reference = "UKB", 
                     superpopulation = "EUR", 
                     download_method = "axel", 

                     # Additional arguments - My arguments
                     case_control = F,
                     nThread = 15,
                     sample_size = 3572,

                     # PLOT ARGUMENTS  
                     ## general    
                     plot.types=c("fancy"), 
                     ## Generate multiple plots of different window sizes;  
                     ### all SNPs, 4x zoomed-in, and a 50000bp window 
                     plot.zoom = c("all","4x","10x", "30x"), 
                     ## XGR 
                     # plot.XGR_libnames=c("ENCODE_TFBS_ClusteredV3_CellTypes"),  
                     ## Roadmap 
                     plot.Roadmap = F, 
                     plot.Roadmap_query = NULL, 
                     # Nott et al. (2019) 
                     plot.Nott_epigenome = T,  
                     plot.Nott_show_placseq = T,  

                     verbose = F 
                     )

The only difference is that in the first output I activate the environment before running the script whereas on the second output I do not activate the environment before running the script.

I thought it could be I was loasing a previous echolocatoR version outside the conda environment. However, we see on the sesion info that the echolocatoR version is the most up to date in both cases....

There are two issues.

bschilder commented 2 years ago

Hi @AMCalejandro, I've rewritten a lot of these steps (including input data standardization) and broken them down into subpackages since this post. Does this persist in the current echoverse branch (now master branch)?