RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
33 stars 11 forks source link

Error doing ABF #47

Closed alexMarCar closed 2 years ago

alexMarCar commented 3 years ago

Hi,

I am getting the following error that I am not able to understand when using the finemal_loci() function to do ABF

[1] "+ Full UKB LD matrix: 22590 x 22590"
[1] "+ Full UKB LD SNP data.table: 22590 x 5"
[1] "UKB MAF:: Extracting MAF from UKB reference."
[1] "+ CONDA:: Identified axel executable in echoR env."
Error in `[.data.table`(y, x, nomatch = if (all.x) NA else NULL, on = by,  : 
  logical error. i is not a data.table, but 'on' argument is provided.
Fine-mapping complete in:
Time difference of 2 mins
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: SNP

This is the function calling:

r$> Maryam_colocFIneMap_Results <- finemap_loci(top_SNPs = top_SNPs,      
                                                      results_dir = "/data/kronos/kronos/acarrasco/FOR_MARYAM/results",   
                                                      loci = c("locus1"),    
                                                      dataset_name = "GWAS_coloc",    
                                                      dataset_type = "GWAS",      
                                                      force_new_subset = F,    
                                                      force_new_LD = T,    
                                                      force_new_finemap = T,    
                                                      remove_tmps = F,    
                                                      fullSS_path = "/data/kronos/kronos/acarrasco/FOR_MARYAM/DLB-vs-Nalls-all_se.txt",  
                                                      query_by = "tabix",    
                          chrom_col = "CHR", position_col = "POS", snp_col = "SNP", pval_col = "woolf_p",  
                          effect_col = "log_woolf_OR", stderr_col = "se", N_cases_col = "NPD", N_controls_col = "NCO",  
                          proportion_cases = "prportion_cases",  
                            
                          bp_distance = 500000*2,   
                          trim_gene_limits = F,   
                            
                          finemap_methods = c("ABF"),                     
                         LD_reference = "UKB",   
                             superpopulation = "EUR",   
                             download_method = "axel",   
                               plot.types=c("fancy"),   
                             ## Generate multiple plots of different window sizes;    
                             ### all SNPs, 4x zoomed-in, and a 50000bp window   
                             plot.zoom = c("all","4x","10x"),   
                             ## XGR   
                             # plot.XGR_libnames=c("ENCODE_TFBS_ClusteredV3_CellTypes"),    
                             ## Roadmap   
                             plot.Roadmap = F,   
                             plot.Roadmap_query = NULL,   
                             # Nott et al. (2019)   
                             plot.Nott_epigenome = T,    
                             plot.Nott_show_placseq = T,  
                             verbose = F )    

Also, this is the header of my GWAS

CHR    POS SNP A1  A2  NPD NCO woolf_p log_woolf_OR    log_OR1 log_OR2 woolf_p_adjusted    tval    se  prportion_cases
1   798400  chr1:798400 G   A   26035   441778  0.054355312527485   -0.0129364098436779 -0.0937025329495809 -0.0077974327837792 0.149450021680241   1.92408105275612    -0.00672342250091146    0.058932314420365
1   798959  chr1:798959 A   G   26035   441778  0.0624429326097045  -0.0166813371949653 -0.0949059766764264 -0.0116961408205252 0.162749942120059   1.86321743251793    -0.00895297398136848    0.058932314420365
1   1037303 chr1:1037303    C   T   33674   449056  0.413943979264559   -0.0415678483019056 -0.0022959606094905 -0.0436749490486619 0.540481742929293   0.816982901531581   -0.0508797041210767 0.0749884201524977

Of note, the top_SNPs were got by using the import_topSNPs() function after doing a grouping by CHR, and when calling the finemap_loci() function I specify the locus I am interested in through the "loci" flag from the generated top_SNPs datafrmae.

The sessionInfo:

r$> utils::sessionInfo()                                                                                                                                                                                   
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home/acarrasco/.conda/envs/echoR/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] echolocatoR_0.1.2 data.table_1.14.0 forcats_0.5.1     stringr_1.4.0     dplyr_1.0.5       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3       tibble_3.1.1      ggplot2_3.3.3     tidyverse_1.3.1  

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                backports_1.2.1             Hmisc_4.5-0                 BiocFileCache_1.14.0        plyr_1.8.6                  lazyeval_0.2.2              splines_4.0.3              
  [8] crosstalk_1.1.1             BiocParallel_1.24.1         GenomeInfoDb_1.26.7         digest_0.6.27               ensembldb_2.14.1            htmltools_0.5.1.1           fansi_0.4.2                
 [15] magrittr_2.0.1              checkmate_2.0.0             memoise_2.0.0               BSgenome_1.58.0             cluster_2.1.1               Biostrings_2.58.0           modelr_0.1.8               
 [22] matrixStats_0.58.0          R.utils_2.10.1              ggbio_1.38.0                askpass_1.1                 prettyunits_1.1.1           jpeg_0.1-8.1                colorspace_2.0-0           
 [29] rappdirs_0.3.3              blob_1.2.1                  rvest_1.0.0                 haven_2.4.1                 xfun_0.22                   crayon_1.4.1                RCurl_1.98-1.3             
 [36] jsonlite_1.7.2              graph_1.68.0                Exact_2.1                   VariantAnnotation_1.36.0    survival_3.2-11             glue_1.4.2                  gtable_0.3.0               
 [43] zlibbioc_1.36.0             XVector_0.30.0              DelayedArray_0.16.3         BiocGenerics_0.36.1         scales_1.1.1                mvtnorm_1.1-1               DBI_1.1.1                  
 [50] GGally_2.1.1                Rcpp_1.0.6                  progress_1.2.2              htmlTable_2.1.0             reticulate_1.19             foreign_0.8-81              bit_4.0.4                  
 [57] proxy_0.4-25                OrganismDbi_1.32.0          Formula_1.2-4               DT_0.18                     stats4_4.0.3                htmlwidgets_1.5.3           httr_1.4.2                 
 [64] RColorBrewer_1.1-2          ellipsis_0.3.1              R.methodsS3_1.8.1           pkgconfig_2.0.3             reshape_0.8.8               XML_3.99-0.6                nnet_7.3-15                
 [71] dbplyr_2.1.1                utf8_1.2.1                  tidyselect_1.1.0            rlang_0.4.10                reshape2_1.4.4              AnnotationDbi_1.52.0        munsell_0.5.0              
 [78] cellranger_1.1.0            tools_4.0.3                 cachem_1.0.4                cli_2.5.0                   generics_0.1.0              RSQLite_2.2.5               broom_0.7.6                
 [85] fastmap_1.1.0               knitr_1.33                  bit64_4.0.5                 fs_1.5.0                    AnnotationFilter_1.14.0     rootSolve_1.8.2.1           RBGL_1.66.0                
 [92] R.oo_1.24.0                 xml2_1.3.2                  biomaRt_2.46.3              compiler_4.0.3              rstudioapi_0.13             curl_4.3                    png_0.1-7                  
 [99] e1071_1.7-6                 reprex_2.0.0                DescTools_0.99.41           stringi_1.5.3               ps_1.6.0                    GenomicFeatures_1.42.3      lattice_0.20-41            
[106] ProtGenerics_1.22.0         Matrix_1.3-2                vctrs_0.3.7                 pillar_1.6.0                lifecycle_1.0.0             BiocManager_1.30.12         bitops_1.0-7               
[113] lmom_2.8                    rtracklayer_1.50.0          GenomicRanges_1.42.0        R6_2.5.0                    latticeExtra_0.6-29         gridExtra_2.3               IRanges_2.24.1             
[120] gld_2.6.2                   dichromat_2.0-0             boot_1.3-27                 MASS_7.3-53.1               assertthat_0.2.1            SummarizedExperiment_1.20.0 openssl_1.4.3              
[127] withr_2.4.2                 GenomicAlignments_1.26.0    Rsamtools_2.6.0             S4Vectors_0.28.1            GenomeInfoDbData_1.2.4      expm_0.999-6                parallel_4.0.3             
[134] hms_1.0.0                   grid_4.0.3                  rpart_4.1-15                class_7.3-18                MatrixGenerics_1.2.1        biovizBase_1.38.0           Biobase_2.50.0             
[141] lubridate_1.7.10            base64enc_0.1-3            

Many thanks!, Alejandro.

bschilder commented 3 years ago

I think this is happening bc the "SNP" col in your GWAS has coordinates instead of RSIDs. The step that merges the LD panel with the GWAS sum stats does so using matching RSIDS.

alexMarCar commented 3 years ago

I just changed the the SNP notation from chr:pos to rsIDs, and I am getting the same issue.

[1] "+ Full UKB LD matrix: 20801 x 20801"
[1] "+ Full UKB LD SNP data.table: 20801 x 5"
[1] "UKB MAF:: Extracting MAF from UKB reference."
[1] "+ CONDA:: Identified axel executable in echoR env."
Error in `[.data.table`(y, x, nomatch = if (all.x) NA else NULL, on = by,  : 
  logical error. i is not a data.table, but 'on' argument is provided.
Fine-mapping complete in:
Time difference of 2 mins
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: SNP

The GWAS results header after the update:

CHR    POS A1  A2  NPD NCO woolf_p log_woolf_OR    log_OR1 log_OR2 woolf_p_adjusted    tval    se  prportion_cases SNP
1   798400  G   A   26035   441778  0.054355312527485   -0.0129364098436779 -0.0937025329495809 -0.0077974327837792 0.149450021680241   1.92408105275612    -0.00672342250091146    0.058932314420365   rs10900604
1   798959  A   G   26035   441778  0.0624429326097045  -0.0166813371949653 -0.0949059766764264 -0.0116961408205252 0.162749942120059   1.86321743251793    -0.00895297398136848    0.058932314420365   rs11240777
1   1037303 C   T   33674   449056  0.413943979264559   -0.0415678483019056 -0.0022959606094905 -0.0436749490486619 0.540481742929293   0.816982901531581   -0.0508797041210767 0.0749884201524977  rs11260592
1   1037313 G   A   33674   449056  0.398505671669803   -0.0420967280143385 -0.00152942739309308    -0.0442746217585758 0.527025397597339   0.844303781275011   -0.049859693806851  0.0749884201524977  rs11260593

Best and many thanks!, Alejandro.

bschilder commented 3 years ago

Did you use force_new_subset = T to make sure it's not using your old file subset?

alexMarCar commented 3 years ago

Yes. I am actually removing everything under the results directory (rm -rf ./results/*) to make sure each command is run from scratch. There is just one thing I am not sure if it is fine for this tool - I do not have on my GWAS results gene and locus info. When getting top SNPs per chromosome, I get a locus name inferred when using import_topSNPs

21:   locus_chr8_rs2513924   locus_chr8_rs2513924   8 103693753   rs2513924 1.234177e-07 -0.061049972
22:   locus_chr9_rs2821150   locus_chr9_rs2821150   9 113520979   rs2821150 2.524724e-06 -0.018425605
                     Locus                   Gene CHR       POS         SNP            P       Effect

Subsequently, the locus name that I pass on the finemap_loci() would be ie locus_chr8_rs2513924. I do not know if this tool is expecting a real locus name? - I deduced this does not really matter as soon as the chromosome and position are correctly passed.

Update

I have tried to run ABF on single locus modifying the top_SNPs df and giving the positions of interest a real locus and gene name. As I was expecting it did not make any difference

alexMarCar commented 3 years ago

Maybe it helps If I share the whole code for a locus example

Input GWAS data

CHR    POS A1  A2  NPD NCO woolf_p log_woolf_OR    log_OR1 log_OR2 woolf_p_adjusted    tval    se  prportion_cases SNP
1   798400  G   A   26035   441778  0.054355312527485   -0.0129364098436779 -0.0937025329495809 -0.0077974327837792 0.149450021680241   1.92408105275612    -0.00672342250091146    0.058932314420365   rs10900604
1   798959  A   G   26035   441778  0.0624429326097045  -0.0166813371949653 -0.0949059766764264 -0.0116961408205252 0.162749942120059   1.86321743251793    -0.00895297398136848    0.058932314420365   rs11240777
1   1037303 C   T   33674   449056  0.413943979264559   -0.0415678483019056 -0.0022959606094905 -0.0436749490486619 0.540481742929293   0.816982901531581   -0.0508797041210767 0.0749884201524977  rs11260592
1   1037313 G   A   33674   449056  0.398505671669803   -0.0420967280143385 -0.00152942739309308    -0.0442746217585758 0.527025397597339   0.844303781275011   -0.049859693806851  0.0749884201524977  rs11260593
1   1037367 A   G   33674   449056  0.388182072806092   -0.0414481828134772 0   -0.043674956004422  0.517950444263371   0.862930050756748   -0.0480319149589577 0.0749884201524977  rs11260594

The top_SNPs data frame

21:   locus_chr8_rs2513924   locus_chr8_rs2513924   8 103693753   rs2513924 1.234177e-07 -0.061049972
22:   locus_chr9_rs2821150   locus_chr9_rs2821150   9 113520979   rs2821150 2.524724e-06 -0.018425605
                     Locus                   Gene CHR       POS         SNP            P       Effect

The finemap_loci() function and the output

r$> colocFIneMap_Results <- finemap_loci(top_SNPs = top_SNPs,      
                                                      results_dir = "/data/kronos/kronos/acarrasco/FOR_MARYAM/results", 
                                                      loci = c("locus_chr9_rs2821150"), 
                                                      dataset_name = "GWAS_coloc",    
                                                      dataset_type = "GWAS",      
                                                      force_new_subset = T,    
                                                      force_new_LD = T,    
                                                      force_new_finemap = T,    
                                                      remove_tmps = F,    
                                                      fullSS_path = "/data/kronos/kronos/acarrasco/FOR_MARYAM/DLB-vs-Nalls-all.se.rsID_noPoint.txt", 
                                                      query_by = "tabix",    
                          chrom_col = "CHR", position_col = "POS", snp_col = "SNP", pval_col = "woolf_p",  
                          effect_col = "log_woolf_OR", stderr_col = "se", N_cases_col = "NPD", N_controls_col = "NCO",  
                          proportion_cases = "prportion_cases", 
                           
                         bp_distance = 500000*2,   
                          trim_gene_limits = F,   
                            
                          finemap_methods = c("ABF"),                     
                         LD_reference = "UKB",   
                             superpopulation = "EUR",   
                             download_method = "axel",   
                               plot.types=c("fancy"),   
                             ## Generate multiple plots of different window sizes;    
                             ### all SNPs, 4x zoomed-in, and a 50000bp window   
                             plot.zoom = c("all","4x","10x"),   
                             ## XGR   
                             # plot.XGR_libnames=c("ENCODE_TFBS_ClusteredV3_CellTypes"),    
                             ## Roadmap   
                             plot.Roadmap = F,   
                             plot.Roadmap_query = NULL,   
                             # Nott et al. (2019)   
                             plot.Nott_epigenome = T,    
                             plot.Nott_show_placseq = T,    
                             verbose = F )                                                                                                                                                                 
[1] "+ CONDA:: Activating conda env 'echoR'"
[1] "Checking for tabix installation..."
[1] "Checking for bcftools installation..."

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
locus_chr9_rs2821150 (1 / 1)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "Determining chrom type from file header"
[1] "LD:: Standardizing summary statistics subset."
[1] "++ Preparing Gene col"
[1] "++ Preparing A1,A1 cols"
[1] "++ Preparing MAF,Freq cols"
[1] "++ Could not infer MAF"
[1] "++ Preparing N_cases,N_controls cols"
[1] "++ Preparing `proportion_cases` col"
[1] "++ Preparing N col"
[1] "+ Preparing sample_size (N) column"
[1] "++ Computing effective sample size."
[1] "++ Preparing t-stat col"
[1] "+ Calculating t-statistic from Effect and StdErr..."
[1] "++ Assigning lead SNP"
[1] "++ Ensuring Effect, StdErr, P are numeric"
[1] "++ Ensuring 1 SNP per row"
[1] "++ Removing extra whitespace"
[1] "++ Saving subset ==> /data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/locus_chr9_rs2821150_GWAS_coloc_subset.tsv.gz"
[1] "LD:: Using UK Biobank LD reference panel."
[1] "+ UKB LD file name: chr9_112000001_115000001"
[1] "+ LD:: Downloading full .gz/.npz UKB files and saving to disk."
[1] "+ Overwriting pre-existing file."
[1] "+ CONDA:: Identified axel executable in echoR env."
[1] "+ Overwriting pre-existing file."
[1] "+ CONDA:: Identified axel executable in echoR env."
[1] "+ LD:: load_ld() python function input: /data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/LD/chr9_112000001_115000001"
[1] "+ LD:: Reading LD matrix into memory. This could take some time..."
/data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/LD/chr9_112000001_115000001.gz
/data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/LD/chr9_112000001_115000001.npz
Processed URL: /data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/LD/chr9_112000001_115000001
Some other message at the end
[1] "+ Full UKB LD matrix: 22590 x 22590"
[1] "+ Full UKB LD SNP data.table: 22590 x 5"
[1] "UKB MAF:: Extracting MAF from UKB reference."
[1] "+ UKB MAF:: Importing pre-existing file"
Error in `[.data.table`(y, x, nomatch = if (all.x) NA else NULL, on = by,  : 
  logical error. i is not a data.table, but 'on' argument is provided.
Fine-mapping complete in:
Time difference of 1.4 mins
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: SNP

bschilder commented 2 years ago

Hi @alexMarCar, please accept my sincerest apologies for not following up on this.. I finally have some time to go back and address these issues, as well as make some changes to echolocatoR to make it more robust.

Locus names

You're correct, locus names are arbitrary. The only thing that matters is that the loci you specify in loci= matches up with the "Locus" column in "top_SNPs". The "Genes" column within "top_SNPs" is optional, as it's only used when a "Locus" column is not given.

Error messages

Given the error above, it's possible there were simply no overlapping RSIDS between your GWAS locus and the LD panel.

Could you tell me how many SNPs are within "/data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/locus_chr9_rs2821150_GWAS_coloc_subset.tsv.gz"?

alexMarCar commented 2 years ago

Hi,

My apologies this time.

Re. No SNPs overlapping. I see, it makes sense. The number of SNPs in "data/kronos/kronos/acarrasco/FOR_MARYAM/results/GWAS/GWAS_coloc/locus_chr9_rs2821150/locus_chr9_rs2821150_GWAS_coloc_subset.tsv.gz" is 1367.

bschilder commented 2 years ago

No worries, I can add a message making it more clear when the function fails due to lack of overlapping RSIDS. Perhaps with a backup strategy that involves using the genomic coordinates.