RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
30 stars 11 forks source link

finemap_loci works for some loci, but not others? #139

Open dym22 opened 2 months ago

dym22 commented 2 months ago

1. Bug description

When running finemap_loci on my 11 topSNPs, it proceeds successfully with 4 of them but fails for the other 7. The error (when it occurs) appears to happen during Step 2: Extract Linkage Disequilibrium. Specifically, for 7 of the loci, the console outputs "invalid 'path' argument" right after attempting to query the VCF tabix file.

2. Reproducible example

Note: the error also occurs when not using the forcenew* arguments. I am just including them so that it will reproduce the error as if running from scratch.

results <- echolocatoR::finemap_loci(
  fullSS_path = "/ix/ccdg/storage3/dym22/echo/sumstats_formatted",
  topSNPs = "/ix/ccdg/storage3/dym22/echo/topSNPs",
  LD_reference = "1KGphase3" ,
  superpopulation = "EAS",
  dataset_name = "OFC2",
  fullSS_genome_build = "GRCh38",
  bp_distance = 5e5,
  finemap_methods = c("ABF","SUSIE","FINEMAP"),
  munged = TRUE,
  results_dir = "/ix/ccdg/storage3/dym22/echo/echo_results",
  force_new_LD = T,
  force_new_subset = T,
  force_new_finemap = T
)

Console output

When running correctly:

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Step 2 ▶▶▶ Extract Linkage Disequilibrium 🔗 ───────────────────────────────────────────────────────────────────

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
LD_reference identified as: 1kg.
Using 1000Genomes as LD reference panel.
Constructing GRanges query using min/max ranges across one or more chromosomes.
+ as_blocks=TRUE: Will query a single range per chromosome that covers all regions requested (plus anything in between).
LD Reference Panel = 1KGphase3
Querying 1KG remote server.
Selecting 504 EAS individuals from 1kgphase3.
Performing liftover:  hg38  ==>  hg19
Using existing chain file.
========= echotabix::query =========
query_dat is already a GRanges object. Returning directly.
Explicit format: 'vcf'
Performing liftover:  hg38  ==>  hg19
Using existing chain file.
Querying VCF tabix file.
Querying VCF file using: VariantAnnotation
Checking query chromosome style is correct.
Chromosome format: 1
Filtering query to 504 samples and returning ScanVcfParam object.
Retrieving data.
Time difference of 8.6 secs
Removing 610 / 29,392 non-overlapping SNPs.
Saving VCF subset ==> /scratch/slurm-3150011/Rtmppqbvy3/VCF/Rtmppqbvy3.chr13-34361699-35361036.ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz
Time difference of 0.4 secs
Retrieved data with 610 rows across 504 samples.
echoLD::snpStats:: `MAF` column already present.
echoLD:snpStats:: Computing pairwise LD between 610 SNPs across 504 individuals (stats = R).
Time difference of 0.7 secs
610 x 610 LD_matrix (sparse)
Converting obj to sparseMatrix.
Saving sparse LD matrix ==> /ix/ccdg/storage3/dym22/echo/echo_results/GWAS/OFC2/chr13_33713257_A_T/LD/chr13_33713257_A_T.1KGphase3_LD.RDS
+ FILTER:: Filtering by LD features.

When running incorrectly:

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Step 2 ▶▶▶ Extract Linkage Disequilibrium 🔗 ───────────────────────────────────────────────────────────────────

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
LD_reference identified as: 1kg.
Using 1000Genomes as LD reference panel.
Constructing GRanges query using min/max ranges across one or more chromosomes.
+ as_blocks=TRUE: Will query a single range per chromosome that covers all regions requested (plus anything in between).
LD Reference Panel = 1KGphase3
Querying 1KG remote server.
Selecting 504 EAS individuals from 1kgphase3.
Performing liftover:  hg38  ==>  hg19
Using existing chain file.
========= echotabix::query =========
query_dat is already a GRanges object. Returning directly.
Explicit format: 'vcf'
Performing liftover:  hg38  ==>  hg19
Using existing chain file.
Querying VCF tabix file.
invalid 'path' argumentLocus chr7_111475695_A_C complete in: 0.02 min

Data

[dym22@login0b echo]$ zcat sumstats_formatted.bgz | head
SNP CHR BP  A1  A2  ID  N   FRQ T   SE_T    P   BETA    SE  P_INPUT CONVERGE
rs376804038 1   73209   C   T   chr1:73209:C:T  1399    0.9874911   -1.11188    2.60166 0.669106    0.164271    0.384371    0.669106    1
rs377391031 1   98667   T   C   chr1:98667:T:C  1399    0.9899929   -2.41118    2.41094 0.317263    0.414817    0.414776    0.317263    1
rs1373207528    1   148488  A   G   chr1:148488:A:G 1399    0.00393100000000002 1.68691 1.57535 0.284251    -0.679735   0.63478 0.284251    1
rs1490700034    1   514484  C   T   chr1:514484:C:T 1399    0.9896355   -2.59702    2.4615  0.291399    0.428624    0.406257    0.291399    1
rs201764041 1   595259  G   A   chr1:595259:G:A 1399    0.9446033   -2.61526    5.48041 0.633218    0.0870741   0.182468    0.633218    1
rs61769339  1   727242  G   A   chr1:727242:G:A 1399    0.9124375   6.8175  6.82541 0.317872    -0.146341   0.146511    0.317872    1
rs138476838 1   732994  G   A   chr1:732994:G:A 1399    0.738385    6.5593  10.0037 0.512027    -0.0655441  0.0999627   0.512027    1
rs12238997  1   758351  A   G   chr1:758351:A:G 1399    0.9124375   6.3551  6.83119 0.352213    -0.136185   0.146387    0.352213    1
rs61769351  1   758443  G   C   chr1:758443:G:C 1399    0.915654    8.83591 6.7454  0.190224    -0.194194   0.148249    0.190224    1
[dym22@login0b echo]$ cat topSNPs
SNP,CHR,POS,A1,A2,Locus,N,Freq,T,SE_T,P,Effect,StdErr,P_INPUT,CONVERGE,Gene
rs529674375,12,101376840,G,A,chr12_101376840_G_A,1399,0.9656898,-18.8586,4.1128,4.53268e-06,1.11489,0.24349,4.67656e-06,1,chr12_101376840_G_A
rs9668896,12,93677236,C,T,chr12_93677236_C_T,1399,0.9417441,-26.9601,5.52197,1.04842e-06,0.884163,0.180987,1.03305e-06,1,chr12_93677236_C_T
rs74399411,13,106424444,T,C,chr13_106424444_T_C,1399,0.9889207,-11.5852,2.31447,5.57018e-07,2.16272,0.451328,1.65199e-06,1,chr13_106424444_T_C
rs56360313,13,33713257,A,T,chr13_33713257_A_T,1399,0.68549,51.7592,11.1113,3.18911e-06,-0.419235,0.0903915,3.51809e-06,1,chr13_33713257_A_T
rs4329516,1,209849556,C,T,chr1_209849556_C_T,1399,0.74589,-56.3087,10.6266,1.16535e-07,0.49864,0.093617,1.00184e-07,1,chr1_209849556_C_T
rs115562318,1,233144048,G,A,chr1_233144048_G_A,1399,0.9292352,-28.9316,6.12655,2.3315e-06,0.770799,0.164069,2.62692e-06,1,chr1_233144048_G_A
rs55901108,1,62226519,G,T,chr1_62226519_G_T,1399,0.9792709,-16.5888,3.45108,1.53327e-06,1.39286,0.290675,1.65297e-06,1,chr1_62226519_G_T
rs17461953,1,94085894,A,C,chr1_94085894_A_C,1399,0.832023,46.0283,8.81692,1.78488e-07,-0.592094,0.113384,1.76992e-07,1,chr1_94085894_A_C
rs6812051,4,76569817,A,G,chr4_76569817_A_G,1399,0.467119,-61.9498,11.8453,1.69601e-07,0.441516,0.0843009,1.62866e-07,1,chr4_76569817_A_G
rs9446804,6,72986179,G,A,chr6_72986179_G_A,1399,0.753038,-55.8363,10.1128,3.36418e-08,0.545975,0.0992578,3.78562e-08,1,chr6_72986179_G_A
rs143676388,7,111475695,A,C,chr7_111475695_A_C,1399,0.9446033,-26.4509,5.36831,8.34029e-07,0.917837,0.186902,9.07043e-07,1,chr7_111475695_A_C

3. Session info

> utils::sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2 
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reprex_2.0.2        snpStats_1.50.0     Matrix_1.6-4        survival_3.5-5      topr_2.0.0         
 [6] MungeSumstats_1.8.0 lubridate_1.9.2     forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
[11] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1         tibble_3.2.1        ggplot2_3.5.0      
[16] tidyverse_2.0.0     data.table_1.15.4   echolocatoR_2.0.3  

loaded via a namespace (and not attached):
  [1] fs_1.6.3                                    ProtGenerics_1.34.0                        
  [3] matrixStats_1.3.0                           bitops_1.0-7                               
  [5] EnsDb.Hsapiens.v75_2.99.0                   httr_1.4.7                                 
  [7] RColorBrewer_1.1-3                          Rgraphviz_2.44.0                           
  [9] tools_4.3.0                                 backports_1.4.1                            
 [11] utf8_1.2.4                                  R6_2.5.1                                   
 [13] DT_0.27                                     lazyeval_0.2.2                             
 [15] withr_3.0.0                                 prettyunits_1.2.0                          
 [17] GGally_2.2.1                                gridExtra_2.3                              
 [19] textshaping_0.3.6                           cli_3.6.2                                  
 [21] Biobase_2.62.0                              labeling_0.4.3                             
 [23] ggbio_1.50.0                                mvtnorm_1.2-4                              
 [25] proxy_0.4-27                                mixsqp_0.3-54                              
 [27] systemfonts_1.0.4                           Rsamtools_2.16.0                           
 [29] foreign_0.8-84                              R.utils_2.12.3                             
 [31] dichromat_2.0-0.1                           styler_1.9.1                               
 [33] BSgenome_1.70.2                             maps_3.4.2                                 
 [35] readxl_1.4.3                                susieR_0.12.35                             
 [37] rstudioapi_0.16.0                           RSQLite_2.3.6                              
 [39] httpcode_0.3.0                              pals_1.8                                   
 [41] generics_0.1.3                              BiocIO_1.12.0                              
 [43] echoconda_0.99.9                            zip_2.3.0                                  
 [45] fansi_1.0.6                                 DescTools_0.99.54                          
 [47] S4Vectors_0.40.2                            catalogueR_1.0.1                           
 [49] abind_1.4-5                                 R.methodsS3_1.8.2                          
 [51] lifecycle_1.0.4                             yaml_2.3.8                                 
 [53] SummarizedExperiment_1.32.0                 SparseArray_1.2.4                          
 [55] BiocFileCache_2.10.1                        echoplot_0.99.7                            
 [57] grid_4.3.0                                  blob_1.2.4                                 
 [59] crayon_1.5.2                                dir.expiry_1.8.0                           
 [61] lattice_0.21-8                              GenomicFeatures_1.54.4                     
 [63] KEGGREST_1.42.0                             mapproj_1.2.11                             
 [65] pillar_1.9.0                                knitr_1.46                                 
 [67] GenomicRanges_1.54.1                        rjson_0.2.21                               
 [69] osfr_0.2.9                                  boot_1.3-28.1                              
 [71] gld_2.6.6                                   SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.24   
 [73] SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24    codetools_0.2-19                           
 [75] glue_1.7.0                                  remotes_2.4.2.1                            
 [77] coloc_5.2.3                                 vctrs_0.6.5                                
 [79] png_0.1-8                                   XGR_1.1.9                                  
 [81] cellranger_1.1.0                            gtable_0.3.5                               
 [83] assertthat_0.2.1                            cachem_1.0.8                               
 [85] dnet_1.1.7                                  xfun_0.43                                  
 [87] openxlsx_4.2.5.2                            S4Arrays_1.2.1                             
 [89] gargle_1.4.0                                nlme_3.1-162                               
 [91] usethis_2.1.6                               bit64_4.0.5                                
 [93] progress_1.2.3                              filelock_1.0.3                             
 [95] googleAuthR_2.0.1                           GenomeInfoDb_1.38.8                        
 [97] R.cache_0.16.0                              irlba_2.3.5.1                              
 [99] rpart_4.1.19                                colorspace_2.1-0                           
[101] BiocGenerics_0.48.1                         DBI_1.2.2                                  
[103] Hmisc_5.1-2                                 nnet_7.3-18                                
[105] processx_3.8.1                              Exact_3.2                                  
[107] tidyselect_1.2.1                            bit_4.0.5                                  
[109] compiler_4.3.0                              curl_5.0.0                                 
[111] graph_1.78.0                                htmlTable_2.4.2                            
[113] expm_0.999-9                                basilisk.utils_1.12.0                      
[115] xml2_1.3.4                                  DelayedArray_0.28.0                        
[117] rtracklayer_1.62.0                          checkmate_2.3.1                            
[119] scales_1.3.0                                hexbin_1.28.3                              
[121] callr_3.7.3                                 RBGL_1.78.0                                
[123] echoLD_0.99.9                               RCircos_1.2.2                              
[125] rappdirs_0.3.3                              supraHex_1.40.0                            
[127] digest_0.6.35                               piggyback_0.1.5                            
[129] rmarkdown_2.26                              basilisk_1.12.0                            
[131] XVector_0.42.0                              BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
[133] htmltools_0.5.8.1                           pkgconfig_2.0.3                            
[135] base64enc_0.1-3                             MatrixGenerics_1.14.0                      
[137] echodata_0.99.17                            dbplyr_2.5.0                               
[139] fastmap_1.1.1                               ensembldb_2.26.0                           
[141] rlang_1.1.3                                 htmlwidgets_1.6.4                          
[143] farver_2.1.1                                echofinemap_0.99.5                         
[145] jsonlite_1.8.8                              BiocParallel_1.36.0                        
[147] R.oo_1.26.0                                 VariantAnnotation_1.46.0                   
[149] RCurl_1.98-1.12                             magrittr_2.0.3                             
[151] Formula_1.2-5                               GenomeInfoDbData_1.2.11                    
[153] ggnetwork_0.5.13                            patchwork_1.2.0                            
[155] munsell_0.5.1                               Rcpp_1.0.12                                
[157] ape_5.7-1                                   ggnewscale_0.4.10                          
[159] viridis_0.6.5                               reticulate_1.28                            
[161] stringi_1.8.3                               rootSolve_1.8.2.4                          
[163] zlibbioc_1.48.2                             MASS_7.3-59                                
[165] plyr_1.8.9                                  ggstats_0.5.1                              
[167] parallel_4.3.0                              ggrepel_0.9.5                              
[169] lmom_3.0                                    echoannot_0.99.11                          
[171] Biostrings_2.70.3                           splines_4.3.0                              
[173] hms_1.1.3                                   ps_1.7.5                                   
[175] igraph_1.4.2                                BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000     
[177] reshape2_1.4.4                              biomaRt_2.58.2                             
[179] stats4_4.3.0                                crul_1.4.0                                 
[181] XML_3.99-0.14                               evaluate_0.23                              
[183] biovizBase_1.50.0                           BiocManager_1.30.22                        
[185] tzdb_0.4.0                                  reshape_0.8.9                              
[187] echotabix_0.99.10                           restfulr_0.0.15                            
[189] AnnotationFilter_1.26.0                     e1071_1.7-14                               
[191] downloadR_0.99.6                            ragg_1.2.5                                 
[193] viridisLite_0.4.2                           class_7.3-21                               
[195] OrganismDbi_1.44.0                          memoise_2.0.1                              
[197] AnnotationDbi_1.64.1                        GenomicAlignments_1.38.2                   
[199] IRanges_2.36.0                              cluster_2.1.4                              
[201] timechange_0.2.0 
bschilder commented 2 months ago

It's possible this is related to your sumstats being in hg38, while 1KG is in hg19. The liftover step is imperfect and can result in losing certain SNPs: https://github.com/RajLabMSSM/echoLD/issues/11

Will look into whether there's a way to make the liftover more robust (if that is indeed the problem), or at very least provide the user with a more informative error.

dym22 commented 2 months ago

I had not considered that but pretty easy fix--just ran MungeSumstats::liftover to get them in 19, adjusted everything else accordingly, and it ran without incident.

I'm going to try to get the in-sample LD version running when I have a little more time, but very relieved to at least have something to work with. Thank you so much for your help!