RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
34 stars 11 forks source link

Error in `[.data.table`(x, r, vars, with = FALSE) : column(s) not found: prob #74

Closed AMCalejandro closed 2 years ago

AMCalejandro commented 2 years ago

This issue is a continuation somehow #72 I am showing an example in which we have snps with a prob_col value higher than the threshold but echolocatoR fails in assigning CS and PP to SNPs

acarrasco@kronos:/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/FINEMAP$ cat ../../../../../nohup.out 
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Bioconductor version '3.12' is out-of-date; the current release version '3.14'
  is available with R version '4.1'; see https://bioconductor.org/install
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.5
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

    Locus   Gene CHR     POS       SNP         P Effect
1: MAD1L1 MAD1L1   7 2153071 rs4721411 1.657e-07 0.5318
[1] "+ CONDA:: Activating conda env 'echoR'"
[1] "Checking for tabix installation..."
[1] "Checking for bcftools installation..."

)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
MAD1L1 (1 / 1)
)   )  ) ))))))}}}}}}}} {{{{{{{{{(((((( (  (   (
[1] "Determining chrom type from file header"
[1] "LD:: Standardizing summary statistics subset."
[1] "++ Preparing Gene col"
[1] "++ Preparing A1,A1 cols"
[1] "++ Preparing MAF,Freq cols"
[1] "++ Inferring MAF from frequency column..."
[1] "++ Preparing N_cases,N_controls cols"
[1] "++ Preparing `proportion_cases` col"
[1] "++ 'proportion_cases' not included in data subset."
[1] "++ Preparing N col"
[1] "+ Preparing sample_size (N) column"
[1] "++ Using `sample_size = 3572` TRUE"
[1] "++ Preparing t-stat col"
[1] "+ Calculating t-statistic from Effect and StdErr..."
[1] "++ Assigning lead SNP"
[1] "++ Ensuring Effect, StdErr, P are numeric"
[1] "++ Ensuring 1 SNP per row"
[1] "++ Removing extra whitespace"
[1] "++ Saving subset ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/MAD1L1_earlymotorPD_axial_subset.tsv.gz"
[1] "LD:: Using UK Biobank LD reference panel."
[1] "+ UKB LD file name: chr7_1000001_4000001"
[1] "+ LD:: Downloading full .gz/.npz UKB files and saving to disk."
[1] "+ Overwriting pre-existing file."
[1] "+ CONDA:: Identified axel executable in echoR env."
[1] "+ Overwriting pre-existing file."
[1] "+ CONDA:: Identified axel executable in echoR env."
[1] "+ LD:: load_ld() python function input: /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001"
[1] "+ LD:: Reading LD matrix into memory. This could take some time..."
Processed URL: /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001
Some other message at the end
[1] "+ Full UKB LD matrix: 30690 x 30690"
[1] "+ Full UKB LD SNP data.table: 30690 x 5"
[1] "+ LD:: Saving LD matrix ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/MAD1L1.UKB_LD.RDS"
[1] "5872 x 5872 LD_matrix (sparse)"
vvvvv-- ABF --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- FINEMAP --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- SUSIE --vvvvv
✅ All required columns present.
✅ All suggested columns present.
vvvvv-- POLYFUN_SUSIE --vvvvv
✅ All required columns present.
✅ All suggested columns present.

+++ Multi-finemap:: ABF +++

+++ Multi-finemap:: FINEMAP +++
[1] "++ FINEMAP:: Constructing master file."
[1] "++ FINEMAP:: Constructing data.z file."
[1] "++ FINEMAP:: Constructing data.ld file."
[1] "+ Using FINEMAP v1.4"

|--------------------------------------|
| Welcome to FINEMAP v1.4              |
|                                      |
| (c) 2015-2020 University of Helsinki |
|                                      |
| Help :                               |
| - ./finemap --help                   |
| - www.finemap.me                     |
| - www.christianbenner.com            |
|                                      |
| Contact :                            |
| - christian.benner@helsinki.fi       |
| - matti.pirinen@helsinki.fi          |
|--------------------------------------|

--------
SETTINGS
--------
- dataset            : all
- corr-config        : 0.95
- n-causal-snps      : 5
- n-configs-top      : 50000
- n-conv-sss         : 100
- n-iter             : 100000
- n-threads          : 1
- prior-k0           : 0
- prior-std          : 0.05 
- prob-conv-sss-tol  : 0.001
- prob-cred-set      : 0.95

------------
FINE-MAPPING (1/1)
------------
- GWAS summary stats               : FINEMAP/data.z
- SNP correlations                 : FINEMAP/data.ld
- Causal SNP stats                 : FINEMAP/data.snp
- Causal configurations            : FINEMAP/data.config
- Credible sets                    : FINEMAP/data.cred
- Log file                         : FINEMAP/data.log_sss
- Reading input                    : done!   

- Number of GWAS samples           : 3572
- Number of SNPs                   : 5868
- Prior-Pr(# of causal SNPs is k)  : 
  (0 -> 0)
   1 -> 0.583
   2 -> 0.291
   3 -> 0.0971
   4 -> 0.0243
   5 -> 0.00485
- 154411 configurations evaluated (0.113/100%) : converged after 113 iterations
- Computing causal SNP statistics  : done!   
- Regional SNP heritability        : 0.124 (SD: 0.0667 ; 95% CI: [0.0383,0.293])
- Log10-BF of >= one causal SNP    : 169
- Post-expected # of causal SNPs   : 5
- Post-Pr(# of causal SNPs is k)   : 
  (0 -> 0)
   1 -> 0
   2 -> 0
   3 -> 0
   4 -> 4.67e-19
   5 -> 1
- Computing credible sets          : done!
- Writing output                   : done!   
- Run time                         : 0 hours, 1 minutes, 2 seconds
[1] ".cred not detected. Using .snp instead."
[1] "+ FINEMAP:: Importing prob (.snp)..."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: prob
In addition: Warning message:
In sdY.est(d$varbeta, d$MAF, d$N) :
  estimating sdY from maf and varbeta, please directly supply sdY if known

+++ Multi-finemap:: SUSIE +++
For large R or large XtX, consider installing the  Rfast package for better performance.

+++ Multi-finemap:: POLYFUN_SUSIE +++
[1] "PolyFun:: Preparing SNP input file..."
[1] "+ PolyFun:: 5868 SNPs identified."
[1] "+ PolyFun:: Writing SNP file ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_to_finemap.txt.gz"
[1] "/home/acarrasco/.conda/envs/echoR/bin/python /home/acarrasco/.conda/envs/echoR/lib/R/library/echolocatoR/tools/polyfun//extract_snpvar.py --snps /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_to_finemap.txt.gz --out /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_with_priors.snpvar.tsv.gz"
[INFO]  Writing output file to /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/PolyFun/snps_with_priors.snpvar.tsv.gz
[1] "++ Remove tmp file."
[1] "+ SUSIE:: sample_size= 3572"
[1] "+ SUSIE:: max_causal = 5"
[1] "+ SUSIE:: Utilizing prior_weights for 5868 SNPs."
[1] "+ SUSIE:: Rescaling priors"
[1] "+ Subsetting LD matrix and finemap_dat to common SNPs..."
[1] "+ LD:: Removing unnamed rows/cols"
[1] "+ LD:: Replacing NAs with 0"
[1] "+ LD_matrix = 5868 SNPs."
[1] "+ finemap_dat = 5868 SNPs."
[1] "+ 5868 SNPs in common."
[1] "+ SUSIE:: Using `susie_suff_stat()` from susieR v0.11.92"
For large R or large XtX, consider installing the  Rfast package for better performance.
[1] "+ SUSIE:: Extracting Credible Sets..."
Time difference of 8.18 mins
+-------- Locus Plot: MAD1L1--------+
[1] "+ Filling NAs in CS cols with 0"
[1] "+ Filling NAs in PP cols with 0"
[1] "+ LD:: LD_matrix detected. Coloring SNPs by LD with lead SNP."
max_transcripts=1. 
37 transcripts from 37 genes returned.
Fetching data...OK
Parsing exons...OK
Defining introns...OK
Defining UTRs...OK
Defining CDS...OK
aggregating...
Done
Constructing graphics...
[1] "+ NOTT_2019:: Importing... [1] exvivo_H3K27ac_tbp"
[1] "+ NOTT_2019:: Importing... [2] microglia_H3K27ac"
[1] "+ NOTT_2019:: Importing... [3] neurons_H3K27ac"
[1] "+ NOTT_2019:: Importing... [4] oligodendrocytes_H3K27ac"
[1] "+ NOTT_2019:: Importing... [5] astrocytes_H3K27ac"
[1] "+ NOTT_2019:: Importing... [6] exvivo_atac_tbp"
[1] "+ NOTT_2019:: Importing... [7] microglia_atac"
[1] "+ NOTT_2019:: Importing... [8] neurons_atac"
[1] "+ NOTT_2019:: Importing... [9] oligodendrocytes_atac"
[1] "+ NOTT_2019:: Importing... [10] astrocytes_atac"
[1] "+ NOTT_2019:: Importing... [11] microglia_H3K4me3"
[1] "+ NOTT_2019:: Importing... [12] neurons_H3K4me3"
[1] "+ NOTT_2019:: Importing... [13] oligodendrocytes_H3K4me3"
[1] "+ NOTT_2019:: Importing... [14] astrocytes_H3K4me3"
[1] "+ Inferring genomic limits for window: 1x"
3164573 query SNP(s) detected with reference overlap.
[1] "+ PLOT:: Calculating max histogram height"
[1] "+ NOTT_2019:: Getting interactome data."
[1] "Importing Astrocyte enhancers ..."
[1] "Importing Astrocyte promoters ..."
[1] "Importing Neuronal enhancers ..."
[1] "Importing Neuronal promoters ..."
[1] "Importing Oligo enhancers ..."
[1] "Importing Oligo promoters ..."
[1] "Importing Microglia enhancers ..."
[1] "Importing Microglia promoters ..."
[1] "Importing Microglia interactome ..."
[1] "Importing Neuronal interactome ..."
[1] "Importing Oligo interactome ..."
2755 query SNP(s) detected with reference overlap.
2959 query SNP(s) detected with reference overlap.
[1] "++ NOTT_2019:: Returning PLAC-seq track."
+>+>+>+>+ plot.zoom = all +<+<+<+<+
[1] "+ PLOT:: Get window suffix..."
[1] "+ Ensuring last track shows genomic units..."
[1] "+ PLOT:: Saving plot ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/multiview.MAD1L1.UKB.1x.jpg"
+>+>+>+>+ plot.zoom = 4x +<+<+<+<+
[1] "+ PLOT:: Get window suffix..."
[1] "+ Constructing zoom polygon..."
[1] "+ Highlighting zoom origin..."
[1] "+ Ensuring last track shows genomic units..."
[1] "+ PLOT:: Saving plot ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/multiview.MAD1L1.UKB.4x.jpg"
+>+>+>+>+ plot.zoom = 10x +<+<+<+<+
[1] "+ PLOT:: Get window suffix..."
[1] "+ Constructing zoom polygon..."
[1] "+ Highlighting zoom origin..."
[1] "+ Ensuring last track shows genomic units..."
[1] "+ PLOT:: Saving plot ==> /mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/multiview.MAD1L1.UKB.10x.jpg"

Fine-mapping complete in:
Time difference of 1.4 hours
There were 50 or more warnings (use warnings() to see the first 50)
/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001.gz
/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/LD/chr7_1000001_4000001.npz

We see on the error message how "prob" column cannot fe found

Even though the cred5 file is present, and the tool should have used FINEMAP.import_data.cred(), I am going through

  FINEMAP.import_data.snp <- function(locus_dir,
                                      credset_thresh=.95,
                                      prob_col="prob",
                                      verbose=T){
    # NOTES:
    ## .snp files: Posterior probabilities in this file are the marginal posterior probability
    ## that a given variant is causal.

    # Prob column descriptions:
    ## prob: column the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l-th SNP is the posterior probability that this SNP is causal.
    ## prob_group: the posterior probability that there is at least one causal signal among SNPs in the same group with this SNP.
    ##
    printer("+ FINEMAP:: Importing",prob_col,"(.snp)...", v=verbose)
    data.snp <- data.table::fread(file.path(locus_dir,"FINEMAP/data.snp"), nThread = 1)
    data.snp <- data.snp[data.snp[[prob_col]] > credset_thresh,] %>%
      plyr::mutate(CS=1)%>%
      dplyr::rename(PP=dplyr::all_of(prob_col))
    return(data.snp)
  }

When I run this manually, we see how the prob column is present, and how the code works

r$> locus_dir                                                                                      
[1] "/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/"
r$> data.table::fread(file.path(locus_dir,"FINEMAP/data.snp"), nThread = 1)                        
      index       rsid chromosome position allele1 allele2     maf    beta     se        z
   1:  2535 rs10479762          7  2045351       T       C 0.02180  0.4526 0.1002  4.51697
   2:  3022  rs3800908          7  2159437       T       C 0.02300  0.4925 0.1005  4.90050
   3:  2597 rs11764212          7  2067593       A       C 0.02380  0.5244 0.1022  5.13112
   4:  3027  rs3778978          7  2159817       A       G 0.02760 -0.5088 0.1019 -4.99313
   5:  2495 rs11765549          7  2027311       T       G 0.02270  0.5291 0.1037  5.10222
  ---                                                                                     
5864:  2941  rs4719432          7  2140330       A       G 0.02550 -0.5066 0.1004 -5.04582
5865:  2925  rs3778965          7  2138296       A       G 0.02545  0.4901 0.1019  4.80962
5866:  2947  rs4719436          7  2141239       A       G 0.02495  0.4780 0.1016  4.70472
5867:  2547 rs13227554          7  2048220       C       G 0.02210 -0.4523 0.1002 -4.51397
5868:  2923  rs3778964          7  2138109       T       C 0.02560  0.4868 0.1020  4.77255
          prob  log10bf      mean       sd mean_incl  sd_incl
   1: 1.000000 13.43200 -0.310494 0.534601 -0.310494 0.534601
   2: 0.999531  6.89911 -0.301709 0.260602 -0.301850 0.260582
   3: 0.997378  6.15053 -0.307319 0.532777 -0.308127 0.533244
   4: 0.815520  4.21586 -0.232915 0.444014 -0.285603 0.476128
   5: 0.747752  4.04231 -0.262132 0.497410 -0.350560 0.547614
  ---                                                        
5864: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5865: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5866: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5867: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5868: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
r$> data.snp[data.snp[[prob_col]] > credset_thresh, ] %>% 
          plyr::mutate(CS=1) %>% 
          dplyr::rename(PP=dplyr::all_of(prob_col)) -> data.snp                                    

r$> data.snp                                                                                       
   index       rsid chromosome position allele1 allele2    maf   beta     se       z       PP
1:  2535 rs10479762          7  2045351       T       C 0.0218 0.4526 0.1002 4.51697 1.000000
2:  3022  rs3800908          7  2159437       T       C 0.0230 0.4925 0.1005 4.90050 0.999531
3:  2597 rs11764212          7  2067593       A       C 0.0238 0.5244 0.1022 5.13112 0.997378
    log10bf      mean       sd mean_incl  sd_incl CS
1: 13.43200 -0.310494 0.534601 -0.310494 0.534601  1
2:  6.89911 -0.301709 0.260602 -0.301850 0.260582  1
3:  6.15053 -0.307319 0.532777 -0.308127 0.533244  1

mkoromina commented 2 years ago

Hi, I also receive the same error message even when trying to run the vignette. Did you manage to find a workaround this issue?

bschilder commented 2 years ago

Hi @AMCalejandro and @mkoromina, thanks for bringing this to my attention. I haven't had much time to work on this project in a while but I will try to get back to it soon (possibly this weekend). In the meantime, PRs are more than welcome!

I should also note that I'm working on a long-term project to modularize echolocatoR into different subpackages (with proper unit tests) to help minimize errors.

Thank you for your patience.

mkoromina commented 2 years ago

Hi @bschilder, may I also note to this end, that apart from the above mentioned error message ([Error in [.data.table(x, r, vars, with = FALSE) : column(s) not found: SNP]), I also get this one _"Error in cDict[[chrom_col]] : subscript out of bounds"._ However, stats have been munged beforehand (.parquet format). Any advice as why this error message pops up? Thank you very much in advance.

bschilder commented 2 years ago

Ok, so i think I've fixed this issue with reading in .cred files, as well as with reading in .snp files (when .cred is not available). See here: https://github.com/RajLabMSSM/echolocatoR/issues/72#issuecomment-1059423642

bschilder commented 2 years ago

@mkoromina are you trying to feed in a parquet file to echolocatoR? It currently only supports whatever formats are supported by data.table::fread, so .tsv.gz for example.

mkoromina commented 2 years ago

Hi @bschilder, I am loading .gz files which are actually munged sumstats produced by ldsc. Do you suggest doing any amendments to it? Thanks a lot!

bschilder commented 2 years ago

@mkoromina I don't recommend using LDSC's python script for munging sumstats since it makes a lot of assumptions of column identities (e.g. A1/A2), doesn't have as many colname mappings, doesn't perform any QC or genome build validation, and doesn't map SNPs RSIDs to a standard nomenclature (amongst other limitations).

Please use MungeSumstats which is much more robust. This is what the munged=TRUE flag is referring to specifically in echolocatoR::finemap_loci.

Here's the docs onfinemap_loci: https://rajlabmssm.github.io/echolocatoR/reference/finemap_loci.html

mkoromina commented 2 years ago

@bschilder , thanks so much for this. May I ask you if munged sumstats via polyfun's respective python script will work on echolocatoR? If yes, is there a way of converting them to .tsv.gz files? Will try your MungeSumstats recommendation too as well. Thanks a lot!

bschilder commented 2 years ago

sure, it can still potentially work. you just use pandas to read the parquets into python and then write them as tab-delimited files.

import pandas as pd
dat = pd.read_parquet("<file_path>")
dat.to_csv("<new_path>.tsv.gz", sep="\t")

You could also try out the new read/write_parquet functions I've added to echodata, though this does depend on a functioning echoR conda environment. So might be simpler to just use python directly.