Error in `[.data.table`(x, r, vars, with = FALSE) : column(s) not found: prob

This issue is a continuation of #72. Below is an example in which some SNPs have a prob_col value above the threshold, but echolocatoR fails to assign CS and PP to them.

[1] "+ FINEMAP:: Importing prob (.snp)..."
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  column(s) not found: prob
The error message shows that the "prob" column cannot be found.

Even though the cred5 file is present, so the tool should have used FINEMAP.import_data.cred(), execution goes through FINEMAP.import_data.snp() instead:

  FINEMAP.import_data.snp <- function(locus_dir,
    # NOTES:
    ## .snp files: Posterior probabilities in this file are the marginal posterior probability
    ## that a given variant is causal.

    # Prob column descriptions:
    ## prob: column the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l-th SNP is the posterior probability that this SNP is causal.
    ## prob_group: the posterior probability that there is at least one causal signal among SNPs in the same group with this SNP.
    printer("+ FINEMAP:: Importing",prob_col,"(.snp)...", v=verbose)
    data.snp <- data.table::fread(file.path(locus_dir,"FINEMAP/data.snp"), nThread = 1)
    data.snp <- data.snp[data.snp[[prob_col]] > credset_thresh,] %>%

When I run this manually, the prob column is present and the code works:

r$> locus_dir                                                                                      
[1] "/mnt/rreal/RDS/acarrasco/ANALYSES_WORKSPACE/EARLY_PD/POST_GWAS/ECHOLOCATOR/RESULTS/mixedmodels_GWAS/earlymotorPD_axial/MAD1L1/"
r$> data.table::fread(file.path(locus_dir,"FINEMAP/data.snp"), nThread = 1)                        
      index       rsid chromosome position allele1 allele2     maf    beta     se        z
   1:  2535 rs10479762          7  2045351       T       C 0.02180  0.4526 0.1002  4.51697
   2:  3022  rs3800908          7  2159437       T       C 0.02300  0.4925 0.1005  4.90050
   3:  2597 rs11764212          7  2067593       A       C 0.02380  0.5244 0.1022  5.13112
   4:  3027  rs3778978          7  2159817       A       G 0.02760 -0.5088 0.1019 -4.99313
   5:  2495 rs11765549          7  2027311       T       G 0.02270  0.5291 0.1037  5.10222
5864:  2941  rs4719432          7  2140330       A       G 0.02550 -0.5066 0.1004 -5.04582
5865:  2925  rs3778965          7  2138296       A       G 0.02545  0.4901 0.1019  4.80962
5866:  2947  rs4719436          7  2141239       A       G 0.02495  0.4780 0.1016  4.70472
5867:  2547 rs13227554          7  2048220       C       G 0.02210 -0.4523 0.1002 -4.51397
5868:  2923  rs3778964          7  2138109       T       C 0.02560  0.4868 0.1020  4.77255
          prob  log10bf      mean       sd mean_incl  sd_incl
   1: 1.000000 13.43200 -0.310494 0.534601 -0.310494 0.534601
   2: 0.999531  6.89911 -0.301709 0.260602 -0.301850 0.260582
   3: 0.997378  6.15053 -0.307319 0.532777 -0.308127 0.533244
   4: 0.815520  4.21586 -0.232915 0.444014 -0.285603 0.476128
   5: 0.747752  4.04231 -0.262132 0.497410 -0.350560 0.547614
5864: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5865: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5866: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5867: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
5868: 0.000000     -Inf  0.000000 0.000000  0.000000 0.000000
r$> data.snp[data.snp[[prob_col]] > credset_thresh, ] %>% 
          plyr::mutate(CS=1) %>% 
          dplyr::rename(PP=dplyr::all_of(prob_col)) -> data.snp                                    

r$> data.snp                                                                                       
   index       rsid chromosome position allele1 allele2    maf   beta     se       z       PP
1:  2535 rs10479762          7  2045351       T       C 0.0218 0.4526 0.1002 4.51697 1.000000
2:  3022  rs3800908          7  2159437       T       C 0.0230 0.4925 0.1005 4.90050 0.999531
3:  2597 rs11764212          7  2067593       A       C 0.0238 0.5244 0.1022 5.13112 0.997378
    log10bf      mean       sd mean_incl  sd_incl CS
1: 13.43200 -0.310494 0.534601 -0.310494 0.534601  1
2:  6.89911 -0.301709 0.260602 -0.301850 0.260582  1
3:  6.15053 -0.307319 0.532777 -0.308127 0.533244  1
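
The manual run above can also be mirrored outside R. The following is a minimal Python sketch of the same filter step (keep rows above the threshold, rename the probability column to PP, set CS=1) with an explicit column check up front, so a missing column is reported together with the columns that are actually present. It is not echolocatoR's code: the function name `filter_credible_snps` is hypothetical, the input is assumed to be single-space-delimited text, and `credset_thresh=0.95` is an assumed value (the thread does not state it; 0.95 is merely consistent with the three rows kept above). The sample rows are the rsid and prob values from the output above.

```python
import csv
import io


def filter_credible_snps(snp_text, prob_col="prob", credset_thresh=0.95):
    """Keep rows whose prob_col exceeds credset_thresh; rename it to PP and set CS=1.

    Hypothetical helper: assumes single-space-delimited text with a header row.
    The 0.95 default is an assumption, not a value taken from the thread.
    """
    reader = csv.DictReader(io.StringIO(snp_text), delimiter=" ")
    columns = reader.fieldnames or []
    # Fail early, listing what is available, instead of a bare "column not found".
    if prob_col not in columns:
        raise KeyError(f"column not found: {prob_col!r}; available: {columns}")
    kept = []
    for row in reader:
        pp = float(row.pop(prob_col))  # remove the original column, keep its value
        if pp > credset_thresh:
            row["PP"] = pp
            row["CS"] = 1
            kept.append(row)
    return kept


# rsid/prob pairs copied from the data.snp output shown above.
sample = """rsid prob
rs10479762 1.000000
rs3800908 0.999531
rs11764212 0.997378
rs3778978 0.815520
rs11765549 0.747752
"""

credible = filter_credible_snps(sample)
print([row["rsid"] for row in credible])
```

With the assumed threshold, this keeps the same three SNPs as the manual R run, which suggests the failure lies in how the column is looked up inside the package rather than in the file itself.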

mkoromina commented 2 years ago

Hi, I also receive the same error message, even when trying to run the vignette. Did you manage to find a workaround for this issue?

bschilder commented 2 years ago

Hi @AMCalejandro and @mkoromina, thanks for bringing this to my attention. I haven't had much time to work on this project in a while but I will try to get back to it soon (possibly this weekend). In the meantime, PRs are more than welcome!

I should also note that I'm working on a long-term project to modularize echolocatoR into different subpackages (with proper unit tests) to help minimize errors.

Thank you for your patience.

mkoromina commented 2 years ago

Hi @bschilder, may I also note that, apart from the above-mentioned error message (Error in [.data.table(x, r, vars, with = FALSE) : column(s) not found: SNP), I also get this one: "Error in cDict[[chrom_col]] : subscript out of bounds". However, the sumstats have been munged beforehand (.parquet format). Any advice as to why this error message pops up? Thank you very much in advance.

bschilder commented 2 years ago

OK, so I think I've fixed this issue with reading in .cred files, as well as with reading in .snp files (when .cred is not available). See here: https://github.com/RajLabMSSM/echolocatoR/issues/72#issuecomment-1059423642

bschilder commented 2 years ago

@mkoromina are you trying to feed in a parquet file to echolocatoR? It currently only supports whatever formats are supported by data.table::fread, so .tsv.gz for example.

mkoromina commented 2 years ago

Hi @bschilder, I am loading .gz files, which are munged sumstats produced by ldsc. Do you suggest making any amendments to them? Thanks a lot!

bschilder commented 2 years ago

@mkoromina I don't recommend using LDSC's Python script for munging sumstats, since it makes a lot of assumptions about column identities (e.g. A1/A2), doesn't have as many colname mappings, doesn't perform any QC or genome build validation, and doesn't map SNP RSIDs to a standard nomenclature (amongst other limitations).

Please use MungeSumstats, which is much more robust. This is specifically what the munged=TRUE flag in echolocatoR::finemap_loci refers to.

Here are the docs on finemap_loci: https://rajlabmssm.github.io/echolocatoR/reference/finemap_loci.html

mkoromina commented 2 years ago

@bschilder, thanks so much for this. May I ask whether sumstats munged via polyfun's Python script will work with echolocatoR? If yes, is there a way of converting them to .tsv.gz files? I will try your MungeSumstats recommendation as well. Thanks a lot!

bschilder commented 2 years ago

Sure, it can still potentially work. You can just use pandas to read the parquet files into Python and then write them out as tab-delimited files.

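import pandas as pd
# Read the parquet file (requires a parquet engine such as pyarrow).
dat = pd.read_parquet("<file_path>")
# index=False keeps the row index from being written as an extra unnamed first column.
dat.to_csv("<new_path>.tsv.gz", sep="\t", index=False)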

You could also try out the new read/write_parquet functions I've added to echodata, though this does depend on a functioning echoR conda environment. So it might be simpler to just use Python directly.
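
Since the earlier "column(s) not found: SNP" error came down to column names, it may be worth checking the header of the converted file before passing it on. Below is a minimal sketch using only the Python standard library. The helper name `missing_columns` is hypothetical, and the column names in the demo are illustrative: "SNP" comes from the error message above, while "CHR", "POS" and "BP" are assumptions and not a statement of what echolocatoR requires.

```python
import csv
import gzip
import os
import tempfile


def missing_columns(path, required, delimiter="\t"):
    """Return the names in `required` that are absent from the header of a gzipped delimited file."""
    with gzip.open(path, "rt", newline="") as handle:
        # Only the first row (the header) is read; the rest of the file is untouched.
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return [col for col in required if col not in header]


# Toy file for demonstration only: one SNP from the thread, with assumed column names.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(
        [["SNP", "CHR", "POS"], ["rs10479762", "7", "2045351"]]
    )

# "BP" is deliberately absent from the toy header, so it is reported as missing.
print(missing_columns(demo_path, ["SNP", "CHR", "BP"]))
```

An empty list means every requested column is present; anything else names the columns to rename or add before running the pipeline.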