GRealesM / RapidoPGS

A rapido and lightweight method for PGS computation
11 stars 2 forks source link

How to use rapidopgs_multi #7

Closed samkleeman1 closed 3 years ago

samkleeman1 commented 3 years ago

Many thanks for making this awesome package available! We are trying to use it in our workflow but I am finding that rapidopgs_multi does not work at present. I am getting an error message that I have copied below. We are analyzing a quantitative trait.

Thanks in advance!

Dr Sam Kleeman PhD Student Cold Spring Harbor Laboratory, NY

Error in `[.data.table`(ds, , c("CHR", "BP", "SNPID", "REF", "ALT", "BETA", : column(s) not found: n_eff
Traceback:

1. rapidopgs_multi(gwas_QUANT, "./ref-data/", ancestry = "EUR", 
 .     ncores = 16)
2. ds[, c("CHR", "BP", "SNPID", "REF", "ALT", "BETA", "SE", "P", 
 .     "n_eff", "ld.block")]
3. `[.data.table`(ds, , c("CHR", "BP", "SNPID", "REF", "ALT", "BETA", 
 .     "SE", "P", "n_eff", "ld.block"))
4. stop("column(s) not found: ", paste(ansvars[is.na(ansvals)], 
 .     collapse = ", "))
GRealesM commented 3 years ago

Dear Sam, Thank you so much for trying our package, and for making me aware of this bug. Indeed, the "n_eff" shouldn't be in that line. I removed it so now it should work fine. Apologies for the inconvenience, and please let me know if you find more issues or have any comment or question.

Best wishes Guillermo

samkleeman1 commented 3 years ago

Many thanks for the rapid response! That appears to be working now, a lot faster than LDpred2...!!! A couple of additional questions:

  1. It is running now and according to the output about 50% of my SNPs are being 'reversed', I have set up my summary statistics so that A1 is always ALT and A2 is always REF so I was wondering why some SNPs are being reversed. Is the 1000G reference not configured with A1 as ALT and A2 as REF?
  2. How accurate does the heritability estimate need to be? We have performed 'GWAS-by-subtraction' to analyze a latent trait. As a result, the estimate of on-chip heritability is inflated, so I have used the heritability estimate (0.22) from one of the two measured phenotypes we are using to generate latent traits. Does this seem correct?

Thanks so much!

GRealesM commented 3 years ago

Hi Sam, sorry for the delay.

  1. In 1000 Genomes panel, A1 is the minor allele (no effect, since there's no "effect" in a reference panel). Usually, the minor allele matches the effect allele (ALT) in GWAS but in my experience, this is really not always the case, so an ALT in your GWAS may not be the minor allele in your reference population. What RápidoPGS is doing is to ensure that ALT will match A1 and REF A2 in your reference (so if you have multiple datasets, REF and ALT will be the same across all of them). When a ALT/REF pair doesn't match A1/A2 in your reference, it will try to flip them to match, changing the sign of your Beta for consistency so the effect is in the right direction.

  2. Since heritability is required for computing sd.prior, if you're using RapidoPGS-multi you can simply set sd.prior = NULL. This will make the package estimate the sd.prior automatically (which in some cases has proven to outperform setting a "custom" prior). Note that you can't do that for RapidoPGS-single. We have been working on improving our code and we're getting better results for RápidoPGS-multi, so if you give me a couple of days I'll update the package and the docs.

Thanks!

samkleeman1 commented 3 years ago

Many thanks for your response. I have confirmed that the 1000G A1/A2 ordering seems to be based on MAF rather than the GRCh37 reference/alt alleles so that makes sense.

Are you able to modify the package to include the UK Biobank LD reference from LDpred2? From looking at the code I don't think this would be a significant change. The advantage would be that about 10% of SNPs are seemingly being removed for being ambiguous - we are using UKB summary data so this could be avoided with UK biobank LD reference, as then all SNPs are measured on the same chip and so there is no risk of strand mismatch.

GRealesM commented 3 years ago

Yes, that's one of the new features. We're on it!

GRealesM commented 3 years ago

Hi Sam, I updated the package. You can now use the UK Biobank LD matrices by downloading them from https://figshare.com/articles/dataset/European_LD_reference/13034123, uncompress them to a directory, and then indicate the path to rapidopgs_multi(). Please be aware that you'll need to install/re-install coloc before reinstalling RapidoPGS (see Readme). Also, you'll need to specify the number of individuals in the panel (N_LD = 362320). Functions should work fine, but please let me know if you find any issue so I can fix it. Thanks!

samkleeman1 commented 3 years ago

Thanks so much for this. I am trying to run the updated command now and I see the following error message:

Error in `[.data.frame`(ds, chr == chrs, ): object 'chr' not found
Traceback:

1. rapidopgs_multi(gwas_QUANT, trait = "quant", LDmatrices = "ukb_ld/", 
 .     N_LD = 362320, N = 103837, pi_i = 1e-04, ncores = 16, alpha.block = 1e-04, 
 .     alpha.snp = 0.01, sd.prior = NULL)
2. ds[chr == chrs, ]
3. `[.data.frame`(ds, chr == chrs, )
GRealesM commented 3 years ago

Hi Sam, I found out what happened here. Since I use mainly data.table, but I also use some bigsnpr functions (which use data.frame instead), sometimes I forget to convert files back to data.table. It should be fixed now. Thanks for your patience, and please let me know if you run into trouble again.

samkleeman1 commented 3 years ago

Thanks so much!

Some warning messages I am getting below - should I be concerned?

Warning message in sdY.est(vbeta = beta_se^2, maf = ALT_FREQ, n = N):
“estimating sdY from maf and varbeta, please directly supply sdY if known”
Warning message in max(ld[stmp[[i]], stmp[[j]]]^2, na.rm = TRUE):
“no non-missing arguments to max; returning -Inf”
samkleeman1 commented 3 years ago

Also:

Warning message in susie(X, Y, L = L, scaled_prior_variance = prior_variance/var(Y), :
“IBSS algorithm did not converge in 100 iterations!”
GRealesM commented 3 years ago

I think you shouldn't worry about the first warning - it's not a real warning, and I'll remove it in the next commit. For the second, it suggests that there could be something wrong with the LD matrix subset (not sure what exactly, I'm trying to find that out). Actually, it's a bit hard to identify what's happening without seeing the data. If you could check where in the genome this happens, that might be a useful start.

For the third, this is a warning that comes from the original susie() function. You shouldn't worry about it either, since our runsusie() function includes a workaround that keeps iterating when convergence is not achieved (even though you still get the warning). We're trying our best to give useful warnings and error messages, but it will still take us a bit to fix all the possible glitches. Thanks again for letting me know!

samkleeman1 commented 3 years ago

I am getting an error when it finishes chromosome 22:

Error in value[[3L]](cond): Package ‘dplyr’ version 1.0.4 cannot be unloaded:
 Error in unloadNamespace(package) : namespace ‘dplyr’ is imported by ‘tidyr’ so cannot be unloaded

Traceback:

1. library(dplyr)
2. tryCatch(unloadNamespace(package), error = function(e) {
 .     P <- if (!is.null(cc <- conditionCall(e))) 
 .         paste("Error in", deparse(cc)[1L], ": ")
 .     else "Error : "
 .     stop(gettextf("Package %s version %s cannot be unloaded:\n %s", 
 .         sQuote(package), oldversion, paste0(P, conditionMessage(e), 
 .             "\n")), domain = NA)
 . })
3. tryCatchList(expr, classes, parentenv, handlers)
4. tryCatchOne(expr, names, parentenv, handlers[[1L]])
5. value[[3L]](cond)
6. stop(gettextf("Package %s version %s cannot be unloaded:\n %s", 
 .     sQuote(package), oldversion, paste0(P, conditionMessage(e), 
 .         "\n")), domain = NA)
GRealesM commented 3 years ago

Related to your “no non-missing arguments to max; returning -Inf” warning: What alpha.block parameter are you using? I discussed this with the coloc team, and we concluded that this happens when the credible sets are empty (ie. no causal variants in the LD block). When RápidoPGS runs, the first thing it does is to cut the genome into smaller pieces, which contain SNPs in varying degrees of LD, but are mainly independent among them (see Berisa and Pickrell, 2016 to see how these blocks where computed). Then, since we're looking for potential causal variants, we filter both blocks and snps within each block by p-value, so blocks with no potentially causal SNPs (ie. all P-values are bigger than a threshold, say 0.01) are skipped. What might have happened here is that, if the threshold is too high, blocks without credible causal SNPs get to susie, which in turn gives back an empty credible set - thus giving the error. We introduced a change so the function doesn't return warnings, but rather set that max value as zero, so please try reinstalling coloc (susie branch).

Regarding the dplyr error, this is very strange, and it seems an environment problem rather than a problem with RápidoPGS itself. Maybe restart R and try again?

samkleeman1 commented 3 years ago

Thanks so much for the explanation, very helpful!

Are you able to change the dose for snp_match(ds, map) so that there is no minimum match proportion?

GRealesM commented 3 years ago

You got it. Try now.