Closed samkleeman1 closed 3 years ago
Dear Sam, Thank you so much for trying our package, and for making me aware of this bug. Indeed, the "n_eff" shouldn't be in that line. I removed it so now it should work fine. Apologies for the inconvenience, and please let me know if you find more issues or have any comment or question.
Best wishes Guillermo
Many thanks for the rapid response! That appears to be working now, a lot faster than LDpred2...!!! A couple of additional questions:
Thanks so much!
Hi Sam, sorry for the delay.
In 1000 Genomes panel, A1 is the minor allele (no effect, since there's no "effect" in a reference panel). Usually, the minor allele matches the effect allele (ALT) in GWAS but in my experience, this is really not always the case, so an ALT in your GWAS may not be the minor allele in your reference population. What RápidoPGS is doing is to ensure that ALT will match A1 and REF A2 in your reference (so if you have multiple datasets, REF and ALT will be the same across all of them). When a ALT/REF pair doesn't match A1/A2 in your reference, it will try to flip them to match, changing the sign of your Beta for consistency so the effect is in the right direction.
Since heritability is required for computing sd.prior, if you're using RapidoPGS-multi
you can simply set sd.prior = NULL
. This will make the package estimate the sd.prior automatically (which in some cases has proven to outperform setting a "custom" prior). Note that you can't do that for RapidoPGS-single
.
We have been working on improving our code and we're getting better results for RápidoPGS-multi, so if you give me a couple of days I'll update the package and the docs.
Thanks!
Many thanks for your response. I have confirmed that the 1000G A1/A2 ordering seems to be based on MAF rather than the GRCh37 reference/alt alleles so that makes sense.
Are you able to modify the package to include the UK Biobank LD reference from LDpred2? From looking at the code I don't think this would be a significant change. The advantage would be that about 10% of SNPs are seemingly being removed for being ambiguous - we are using UKB summary data so this could be avoided with UK biobank LD reference, as then all SNPs are measured on the same chip and so there is no risk of strand mismatch.
Yes, that's one of the new features. We're on it!
Hi Sam, I updated the package. You can now use the UK Biobank LD matrices by downloading them from https://figshare.com/articles/dataset/European_LD_reference/13034123, uncompress them to a directory, and then indicate the path to rapidopgs_multi()
.
Please be aware that you'll need to install/re-install coloc before reinstalling RapidoPGS (see Readme). Also, you'll need to specify the number of individuals in the panel (N_LD = 362320).
Functions should work fine, but please let me know if you find any issue so I can fix it.
Thanks!
Thanks so much for this. I am trying to run the updated command now and I see the following error message:
Error in `[.data.frame`(ds, chr == chrs, ): object 'chr' not found
Traceback:
1. rapidopgs_multi(gwas_QUANT, trait = "quant", LDmatrices = "ukb_ld/",
. N_LD = 362320, N = 103837, pi_i = 1e-04, ncores = 16, alpha.block = 1e-04,
. alpha.snp = 0.01, sd.prior = NULL)
2. ds[chr == chrs, ]
3. `[.data.frame`(ds, chr == chrs, )
Hi Sam, I found out what happened here. Since I use mainly data.table, but I also use some bigsnpr functions (which use data.frame instead), sometimes I forget to convert files back to data.table. It should be fixed now. Thanks for your patience, and please let me know if you run into trouble again.
Thanks so much!
Some warning messages I am getting below - should I be concerned?
Warning message in sdY.est(vbeta = beta_se^2, maf = ALT_FREQ, n = N):
“estimating sdY from maf and varbeta, please directly supply sdY if known”
Warning message in max(ld[stmp[[i]], stmp[[j]]]^2, na.rm = TRUE):
“no non-missing arguments to max; returning -Inf”
Also:
Warning message in susie(X, Y, L = L, scaled_prior_variance = prior_variance/var(Y), :
“IBSS algorithm did not converge in 100 iterations!”
I think you shouldn't worry about the first warning - it's not a real warning, and I'll remove it in the next commit. For the second, it suggests that there could be something wrong with the LD matrix subset (not sure what exactly, I'm trying to find that out). Actually, it's a bit hard to identify what's happening without seeing the data. If you could check where in the genome this happens, that might be a useful start.
For the third, this is a warning that comes from the original susie()
function. You shouldn't worry about it either, since our runsusie()
function includes a workaround that keeps iterating when convergence is not achieved (even though you still get the warning). We're trying our best to give useful warnings and error messages, but it will still take us a bit to fix all the possible glitches. Thanks again for letting me know!
I am getting an error when it finishes chromosome 22:
Error in value[[3L]](cond): Package ‘dplyr’ version 1.0.4 cannot be unloaded:
Error in unloadNamespace(package) : namespace ‘dplyr’ is imported by ‘tidyr’ so cannot be unloaded
Traceback:
1. library(dplyr)
2. tryCatch(unloadNamespace(package), error = function(e) {
. P <- if (!is.null(cc <- conditionCall(e)))
. paste("Error in", deparse(cc)[1L], ": ")
. else "Error : "
. stop(gettextf("Package %s version %s cannot be unloaded:\n %s",
. sQuote(package), oldversion, paste0(P, conditionMessage(e),
. "\n")), domain = NA)
. })
3. tryCatchList(expr, classes, parentenv, handlers)
4. tryCatchOne(expr, names, parentenv, handlers[[1L]])
5. value[[3L]](cond)
6. stop(gettextf("Package %s version %s cannot be unloaded:\n %s",
. sQuote(package), oldversion, paste0(P, conditionMessage(e),
. "\n")), domain = NA)
Related to your “no non-missing arguments to max; returning -Inf” warning: What alpha.block parameter are you using? I discussed this with the coloc team, and we concluded that this happens when the credible sets are empty (ie. no causal variants in the LD block). When RápidoPGS runs, the first thing it does is to cut the genome into smaller pieces, which contain SNPs in varying degrees of LD, but are mainly independent among them (see Berisa and Pickrell, 2016 to see how these blocks where computed). Then, since we're looking for potential causal variants, we filter both blocks and snps within each block by p-value, so blocks with no potentially causal SNPs (ie. all P-values are bigger than a threshold, say 0.01) are skipped. What might have happened here is that, if the threshold is too high, blocks without credible causal SNPs get to susie, which in turn gives back an empty credible set - thus giving the error. We introduced a change so the function doesn't return warnings, but rather set that max value as zero, so please try reinstalling coloc (susie branch).
Regarding the dplyr error, this is very strange, and it seems an environment problem rather than a problem with RápidoPGS itself. Maybe restart R and try again?
Thanks so much for the explanation, very helpful!
Are you able to change the dose for snp_match(ds, map)
so that there is no minimum match proportion?
You got it. Try now.
Many thanks for making this awesome package available! We are trying to use it in our workflow but I am finding that rapidopgs_multi does not work at present. I am getting an error message that I have copied below. We are analyzing a quantitative trait.
Thanks in advance!
Dr Sam Kleeman PhD Student Cold Spring Harbor Laboratory, NY