chr1swallace / coloc

Repo for the R package coloc
139 stars 44 forks source link

cannot run Susie for colocalisation between GWAS and eQTL #71

Closed agilly closed 2 years ago

agilly commented 2 years ago

I am trying to use Susie to run colocalisation analysis between GWAS and eQTL. We have hundreds of GWAS signals, some of which have one signal, some of which have multiple independent signals. So far, we have been performing conditional + regular coloc. I was hoping to use Susie to relax the single causal variant assumption in both datasets. However, I have not found a way to run it in a way that makes sense given our data.

Here is an example of a GWAS peak with two independent signals confirmed by GCTA-COJO (it comes from a meta-analysis). image Already at this point I am struggling to get Susie to correctly identify the two subsignals in a satisfactory way. The red and green dots are the credible sets identified by runsusie(sumstat, estimate_prior_variance=F, prior_variance=0.15). When I ran runsusie without overriding any of the prior parameters, I got the error mentioned in #64. I forced the prior variance to be 0.15 squared at first, since my trait is standardised and normalised, but the algorithm returned no credible set. I also tried reducing the coverage. At 0.5, still no credible sets, at 0.01, there is only 1 credible set corresponding to the main signal.

Suspecting a potential issue with allele matching, as you indicated previously, I performed some diagnostics:

check_alignment(sumstat)

image 90% sounds not symmetric so I take it that's good, but it's not entirely positive either. The LD matrix was generated from Plink so it is minor allele oriented, I guess. The alleles in the sumstats have been flipped so that EA = minor allele. Here is the diagnostic plot described in https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html, if that's useful:

condz_in=kriging_rss(z_scores, sumstat$LD)

image

I get the following warnings:

Warning messages:
1: In kriging_rss(z_scores, sumstat$LD) :
  The matrix R is not positive semidefinite. Negative eigenvalues are set to zero.
2: In estimate_s_rss(z, R, r_tol, method = "null-mle") :
  The matrix R is not positive semidefinite. Negative eigenvalues are set to zero

Overall that plot looks very bad and far removed from the normal output.

So, in summary, I am looking for a simple heuristic to detect arbitrary numbers of peaks at varying significances in my datasets. Getting it to work on this example would be a start, and then do the same for the less significant eQTL dataset which also has the same 2 peaks. I would also like to have a sure way to diagnose the datasets. For example for the eQTL dataset, I am not sure of the direction of alleles because of the way GTEx reports AFs and effects. They are ALT allele effects, and the LD matrix is generated via Plink from 1000 genomes VCFs, which turns them to minor. Most of the time, minor alleles in GTEx will be the same as in the wider 1000 genomes, but sometimes they won't. I get 79% of positive ratios for the diagnostic plot. Is that good enough?

Here are the regular coloc results: image

mhaiyue commented 2 years ago

I have the same issue, any suggestions would be appreciated...

agilly commented 2 years ago

@mhaiyue I have not heard back, so I've assumed Susie is not suitable for eQTL/complex trait coloc. We have now reverted to our previous method of running fastcoloc on leave-one-out conditioned datasets.

chr1swallace commented 2 years ago

apologies for delay in response.

susie (and other Bayesian fine mapping methods) do tend to be sensitive to the LD matrix. eg using too few subjects to estimate it, and/or samples from (subtly?) different ancestry.

if you have plink estimating LD by minor allele, does it not provide the identity of the minor allele? if so, you can flip the effect sizes in your data for those snps where the alleles have been flipped (or, equivalently, flip the sign for the corresponding rows and columns of the LD matrix).

My experience using susie on real data has been mixed, and I am still working out rules of thumb that seem sensible.

Eg, if I run susie and get 6+ effects, I presume it has failed, and switch to conditioning. Note that conditional analysis is also sensitive to LD, but perhaps less so. Still, if I see many effects, or if the (conditional) p values for any secondary effects are smaller than any primary effect, I presume it has failed and switch to single coloc.

I am also very careful to try and get an appropriate LD matrix for each dataset, so reading the original paper, checking the origin of the samples, finding the largest public well matched samples to use.

Sorry not to be able to provide a solution, this problem arises because all the Bayesian fine mapping methods work brilliantly assuming complete data (ie full genotypes). When we approximate that by summary statistics + LD, susie still works really really well, if the LD is the actual sample LD. But even relatively small fluctuations can create havoc.

agilly commented 2 years ago

Hi Chris,

thanks for your reply, and sorry for the delay. I did use the population LD for the above dataset. The GWAS data comes from a meta-analysis of 2 populations, and the LD is calculated on the merged genotypes of these 2 populations. For the second part of the analysis which is GTEx eQTL, I was expecting this type of problem since the GTEx genotypes are not easily accessible and 1kG or other external panels must be used as reference. But in this case I was expecting a relatively smooth detection of independent signals.

chr1swallace commented 2 years ago

If you have the full genotype data, then you could run susie directly on that? There's a function (perhaps not documented, I can find or adapt if this matches your situation) to take susie output into coloc

On Thu, 2022-03-24 at 01:14 -0700, Arthur Gilly wrote:

Hi Chris, thanks for your reply, and sorry for the delay in replying. I did use the population LD for the above dataset. The GWAS data comes from a meta-analysis of 2 populations, and the LD is calculated on the merged genotypes of these 2 populations. For the second part of the analysis which is GTEx eQTL, I was expecting this type of problem since the GTEx genotypes are not easily accessible and 1kG or other external panels must be used as reference. But in this case I was expecting a relatively smooth detection of independent signals. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: @.***>