bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

LD clumping on Pool-Seq data #49

Closed: apfuentes closed this issue 4 years ago

apfuentes commented 4 years ago

Hi pcadapt developers, first of all, thanks for creating this useful tool. I'd like to ask three questions:

1) I evaluated whether LD might be an issue in my Pool-Seq dataset by looking at the plot of PC loadings. From the plot below, I interpret that several genomic regions in high LD contribute strongly to PC1 (this was later confirmed with individual-level sequence data, which shows large haplotype blocks in these regions). Since the LD.clumping option implemented in pcadapt applies only to genotype data, I was wondering if there is a recommended way to perform such SNP thinning on Pool-Seq data (population allele frequencies).

[Figure: PC loadings for PC1-4 (Loadings_PC_1-4_12_pops_14p_Neff_minDP20x_LAT)]
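(For readers following along: a minimal R sketch of this kind of loadings check, assuming a pcadapt result stored in `res` fitted with at least 4 components; it is not the exact command used for the figure above.)

    # Plot SNP loadings per PC to spot clusters of SNPs with extreme loadings,
    # which often flag long-range LD regions.
    # Assumes `res` is a pcadapt result, e.g. res <- pcadapt(input, K = 4).
    par(mfrow = c(2, 2))
    for (k in 1:4) {
      plot(res$loadings[, k], pch = 20, cex = 0.3,
           xlab = "SNP index", ylab = paste0("Loadings PC", k))
    }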

2) What is the benefit of performing component-wise genome scans? Is it that they help to better identify outlier loci that contribute strongly to a given PC?

3) Would it be necessary to adjust the genomic inflation factor for this Pool-Seq dataset, given the p-value distribution shown below? The MAF threshold used was 0.05.

[Figure: histogram of p-values (Histogram-Pvalues_9_pool_HOM_Neff_minDP20_MAF2_SPW)]

Thanks in advance for any help.

Best regards,

Angela

privefl commented 4 years ago

Thanks for your feedback.

  1. We did not implement LD clumping for Pool-seq data because, if I remember correctly, the sample size is usually very small. How many "populations" do you have?

  2. What do you mean by performing component-wise? The default is to use the Mahalanobis distance to combine the first K dimensions.

  3. By default, the result in $chi2.stat is corrected with the GIF, which is conservative and can lead to the distribution of p-values you see. If you prefer, you can use $stat, which is uncorrected, and apply some other correction, e.g. using the {qvalue} package, or none.
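(A hedged sketch of what this looks like in practice, assuming a pcadapt result `res` fitted with K components; the q-value cutoff of 0.05 is only an example.)

    # Uncorrected p-values from the uncorrected Mahalanobis statistic:
    K <- 4  # replace with the number of components actually used
    p_uncorrected <- pchisq(res$stat, df = K, lower.tail = FALSE)

    # Alternative multiple-testing correction with the Bioconductor package
    # {qvalue} (BiocManager::install("qvalue")), instead of relying on the GIF:
    qvals <- qvalue::qvalue(p_uncorrected)$qvalues
    outliers <- which(qvals < 0.05)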

apfuentes commented 4 years ago

Thanks for your prompt response. Answers below:

  1. For the species I am studying we have 13 populations = 13 Pool-seq datasets (for another species we have 9).

  2. I refer to H.4 Component-wise genome scans. I know that the default combines the first K principal components; I therefore wonder in which cases it would be better to perform a component-wise analysis (for a single component). See the sketch after this list.

  3. Thanks for the explanation. I was wondering whether it was necessary to tune the GIF given the bimodal p-value distribution (as opposed to the anti-conservative shape commonly expected, see http://varianceexplained.org/statistics/interpreting-pvalue-histogram/).
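(A minimal sketch of how a component-wise scan can be requested, assuming a pcadapt input object `input`; K = 4 and the plotted component are arbitrary choices.)

    # Component-wise genome scan: one test statistic per retained PC.
    res_cw <- pcadapt(input, K = 4, method = "componentwise")
    # res_cw$pvalues should then contain one column of p-values per component,
    # e.g. for the first component:
    hist(res_cw$pvalues[, 1], breaks = 20, col = "orange",
         xlab = "P-values (PC1)", main = NULL)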

Thanks

privefl commented 4 years ago
  1. I guess we could compute some R2 values using only 13 points (a rough illustration of the idea is sketched after this list). If you send me one of your matrices, it would be easier for me to implement something efficient. I can look at this at the end of the week.

  2. I'm not sure what would be the benefits. @mblumuga?

  3. What is the distribution of the uncorrected statistics?
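(The following is only a rough illustration of the clumping idea mentioned in point 1, not the developers' implementation: it uses the squared Pearson correlation of allele frequencies across the pools as an LD proxy and prunes greedily along the genome. `freq`, the window size and the r2 threshold are all hypothetical.)

    # `freq`: matrix of population allele frequencies, pools in rows,
    # SNPs in columns, columns ordered by genomic position.
    clump_poolseq <- function(freq, size = 200, thr = 0.2) {
      p <- ncol(freq)
      keep <- rep(TRUE, p)
      for (j in seq_len(p - 1)) {
        if (!keep[j]) next
        window <- (j + 1):min(j + size, p)
        # squared correlation between SNP j and the SNPs in its window,
        # computed over only ~13 points (the pools), hence quite noisy
        r2 <- as.vector(suppressWarnings(cor(freq[, j], freq[, window])))^2
        keep[window[!is.na(r2) & r2 > thr]] <- FALSE
      }
      which(keep)  # indices of the SNPs retained after pruning
    }
    # Real clumping usually also ranks SNPs (e.g. by MAF) before pruning.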

apfuentes commented 4 years ago
  1. Great, many thanks! How many lines do you need, and how do you prefer I send you this file?

  2. The histogram I shared earlier was obtained with hist(res$pvalues, xlab = 'P-values', main = NULL, breaks = 20, col = 'orange'). Following your recommendation, the plot below corresponds to the uncorrected p-values, using hist(res$stat, xlab = 'Uncorrected P-values', main = NULL, breaks = 20, col = 'orange'):

[Figure: Uncorrected_Pvalues]

privefl commented 4 years ago
  1. You can send it by email, it should not be too big, right? Or dropbox, or drive, as you prefer. An rds file would be great.

  2. You need to use pchisq(res$stat, df = K, lower.tail = FALSE) to get the uncorrected p-values from the uncorrected chi2 stats.

apfuentes commented 4 years ago
  1. Done, thanks!!

  2. Oops, thanks for that. The plot below is the result of hist(pchisq(res$stat, df = 11, lower.tail = FALSE), xlab = 'Uncorrected P-values', main = NULL, breaks = 20, col = 'orange'):

[Figure: histogram of uncorrected p-values]

Thanks

privefl commented 4 years ago
  1. I can't see the file you shared.

  2. Looks great.

apfuentes commented 4 years ago
  1. I shared a dropbox link to your email.
  2. Shall I keep using the standard p-values (res$pvalues) or the uncorrected p-values for outlier detection?

privefl commented 4 years ago
  1. The link just gives error 404.

  2. It should not make a huge difference; the GIF-corrected p-values are just more conservative.
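(A quick way to check how much this matters in practice, sketched under the assumption of a pcadapt result `res`, K = 11 as used above, and a 10% FDR via base R's p.adjust; none of these choices come from the developers.)

    K <- 11
    p_gif <- res$pvalues                                   # GIF-corrected
    p_raw <- pchisq(res$stat, df = K, lower.tail = FALSE)  # uncorrected
    sum(p.adjust(p_gif, method = "BH") < 0.1)  # number of outliers, corrected
    sum(p.adjust(p_raw, method = "BH") < 0.1)  # number of outliers, uncorrected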

apfuentes commented 4 years ago

Sent another link, hope this one works.

privefl commented 4 years ago

I tried clumping on your data. There are two problems, unfortunately:

  1. It does not completely solve your problem of long-range LD regions.
  2. The way we handle Pool-seq data means that you won't get any statistic for the variants removed by clumping.

apfuentes commented 4 years ago

Thanks very much for testing this out, much appreciated!

  1. How did you perform LD clumping on Pool-seq data? Would it be possible to share the code?

Given that this method does not remove LD completely, perhaps it is OK to leave the data as they are, and keep in mind that the first PCs will be driven by the long-range LD regions... What if these regions are actually involved in adaptation?

  2. It makes sense.

privefl commented 4 years ago

You can try the code on this branch if you want: https://github.com/bcm-uga/pcadapt/tree/clumping-poolseq
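(A hedged sketch of how to install that branch with {remotes}; the exact interface for Pool-seq clumping on the branch may differ, so check its documentation.)

    # install.packages("remotes")
    remotes::install_github("bcm-uga/pcadapt", ref = "clumping-poolseq")
    library(pcadapt)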

privefl commented 4 years ago

I would probably report only hits outside of these regions if I were you.
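(One hedged way to do such filtering, assuming hypothetical objects: `snps`, a data frame with one row per SNP, columns chr and pos, in the same order as the pcadapt input; `outliers`, the indices of the candidate SNPs; and `ld_regions`, a data frame with columns chr, start and end describing the long-range LD blocks seen in the loadings.)

    # TRUE if a SNP falls inside any of the long-range LD regions
    in_ld_region <- function(chr, pos, ld_regions) {
      any(ld_regions$chr == chr & pos >= ld_regions$start & pos <= ld_regions$end)
    }
    keep_hits <- outliers[!mapply(in_ld_region,
                                  snps$chr[outliers], snps$pos[outliers],
                                  MoreArgs = list(ld_regions = ld_regions))]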

apfuentes commented 4 years ago

Ok, thanks again for taking the time to test this. Best wishes