Closed apfuentes closed 4 years ago
Thanks for your feedback.
We did not implement LD clumping for poolseq data because the sample size is usually very small, if I remember correctly. How many "populations" do you have?
What do you mean by performing component-wise? The default is to use the Mahalanobis distance to combine all first K dimensions.
By default, the result of $chi2.stat is corrected with the GIF, which is conservative and can lead to the distribution of p-values you see. If you prefer, you can use $stat, which is uncorrected, and apply some other correction, e.g. with the {qvalue} package, or none.
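To illustrate the difference between the two statistics, here is a minimal base-R sketch on simulated data. It assumes the GIF is the median of the uncorrected statistics divided by the median of a chi-square distribution with K degrees of freedom; the vector `stat` stands in for `res$stat`, and the Benjamini-Hochberg adjustment from `p.adjust` is used in place of {qvalue} only to keep the example dependency-free.

```r
# Simulated stand-in for the uncorrected statistics (res$stat in this thread).
set.seed(1)
K <- 2                                   # number of retained PCs (assumed)
stat <- 1.5 * rchisq(10000, df = K)      # deliberately inflated statistics

# Assumed definition: GIF = median of the statistics / median of chi2(K).
gif <- median(stat) / qchisq(0.5, df = K)

# GIF-corrected statistics and p-values (the conservative default).
chi2_stat <- stat / gif
p_gif <- pchisq(chi2_stat, df = K, lower.tail = FALSE)

# Uncorrected p-values, followed by a different multiple-testing adjustment.
p_raw <- pchisq(stat, df = K, lower.tail = FALSE)
p_bh  <- p.adjust(p_raw, method = "BH")  # base-R stand-in for {qvalue}

# Compare the two p-value distributions side by side.
op <- par(mfrow = c(1, 2))
hist(p_gif, breaks = 20, main = "GIF-corrected", xlab = "P-values")
hist(p_raw, breaks = 20, main = "Uncorrected", xlab = "P-values")
par(op)
```

Whenever the GIF is above 1, the corrected p-values are at least as large as the uncorrected ones, which is what makes the default conservative.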
Thanks for your prompt response. Answers below:
For the species I am studying we have 13 populations = 13 pool-seq datasets (for another species we have 9)
I refer to H.4 Component-wise genome scans. I know that the default combines all first K principal components; thus I wonder in which cases it would be better to perform a component-wise analysis (for a single component).
Thanks for the explanation. I was wondering whether the GIF needs tuning, given the bimodal p-value distribution (as opposed to the anti-conservative shape that is commonly expected, see http://varianceexplained.org/statistics/interpreting-pvalue-histogram/)
Thanks
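For readers following the component-wise question, the sketch below contrasts the two kinds of scan on simulated z-scores. It is only an illustration under stated assumptions: the matrix `z` is made up, the plain mean and covariance are used where a robust estimate would normally be preferred, and it does not settle when one approach is better than the other.

```r
set.seed(42)
n_snps <- 5000
K <- 3
# Simulated z-scores: one row per SNP, one column per PC (hypothetical data).
z <- matrix(rnorm(n_snps * K), nrow = n_snps, ncol = K)
z[1:10, 2] <- z[1:10, 2] + 6   # ten SNPs planted as outliers on PC2 only

# Combined scan: one Mahalanobis distance per SNP across all K components.
# Non-robust mean/covariance, used here purely for simplicity.
d2 <- mahalanobis(z, center = colMeans(z), cov = cov(z))
p_combined <- pchisq(d2, df = K, lower.tail = FALSE)

# Component-wise scan: test each PC separately with a 1-df chi-square.
p_by_pc <- apply(z^2, 2, pchisq, df = 1, lower.tail = FALSE)

# Which PC gives the smallest p-value for each planted SNP?
best_pc <- apply(p_by_pc[1:10, ], 1, which.min)
table(best_pc)
```

The combined scan returns a single p-value per SNP, whereas the component-wise scan returns one per SNP and per PC, so it also indicates which component a given SNP is associated with.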
I guess we could compute some R2 values using only 13 points. If you send one of your matrices, it would be easier for me to implement something efficient. I can look at this at the end of the week.
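A rough idea of what R2-based clumping on pooled allele frequencies could look like is sketched below. This is not the pcadapt implementation: the function `clump_freq`, its `window` and `r2_max` parameters, the ranking score and the simulated frequency matrix are all hypothetical choices made for the example.

```r
set.seed(7)
n_pools <- 13
n_snps  <- 200
# Hypothetical allele-frequency matrix: rows = pools, columns = SNPs.
freq <- matrix(runif(n_pools * n_snps), nrow = n_pools)
# Make SNPs 2..5 near-copies of SNP 1 to mimic a block in strong LD.
for (j in 2:5) {
  freq[, j] <- pmin(pmax(freq[, 1] + rnorm(n_pools, sd = 0.02), 0), 1)
}

# Score used to rank SNPs (variance across pools, an arbitrary choice).
score <- apply(freq, 2, var)

clump_freq <- function(freq, score, window = 50, r2_max = 0.2) {
  keep <- rep(TRUE, ncol(freq))
  # Visit SNPs from highest to lowest score.
  for (i in order(score, decreasing = TRUE)) {
    if (!keep[i]) next
    # Neighbours within `window` columns of the index SNP, still kept.
    nb <- setdiff(max(1, i - window):min(ncol(freq), i + window), i)
    nb <- nb[keep[nb]]
    if (length(nb) == 0) next
    # Squared correlation of allele frequencies across the pools.
    r2 <- as.vector(cor(freq[, i], freq[, nb, drop = FALSE]))^2
    keep[nb[which(r2 > r2_max)]] <- FALSE
  }
  which(keep)
}

kept <- clump_freq(freq, score)
length(kept)   # number of SNPs retained after clumping
```

With only 13 pools, each R2 value is estimated from 13 points, so unlinked SNPs can exceed the threshold by chance.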
I'm not sure what would be the benefits. @mblumuga?
What is the distribution of the uncorrected statistics?
Great, many thanks! How many lines do you need, and how would you prefer that I send you this file?
The histogram I shared earlier was obtained with hist(res$pvalues, xlab = 'P-values', main = NULL, breaks = 20, col = 'orange'). Following your recommendation, the plot below corresponds to the uncorrected p-values using hist(res$stat, xlab = 'Uncorrected P-values', main = NULL, breaks = 20, col = 'orange'):
You can send it by email, it should not be too big, right? Or dropbox, or drive, as you prefer. An rds file would be great.
You need to use pchisq(res$stat, df = K, lower.tail = FALSE) to get the uncorrected p-values from the uncorrected chi2 stats.
Done, thanks!!
Oops, thanks for that. The plot below is the result of hist(pchisq(res$stat, df = 11, lower.tail = FALSE), xlab = 'Uncorrected P-values', main = NULL, breaks = 20, col = 'orange'):
Thanks
I can't see the file you shared.
Looks great.
The link is just error 404.
It should not make a huge difference, gif-corrected is just more conservative.
Sent another link, hope this one works.
I tried clumping on your data. There are two problems unfortunately:
Thanks very much for testing this out, much appreciated!
Given that this method does not remove LD completely, perhaps it is OK to leave the data as it is, keeping in mind that the first PCs will be driven by the long-range LD regions?
What if these regions are actually involved in adaptation?
You can try the code on this branch if you want: https://github.com/bcm-uga/pcadapt/tree/clumping-poolseq
I would probably report only hits outside of these regions if I were you.
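Reporting only hits outside of these regions can be done with a simple position filter. The sketch below uses made-up SNP coordinates, p-values, region boundaries and significance threshold; the helper `in_region` is hypothetical and not part of pcadapt.

```r
# Hypothetical scan results: chromosome, position and p-value per SNP.
snps <- data.frame(
  chr  = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  pos  = c(1500, 52000, 300, 78000, 4100),
  pval = c(1e-8, 1e-9, 1e-7, 0.2, 1e-10)
)

# Hypothetical long-range LD regions to exclude (placeholder coordinates).
ld_regions <- data.frame(
  chr   = c("chr1", "chr2"),
  start = c(50000, 70000),
  end   = c(60000, 80000)
)

# TRUE for every SNP that falls inside any of the listed regions.
in_region <- function(chr, pos, regions) {
  vapply(seq_along(pos), function(i) {
    any(regions$chr == chr[i] &
        regions$start <= pos[i] &
        pos[i] <= regions$end)
  }, logical(1))
}

snps$in_ld <- in_region(snps$chr, snps$pos, ld_regions)

# Keep significant SNPs that lie outside the LD regions.
hits <- subset(snps, pval < 1e-6 & !in_ld)
hits
```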
Ok, thanks again for taking the time to test this. Best wishes
Hi pcadapt developers, First of all, thanks for creating this useful tool. Second, I’d like to ask 3 questions:
1) I evaluated whether LD might be an issue in my Pool-Seq dataset by looking at the plot of PC loadings. From the plot below, I interpret that several genomic regions in high LD are largely contributing to PC1 (this was also confirmed later from individual sequence data that shows large haplotype blocks at these regions). Since the LD.clumping function implemented in PCAdapt applies only to genotype data, I was wondering if there is a recommended way to perform such SNP thinning on Pool-Seq data (population allele frequencies).
2) What is the benefit of performing component-wise genome scans? Does it perhaps help to better identify outlier loci that contribute strongly to a given PC?
3) Would it be necessary to modify the genome inflation factor of this Pool-seq dataset given the P-value distribution shown below? MAF used was 0.05.
Thanks in advance for any help.
Best regards,
Angela