bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

Problem interpreting strange results #38

Closed spencer411 closed 4 years ago

spencer411 commented 5 years ago

I'm working with a large WGS dataset (haploid clonal pathogen; 358 isolates), that exhibits very little recombination. I am having trouble interpreting the results and am just looking to get some answers on what might be going on (whether the results suggests very high LD across my data-set, or if something is just wrong with my input file altogether). Attached are some examples of the plots I'm receiving back. Results are rather consistent no matter what K I choose (although in the results here, I am working with K = 6). Any help/input is appreciated. plots

privefl commented 5 years ago

The problem might come from the fact that your species is haploid. Is it coded as 0s and 1s? You might want to recode it with 0s and 2s as pcadapt expects some diploid species.

mblumuga commented 5 years ago

First, you need to specify the ploidy using ploidy=1 in the pcadapt function. Then LD might be strong in your dataset and population structure should be estimated using LD thinning. You should read an apply the code provided in the section "F. Linkage Disequilibrium (LD) thinning" of the vignette.

spencer411 commented 5 years ago

First off, thanks for the super speedy reply.

I did not originally specify ploidy, although when I did (and thinned per the vignette), my results were not that different. The code below gave me the following results:

x <- pcadapt(input = filename, ploidy=1, K = 20, LD.clumping = list(size = 200, thr = 0.1)) plot(x, option = "screeplot") x <- pcadapt(filename, ploidy = 1, K = 7, method = c("mahalanobis", "componentwise"), min.maf = 0.05, LD.clumping = list(size = 200, thr = 0.1), pca.only = FALSE) summary(x) plot(x , option = "manhattan") plot(x, option = "qqplot") hist(x$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange") plot(x, option = "stat.distribution") plots2

privefl commented 5 years ago

It looks similar to https://github.com/bcm-uga/pcadapt/issues/24.

mblumuga commented 5 years ago

I suspect that p-values equal to 0 correspond to fixed differences between populations. I have seen that previously in some datasets. I suggest

  1. to use score plots (PC2 vs PC1, PC4 vs PC3...) to check that population structure as ascertained with PCA is meaningful.
  2. to look for the SNPs with the smallest P-values to see if it corresponds to fixed differences between populations.
spencer411 commented 4 years ago

I have had a chance to have back at this dataset, and below is where I'm at in terms of mblumuga's comment above:

Comment 1. to use score plots (PC2 vs PC1, PC4 vs PC3...) to check that population structure as ascertained with PCA is meaningful.

I have done this and K = 2 seems the best bet based on the revised scree plot and score plots below, nevertheless despite LD.Clumping the significance values still seem oddly distributed.

  1. to look for the SNPs with the smallest P-values to see if it corresponds to fixed differences between populations.

Note that in the last figure below there is a phylogenetic tree with the number of minor allele frequencies for the lowest 10 p-values denoted with black barplots on the outside of the population designations (delineated by color and numbered according to the PC score plots). All of these smallest p-values correspond to this major two way split in the tree (so yes).

Given these results, my questions are:

  1. Are my results sound, or is a clonal asexual haploid pathogen problematic under the assumptions of pcadapt? Is population structure in my dataset just too strong to detect true outliers?

  2. Would some additional analysis help to narrow down my potentially adaptive loci? I used the Bonferroni method here, yet my p-values are still at the bottom of my manhattan plot (see the black line in second to last figure). Is there any way to narrow down my pool of loci (adjust p-value based on the data)?

Any inference or suggestions based on my results is greatly, greatly appreciated. All figures and code used below: pcadapt_outputs

PCAdapt_Manhattan_New.txt

Huyuxi08 commented 4 years ago

Hi, @spencer411

Have you solved this problem? I got the same strange results as you, and I do not have any clue about the results. Hope to discuss with you.