bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

Choosing the best K: screeplot vs. scoreplot #35

Closed ydorant closed 5 years ago

ydorant commented 5 years ago

Hi Michael and team PCAdapt

I have a problem to understand the results plot (screeplot and scoreplot) from PCAdapt in my dataset. I'm currently working on a subsampling part of my broadscale dataset based on GBS-capture SNPs for a lobster specie. So, my data here contains 389 individuals distributed among 4 sampling sites (N over pop : 146, 140, 53 50). The vcf file contains 6,150 SNPs and the average of missing data represents < 5%. *Note: Population genetic information about my species - very weak level of genetic differentiation (FST < 0.01).

Here it's my code lines:

filename <- read.pcadapt(input = "batch_7_max_maf_T1.recode.vcf", type = "vcf")

x <- pcadapt(input = filename, K = 12, ploidy = 2, min.maf = 0.05)

#screeplot --> choose the x-axis break (see tutorial)
plot(x, option = "screeplot")

popmap <- read.table('strata_max_maf_T1.txt', h = F)[,2] 
poplist.names = as.vector(popmap) %>% as.data.frame %>% set_colnames(., c('POP'))
npop <- nrow(unique(poplist.names))

## trace PCA with samples
plot(x, option = "scores", pop = poplist.names$POP, col = rainbow(npop))

If I refer to the first figure below (Fig.1) Fig1_PCAdapt_screeplot The best K value could be K=2 or K=3

On the other hand, I tested the second option to determine the K value, the score plot. Fig2_PCAdapt_scoreplot

Here, I loaded the STRATA mapInfo of my samples affiliated to their sampling site as the "population info" to plot the graph. Based on this results, we can observed 4 individuals as outliers, driving most of the variance on the two first axis. All the other individuals are diplayed in the same cloud.

Then, I decided to remove this four outliers samples and re-run PCAdapt. Again, I found two other outliers individuals. Ok. Then I removed them and re-run a third PCAdapt. Here are the two graphs for the 389 -6 bad samples = 383. Fig3 PCAdapt_screeplot_bad_samples_removed Fig4_PCAdapt_scoreplot_bad_samples_removed

For me, the signal of potential structure is lost and the best K value is 1. To this, can we use PCAdapt in this context ?

The question is, is the K method (i.e. screeplot and scroreplot) clearly related ? In a context of blind structure, should we use the screeplot method only ?

What is you feeling about this ?

Thanks guys. Yann,

mblumuga commented 5 years ago

Your approach of using the score in addition to the scree plot is recommended. I would have done the same and I would also have removed outliers. Which is disappointing is that there is no more structure after removal of outliers. If there is no pop structure, you cannot use pcadapt because it looks for SNPs excessively related with pop structure (which should exist!). Have you try to run e.g. structure to see if you find pop structure after removal of outliers? If you do not find pop structure, pcadapt is not suited to your dataset;

ydorant commented 5 years ago

Thank you Michael. Yeah ok I understand the way. Indeed, with PCAdapt you have to know a priori the signal of population structure. For my study species, we know at the broad scale that the population genetic signal is very week and then admixed. So, in this context, regarding at very small scale patterns among transects, PCAdapt is not the best way to discriminate markers potentially under selection. I think, I will test genome scans with environmental factor (RDA, LFMM, BayPass) in order to potentially detect any signal.

Thanks for your help. Yann.

privefl commented 5 years ago

You don't need to know a priori the signal of population structure when using {pcadapt}. It should be detected via PCA, if there is any.

mblumuga commented 5 years ago

Yes, @privefl is right. You can use pca as implemented in pcadapt to detect if there is pop structure. In your dataset, pca provides no evidence for pop structure.

ydorant commented 5 years ago

Yes of course, classical PCA and PCoa did not highlighted any pop structure. The system is very difficult to understand. Screening potential outliers SNPs via GEA could be a way. I will test it.