bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

Is the new version of PCAdapt applicable to reduced representation population genomic datasets with 1000's of markers? #64

Closed ptcooper closed 3 years ago

ptcooper commented 3 years ago

PCAdapt would be helpful for my dataset because I lack a demographic model and there is admixture in the data.

However, it is a ddRAD dataset from a population genomics project on a non-model organism with a de novo reference. As a result, it has thousands of markers. In many issues here, marker counts of this size are described as problematically low.

Similar to what is mentioned in "selecting K's and filtering criteria to use #55": these RAD datasets may start with hundreds of thousands of variants, but they end up with thousands of markers after filtering. In addition, the SNPs on each short-read contig need to be either thinned to one SNP per contig to reduce linkage, or haplotyped (which creates multiallelic markers).

This allows accurate Fst estimates to be calculated and physical linkage to be reduced.
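To make the thinning step concrete, here is a minimal Python sketch of keeping one randomly chosen SNP per contig. The `(contig_id, position)` tuple layout and the function name `thin_one_snp_per_contig` are assumptions made up for illustration; they are not part of pcadapt or of any particular RAD pipeline.

```python
import random


def thin_one_snp_per_contig(snps, seed=0):
    """Keep one randomly chosen SNP per contig.

    snps: iterable of (contig_id, position) tuples (hypothetical layout).
    seed: fixed seed so the same SNPs are picked on every run.
    """
    rng = random.Random(seed)
    by_contig = {}
    for contig, pos in snps:
        by_contig.setdefault(contig, []).append(pos)
    # Sort contigs and positions so the result depends only on the seed,
    # not on the order of the input records.
    return [(contig, rng.choice(sorted(positions)))
            for contig, positions in sorted(by_contig.items())]


# Toy input: three contigs carrying 2, 1 and 3 SNPs.
snps = [("contig1", 12), ("contig1", 57), ("contig2", 5),
        ("contig3", 33), ("contig3", 40), ("contig3", 91)]
thinned = thin_one_snp_per_contig(snps)
print(thinned)
```

Random choice is only one option; picking the first SNP or the one with the least missing data per contig would fit the same structure.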

A pinned issue mentions that null markers need to be added for such small datasets (thousands of markers). Wouldn't adding null markers artificially affect the false discovery rate as well as the distribution of p-values? If so, this could be an issue for people in population genomics, who often use this method to identify outlier loci.
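One part of this concern can be illustrated in isolation: a multiple-testing correction depends on how many tests are included. The sketch below uses the Benjamini-Hochberg procedure as an assumed example of an FDR correction (the thread does not say which correction is applied), with made-up p-values. It covers only the correction step and says nothing about how pcadapt computes its test statistics or p-values, which would also change when markers are added.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five markers (illustration only).
observed = [0.001, 0.01, 0.02, 0.5, 0.9]
# Twenty evenly spaced p-values standing in for added null markers.
nulls = [(k + 0.5) / 20 for k in range(20)]

adj_small = bh_adjust(observed)
adj_padded = bh_adjust(observed + nulls)[:len(observed)]

for p, a, b in zip(observed, adj_small, adj_padded):
    print(f"p={p:<6} adjusted alone={a:.4f}  adjusted with nulls={b:.4f}")
```

In this toy setup the adjusted values of the original markers get larger once the extra tests are included, because the correction scales with the total number of tests.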

In the vignette, https://github.com/bcm-uga/pcadapt/blob/master/vignettes/pcadapt.Rmd, a dataset with thousands of markers is used: "A total of 150 individuals coming from three different populations were genotyped at 1,500 diploid markers"

Am I misunderstanding something, or is this type of dataset no longer applicable?

Is an older version of the method more applicable for defining outlier loci in population genomics datasets?

Thank you!

privefl commented 3 years ago

I think adding null variants is really the way to go for this kind of data.

For the size in the tutorial, this is just some fake data for demonstration.