General question - very low variation

fvaux commented 6 years ago

I just wanted to ask a general question about the appropriateness of applying pcadapt...

I have a SNP dataset with very low variation, which results in almost all PCs having near-identical eigenvalues (~1.5% each). I am therefore unable to reliably choose a particular K value for pcadapt. As a test, I have run pcadapt using 2 or 3 PCs, and around 100 sites out of 15,000 SNPs are identified as outliers. When I look at other statistics such as Fst values, these 100 sites seem unlikely to be outliers.

Is my dataset simply too uniform for pcadapt to be used meaningfully?

privefl commented 6 years ago

How many individuals do you have?

Does the score plot show some structure? Does the histogram of p-values seem uniform (with some excess near 0)?
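The p-value check can be sketched roughly as follows. This is a plain-Python illustration, not pcadapt code: it assumes the p-values have been exported as a list of floats, and `pvalue_histogram` and `first_bin_excess` are hypothetical helper names. A well-behaved scan should give roughly equal counts in most bins, with extra mass only in the first one.

```python
def pvalue_histogram(pvalues, n_bins=20):
    """Count p-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        # Skip missing values (None or NaN); NaN is the only float != itself.
        if p is None or p != p:
            continue
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        # p == 1.0 would index one past the end, so clamp to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def first_bin_excess(counts):
    """Ratio of the first-bin count to the mean count of the other bins."""
    rest = counts[1:]
    mean_rest = sum(rest) / len(rest)
    return counts[0] / mean_rest if mean_rest > 0 else float("inf")


# Toy input: 1000 evenly spread p-values plus 100 very small ones.
toy = [(i + 0.5) / 1000 for i in range(1000)] + [0.001] * 100
print(first_bin_excess(pvalue_histogram(toy)))
```

A ratio near 1 means no excess of small p-values. A strongly non-flat histogram away from zero would suggest the test statistic is poorly calibrated for the chosen K.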

mblumuga commented 6 years ago

If you send us your scree plot and the score plot with the first 2 PCs, we might be able to answer.

fvaux commented 6 years ago

I have 96 individuals. The score plot does show structure for PC1 - but I have investigated it further, and have strong reason to believe that it reflects sex (i.e. males on the left, females on the right). There's no apparent structure beyond PC1.

[Images: screeplot1, pca-12]

Slightly different data, demonstrating that PC1 reflects sex: [Image: pcadapt2]

Using adegenet, my eigenvalues and percentages are as follows: [Uploading eigen-summary.xlsx…]()

When I analyse e.g. females only (50 individuals), I still have no structure and the scree plot and PC1+2 look like this: [Images: pcadapt1s, pcadapt2s]

But even then, pcadapt finds 49 outliers for an alpha of 0.1.
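For context, an alpha of 0.1 is often applied as a false discovery rate threshold. The sketch below shows the Benjamini-Hochberg step-up procedure in plain Python, assuming that is how the cutoff was applied (the thread does not say which correction was used). `benjamini_hochberg` is an illustrative name, not a pcadapt function.

```python
def benjamini_hochberg(pvalues, alpha=0.1):
    """Return the sorted indices rejected by the BH step-up procedure.

    The largest rank k with p_(k) <= alpha * k / m is found, and the k
    smallest p-values are all rejected, even if some of them individually
    sit above their own threshold.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Toy p-values, not real data.
toy_pvalues = [0.001, 0.5, 0.02, 0.9, 0.03]
print(benjamini_hochberg(toy_pvalues, alpha=0.1))
```

Under this reading, and only if the p-values are well calibrated, about 10% of the 49 flagged SNPs (roughly 5) would be expected to be false discoveries. With no clear structure in the PCs, that calibration is exactly what is in doubt.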

So yeah, based on the lack of structure in the PCs (beyond PC1 reflecting sex in the first dataset shown) and the homogeneous eigenvalues, should I ignore these outlier results?
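One rough way to judge how flat the eigenvalues are is to compare the observed share per PC with the share each PC would get if every non-zero eigenvalue were equal. The sketch below assumes a centred genotype matrix, which has at most n - 1 non-zero eigenvalues for n individuals. The function names are made up for illustration and are not part of pcadapt or adegenet.

```python
def flat_variance_share(n_individuals):
    """Variance share per PC if all non-zero eigenvalues were equal.

    Assumes a centred data matrix, so at most n - 1 eigenvalues are
    non-zero and a perfectly flat scree plot gives 1 / (n - 1) each.
    """
    if n_individuals < 2:
        raise ValueError("need at least two individuals")
    return 1.0 / (n_individuals - 1)


def excess_over_flat(observed_share, n_individuals):
    """Ratio of an observed variance share to the flat baseline."""
    return observed_share / flat_variance_share(n_individuals)


# Values taken from this thread: 96 individuals, ~1.5% per PC.
baseline = flat_variance_share(96)
ratio = excess_over_flat(0.015, 96)
print(f"flat baseline: {baseline:.4f}, observed / baseline: {ratio:.2f}")
```

With 96 individuals the flat share is 1/95, about 1.05%, so ~1.5% per PC is not far above it. Real data scatter around the baseline even without structure, so this is a sanity check rather than a formal test.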

So far I've been removing the pcadapt outliers from my putatively neutral SNP datasets as a precaution, but I've been wondering if I should simply discount the pcadapt outliers given the low variation among samples.

As an aside, when I do group individuals by sex - pcadapt identifies 93 outliers. Using the same sex dataset, BayeScan also identifies the same 93 outliers. In comparison, when I group individuals geographically - BayeScan never identifies any outliers, whereas pcadapt always does. The 93 'sex' outliers have high Fst values, and blast to some expected genes. The other 'geographic' pcadapt outliers often have negative Fst values and don't blast to anything obvious.
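The comparison between the two methods can be made explicit with a small set-overlap summary. This is a minimal sketch assuming each tool's outliers have been exported as lists of SNP identifiers; `compare_outlier_sets` and the toy identifiers are hypothetical.

```python
def compare_outlier_sets(outliers_a, outliers_b):
    """Summarise the overlap between two collections of SNP identifiers."""
    a, b = set(outliers_a), set(outliers_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard index: 1.0 for identical sets, 0.0 for disjoint ones.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy identifiers, not real data.
pcadapt_hits = ["snp_12", "snp_40", "snp_77"]
bayescan_hits = ["snp_40", "snp_77", "snp_93"]
print(compare_outlier_sets(pcadapt_hits, bayescan_hits))
```

A Jaccard index of 1.0 corresponds to the situation described for the sex grouping, where both methods flag the same SNPs. An empty list from one method gives 0.0, as in the geographic grouping.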

mblumuga commented 6 years ago

I do agree that there is a signal for a single PC that corresponds to sex. If it makes sense to look for outliers w.r.t. this axis of variation, it makes sense to look at pcadapt outliers (or Fst outliers). If PC1 is not relevant for outliers/biological adaptation, I would not use pcadapt for your analysis because there are no other PCs that correspond to biological signals.

fvaux commented 6 years ago

Okay, thank you for your insight! I've found pcadapt really useful overall, but just needed to check about what to do when the other PCs don't correspond to any obvious biological signal (etc.).

bcm-uga / pcadapt

General question - very low variation #22