bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

MAF filters and p-value distribution. #83

Open akamolphat opened 10 months ago

akamolphat commented 10 months ago

Dear all,

I was wondering how MAF filters affect the p-value distribution. I have RADseq SNP data of a species with very high population structure and I have used a rather low MAF filter (minor allele counts > 10 in a dataset of 886 diploid individuals, approx. MAF > 0.006) to retain as many SNPs as possible. In doing so, I retained 11704 SNPs.

I ran the initial PCA and decided to keep K = 6 (see scree plot, I could also argue for K = 10-12 from previous population structure studies).

9c14650e-6558-4325-b687-48417d0a7ca4

I ran pcadapt with K = 6 and found the histogram of p-values to be U-shaped. I have previously observed this before as well.

cb3b8a27-e605-4969-9103-3de3713e29b4

Rerunning pcadapt with higher min.maf thresholds seem to fix this but reduced the number of SNPs greatly. I also do not want to remove SNPs with MAF <= 0.05 that may potentially be true outliers.

MAF > 0.01: 10242 SNPs fef2e93a-f0f5-43a2-a3a8-e480181a1872

MAF > 0.05: 5648 SNPs d3d752a7-8efe-4c9f-95b0-85b399b3591a

I tried adding NULL SNPs as suggested (https://github.com/bcm-uga/pcadapt/issues/56) but but most of my real SNPs have p-values close to zero.

50000 NULL SNPs added to original dataset (11704 SNPs): f7cc611e-1cdd-40f5-b78c-28899482f232

50000 NULL SNPs added to dataset with only MAF > 0.05 (5646 SNPs): 7ac47643-2ef1-4c68-95b2-986d3c9297a0

What is the reason for this? I have also found a similar pattern when using a matrix of minor allele frequencies (type = "pool") (i.e., u-shaped p-value distribution when including low MAF, and a uniform distribution with a peak near zero when filtering for MAF > 0.05).

I have also tried different K-values, but it appears that the min.maf threshold is what determines the shape of the p-value distribution. Should the "null" SNPs be generated differently?

A

privefl commented 10 months ago

The U-shape is totally expected since we do genomic control by default, which makes p-values conversative when there is some true signal (cf. https://doi.org/10.1038%2Fejhg.2011.39). You can try to look at the raw p-values pchisq(res$stat, df = K, lower.tail = FALSE).

BTW, I would use K=5 from the scree plot you presented, but it would be also good to look at PC scores to see which PCs capture some pop structure.

akamolphat commented 10 months ago

Dear @privefl ,

Thank you so much for such a prompt reply. The plots below are with K = 5.

I looked at the raw p-values but am unsure if the p-value distributions are good enough for to apply FDR correction to them? I am getting a lot of outliers, especially with lower MAF filters. With the original dataset, about 21% of all the SNPs are identified as outliers with FDR < 0.05.

Original dataset (approx. MAF > 0.006), 11704 total SNPs. FDR < 0.05 = 2486 outliers, FDR < 0.1 = 2939 outliers. 5eb8686d-88d6-4f4f-9734-cf7cd3d5e850

MAF > 0.01, 10242 total SNPs. FDR < 0.05 = 1131 outliers, FDR < 0.1 = 1510 outliers. 1fa03611-4e89-4d84-8c29-a3b0cda17306-1

MAF > 0.05, 5648 total SNPs. FDR < 0.05 = 503 outliers, FDR < 0.1 = 671 outliers. f93e6367-5cfc-423b-955c-f1591e9c5957

You suggested that the scree plot suggest K = 5, but the PC plots actually suggest some population structure captured up to PC10 or so (see PC10 vs PC11 below). Population structure analyses using snmf (from LEA package) seems to suggest K = ~10. PC10 vs PC11 ecc69d3b-9ae6-47b5-a4c3-704613362b65

However, when I perform pcadapt with K = 10, but the p-value distribution becomes even more u-shaped (image below), even without GIF correction. a137cd93-d8c9-4cf3-98cd-b52f9a9c11af

I have also tried using MAF instead of individual genotype but that also resulted in an extremely u-shaped distribution, even without GIF correction (see image below). I used K = 3 here, as suggested from the screeplot but the u-shape is present for other K values as well. 9aaf4a70-48c0-4a1c-8c7b-19a83d7f5110-1

Yours sincerely, A

privefl commented 10 months ago

If your pops are perfectly seperated, then it is normal to get that many outliers; it is just too easy for some species. Other have reported the same kind of results here. But if you don't have many variants, and lots of them are outliers, I think it is always good to add e.g. 50K null variants.

privefl commented 10 months ago

Would you be able to send me your data? I would like to try something.

akamolphat commented 10 months ago

Dear @privefl ,

Thank you so much for such prompt reply.

I have sent you my data and the codes I have used to florian.prive.21@gmail.com. I sent you both the genotype data (.bed files) and the MAF data (.lfmm file).

Yours sincerely, A

privefl commented 10 months ago

I do not see any problem then. I guess it is just that your pops are too easily being differentiated.


What I tried:

akamolphat commented 10 months ago

Dear @privefl ,

Thank you so much for spending time on this and also providing the codes of what you have tried. Could I confirm my understanding of what PCAdapt does?

I am understanding that PCAdapt calculates how much each variant is associated with population structure, and variants which are very highly associated to the population structure will appear as outliers (in the chi-squared distribution of the Mahalanobis distances). These outliers are assumed to be indicative of local adaptation. Is that correct?

I wanted to confirm this to try to understand the large number of outliers. My dataset is highly structured and multiple (sub)populations have experienced very strong drift, and this would explain the large number of outliers.

Would this warrant performing PCAdapt on these (sub)populations separately?

Yours sincerely, A

privefl commented 10 months ago

Your understanding is right.

You can always run pcadapt on two populations at once to see what variants differentiate them, but I don't think I've seen pcadapt used like that before.