Open sandyplus opened 1 year ago
How many variants do you have as input? Have you checked https://github.com/bcm-uga/pcadapt/issues/56?
@privefl Thanks for your quick response.
Out of 7,856,218 SNPs, I identified 247,899 as outliers at a q-value threshold of 0.01, and 37,743 with the Bonferroni-adjusted p-value threshold. However, the Bonferroni outliers overlapped very little with those found by RDA and LFMM.
I reviewed the issue and confirmed that I have enough variants for pcadapt to calculate the Mahalanobis distance. Let me know if you need more information.
Best regards, Sandy
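For scale, here is a small base-R sketch of what those counts imply. It assumes the Bonferroni family-wise level is 0.05 (the value quoted for the red line in the plot description); the variable names are made up for illustration.

```r
# Counts reported in the comment above
n_snps       <- 7856218
n_qvalue     <- 247899   # outliers at q-value < 0.01
n_bonferroni <- 37743    # outliers at the Bonferroni threshold

alpha <- 0.05            # ASSUMPTION: family-wise level used for Bonferroni

# Raw p-value a single SNP must beat to pass Bonferroni
bonf_cutoff <- alpha / n_snps

# Share of all SNPs flagged by each criterion
frac_q    <- n_qvalue / n_snps
frac_bonf <- n_bonferroni / n_snps

# If the FDR is really controlled at 1%, roughly this many of the
# q-value outliers are expected to be false discoveries
expected_false_q <- 0.01 * n_qvalue

cat(sprintf("Bonferroni raw p cutoff: %.3g (-log10 = %.2f)\n",
            bonf_cutoff, -log10(bonf_cutoff)))
cat(sprintf("Flagged by q-value:      %.2f%% of SNPs\n", 100 * frac_q))
cat(sprintf("Flagged by Bonferroni:   %.2f%% of SNPs\n", 100 * frac_bonf))
cat(sprintf("Expected false discoveries at q < 0.01: about %.0f\n",
            expected_false_q))
```

Seen as fractions of the genome, a few percent of SNPs pass the q-value cutoff and well under one percent pass Bonferroni.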
FYI, I divided the chromosome into 200 parts and merged them as input for pcadapt, which may have caused some misordering of SNPs. Should I sort the SNPs according to their respective chromosomes?
Ok, the number of variants is not the problem then.
The next obvious suspect is LD. Are you capturing any LD in the PCA? You seem to be using only 2 PCs, though, so that seems unlikely. You can always try some pruning.
Any update on this?
@privefl I'm sorry for the delayed response. I was unable to detect any LD in the PCA. I also tried pruning with the command "pca <- pcadapt(ddl, K = bestk, method = "mahalanobis", min.maf = 0.05, LD.clumping = list(size = 200, thr = 0.1))", but the results were almost identical, with only minor variations.
Then, maybe the results are fine. How many samples do you have?
You might want to increase the size in LD.clumping to e.g. 10000, given the large number of variants you have.
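A hedged sketch of that comparison, run on the small example dataset bundled with pcadapt rather than on the data from this thread. The window sizes (50 and 500) are scaled-down stand-ins for 200 and 10000, and helper names such as count_outliers are invented here. Since size is documented as a window radius, the clumping presumably relies on neighboring rows being neighboring SNPs, so the input order may matter.

```r
# Runs only if pcadapt is installed; uses the package's bundled example data.
if (requireNamespace("pcadapt", quietly = TRUE)) {
  path  <- system.file("extdata", "geno3pops.bed", package = "pcadapt")
  input <- pcadapt::read.pcadapt(path, type = "bed")

  # Same call shape as in the thread, with only the clumping window changed.
  run_with_size <- function(size) {
    pcadapt::pcadapt(input, K = 2, method = "mahalanobis", min.maf = 0.05,
                     LD.clumping = list(size = size, thr = 0.1))
  }

  # Hypothetical helper: Bonferroni outliers at an assumed level of 0.05.
  # SNPs filtered out by min.maf have NA p-values and are ignored.
  count_outliers <- function(res) {
    sum(p.adjust(res$pvalues, method = "bonferroni") < 0.05, na.rm = TRUE)
  }

  res_small <- run_with_size(50)    # narrow window
  res_large <- run_with_size(500)   # wider window

  n_total     <- length(res_small$pvalues)
  n_out_small <- count_outliers(res_small)
  n_out_large <- count_outliers(res_large)

  cat("SNPs tested:", n_total, "\n")
  cat("Bonferroni outliers, size = 50: ", n_out_small, "\n")
  cat("Bonferroni outliers, size = 500:", n_out_large, "\n")
}
```

If the two outlier counts are close, widening the window is probably not what drives the number of outliers.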
I have 79 individuals. Alright, I will attempt to increase the size in LD.clumping and update you with the results. Thanks for your suggestion.
That's a very small sample. Do you have populations that separate perfectly on the PC plot?
Yes, I have 16 populations which are separated by K = 2. So I used K = 2 for pcadapt analysis.
They clearly separate into 3 groups. Maybe the results are just fine.
What does the histogram of p-values look like?
It looks good.
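For readers unsure what "good" means here: a well-behaved histogram is usually roughly flat, with an excess of small p-values near zero. The base-R sketch below illustrates this on simulated p-values (not the data from this thread). It uses Benjamini-Hochberg adjustment from p.adjust as a stand-in for q-values, and the mixture proportions are arbitrary assumptions.

```r
set.seed(1)

# SIMULATED p-values: mostly null (uniform) plus a minority of true signals
n_null <- 9500
n_alt  <- 500
pvals  <- c(runif(n_null), rbeta(n_alt, 0.05, 1))

# Histogram with 20 equal bins; under the null each bin holds about 5%
h <- hist(pvals, breaks = seq(0, 1, by = 0.05), plot = FALSE)
print(h$counts)

# Two ways of calling outliers from the same p-values
bh   <- p.adjust(pvals, method = "BH")          # FDR control (q-value stand-in)
bonf <- p.adjust(pvals, method = "bonferroni")  # family-wise error control

out_bh   <- which(bh < 0.01)
out_bonf <- which(bonf < 0.05)

cat("Outliers at BH-adjusted p < 0.01:        ", length(out_bh), "\n")
cat("Outliers at Bonferroni-adjusted p < 0.05:", length(out_bonf), "\n")
```

In this simulation the Bonferroni set is nested inside the FDR set, which matches the pattern reported in the thread, where the q-value criterion flags far more SNPs than Bonferroni.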
What's the status on this issue?
Hello, everyone,
I used pcadapt to identify environment-related outliers, but I obtained an excessive number of them. Is there anything I overlooked? Best regards, Sandy
PS: The code I used:
Here is the R session info:
Here is the plot output. The blue and light blue points are the outliers, while the black and grey points are the non-outliers. The grey dashed line marks the q-value threshold of 0.01, and the red line marks the Bonferroni-adjusted p-value threshold of 0.05. The purple points are outliers identified with other software, such as RDA or LFMM. The Y-axis shows the unadjusted p-value, transformed with -log10(). I have also checked the histogram of p-values, the QQ plot, the histogram of the Dj statistic, and the loading score plots, and everything seems to be okay.