Too many outliers - Githubissues

sandyplus commented 1 year ago

Hello, everyone,

I used pcadapt to identify environment-related outliers, but I obtained an excessive number of them. Is there anything I overlooked? Best regards, Sandy

PS: The code I used:

library(pcadapt)
library(qvalue)
library(data.table)

# Read input file name and best K value from command line arguments
infile <- "imputation.merged.lfmm"
posfile <- "imputation.merged.pos"
bestk  <- 2
outpre <- gsub(".lfmm","",basename(infile))
outdir <- "output"
dir.create(outdir, showWarnings = FALSE)

# Read data using read.pcadapt function with "lfmm" type
ddl <- read.pcadapt(infile, type = "lfmm")

# Read "pos" column from the input file and set column name to NULL
pos <- fread(posfile, header = F)
print(paste("dim of this input is", paste(as.character(dim(ddl)), collapse = "x")))

# Plot the first screeplot of the first division
pca <- pcadapt(ddl, K = 10, method = "mahalanobis", min.maf = 0.05)
pdf(paste0(outdir, "/", "outplots_", outpre, ".screeplot", ".K10", ".pdf"))
plot(pca, option = "screeplot")
dev.off()

# Perform pcadapt analysis with the best K value
pca     <- pcadapt(ddl, K = bestk, method = "mahalanobis", min.maf = 0.05)
pca$pos <- pos
pca$qvalue <- qvalue(pca$pvalues)

# Plot the histogram of p-values
outpdf  <- paste0(outdir, "/", "outplots_", outpre, ".hist", paste0(".K", bestk), ".pdf")
pdf(outpdf)
hist(pca$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
dev.off()

# Plot the QQ plot
pdf("output_sandy_Ccar2.5/Ccar.qqplot_k2.pdf")
plot(pca, option = "qqplot")
dev.off()

# Plot the histogram of Dj statistic
pdf("output_sandy_Ccar2.5/Ccar.Dj_hist.pdf")
plot(pca, option = "stat.distribution")
dev.off()

# Plot the loading scores for the top 2 principal components
pdf("output_sandy_Ccar2.5/Ccar.loading.pdf")
par(mfrow = c(2, 1))
for (i in 1:2)
  plot(pca$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
dev.off()

# Save the pcadapt results in outpca
# Merge pos, pvalues, and qvalues variables
out_dd <- data.frame("pos" = pca$pos, "pvalues" = pca$pvalues, "qvalues" = pca$qvalue$qvalues)
outpca = paste0(outdir, "/", "outplots_", outpre, paste0(".K", bestk), ".pcadapt.qvalue.txt")
write.table(out_dd, file = outpca, sep = "\t", quote = F, row.names = F, col.names = T)

saveRDS(pca, paste0(outpre, ".", bestk, ".RDS"))

Here is the R session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/software/R/4.0.3/lib64/R/lib/libRblas.so
LAPACK: /opt/software/R/4.0.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pcadapt_4.3.3      stringr_1.5.0      qvalue_2.22.0      RColorBrewer_1.1-2
[5] data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       magrittr_2.0.3   splines_4.0.3    tidyselect_1.2.0
 [5] munsell_0.5.0    colorspace_2.0-0 R6_2.5.0         rlang_1.1.0
 [9] fansi_0.4.1      plyr_1.8.6       dplyr_1.1.1      tools_4.0.3
[13] grid_4.0.3       gtable_0.3.0     utf8_1.1.4       cli_3.6.1
[17] tibble_3.2.1     lifecycle_1.0.3  reshape2_1.4.4   ggplot2_3.4.2
[21] vctrs_0.6.1      glue_1.6.2       stringi_1.5.3    compiler_4.0.3
[25] pillar_1.9.0     generics_0.1.0   scales_1.2.1     pkgconfig_2.0.3

Here is the plot output. The blue and light blue colored points indicate the outliers, while the black and grey colored points represent the non-outliers. The grey dashed line represents the threshold q-value of less than 0.01, and the red line represents the threshold Bonferroni adjusted p-value of less than 0.05. Additionally, the purple points indicate outliers that were identified using other software, such as RDA or LFMM. The Y-axis represents the unadjusted p-value, which has been transformed using the -log10() function. I have also checked these plots, and everything seems to be okay. The plots include a histogram of p-values, a QQ plot, a histogram of Dj statistic, and loading scores plots.

privefl commented 1 year ago

How many variants do you have as input? Have you checked https://github.com/bcm-uga/pcadapt/issues/56?

sandyplus commented 1 year ago

@privefl Thanks for your quick response.

I identified 247,899 SNPs as outliers using a Q-value threshold of < 0.01, and 37,743 using the Bonferroni adjusted p-value threshold out of 7,856,218 SNPs. However, there was little overlap with RDA and LFMM with the Bonferroni method.

I reviewed the issue and confirmed that I have enough variants for pcadapt to calculate the mahalanobis distance. Let me know if you need more information.

Best regards, Sandy

sandyplus commented 1 year ago

FYI, I divided the chromosome into 200 parts and merged them as input for pcadapt, which may have caused some misordering of SNPs. Should I sort the SNPs according to their respective chromosomes?

privefl commented 1 year ago

Ok, the number of variants is not the problem then.

The other obvious next issue can be LD. Are you capturing any LD in the PCA? But you seem to be using only 2 PCs, so that seems unlikely.. You can always try to use some pruning.

privefl commented 1 year ago

Any update on this?

sandyplus commented 1 year ago

@privefl I'm sorry for the delayed response. I was unable to detect any LD in the PCA. Additionally, I attempted to implement prune using command "pca <- pcadapt(ddl, K = bestk, method="mahalanobis",min.maf=0.05, LD.clumping = list(size = 200, thr = 0.1))", but the outcomes were almost identical with minor variations.

privefl commented 1 year ago

Then, maybe the results are fine. How many samples do you have?

You might want to increase the size in LD.clumping to e.g. 10000, given the large number of variants you have.

sandyplus commented 1 year ago

I have 79 individuals. Alright, I will attempt to increase the size in LD.clumping and update you with the results. Thanks for your suggestion.

privefl commented 1 year ago

That's a very small sample. Do you have populations that separate perfectly on the PC plot?

sandyplus commented 1 year ago

Yes, I have 16 populations which are separated by K = 2. So I used K = 2 for pcadapt analysis.

privefl commented 1 year ago

They are seperated in 3 groups clearly. Maybe results are just fine.

How does the histogram of pvalues look like?

sandyplus commented 1 year ago

It looks good.

privefl commented 1 year ago

What's the status on this issue?

bcm-uga / pcadapt

Too many outliers #81