Too many NA values - Githubissues

bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4

https://bcm-uga.github.io/pcadapt

39 stars 10 forks source link

Too many NA values #90

Closed Lin012345 closed 5 months ago

Lin012345 commented 5 months ago

Dear all， I am using pcadapt to identify loci under natural selection in the population, but there are a large number of NA values. I would like to know what might have caused this, whether it is related to the quality of my data, or if there are other aspects that I have overlooked. Best regards

privefl commented 5 months ago

I'm not sure what I am looking at. Could you give more explanation please?

Lin012345 commented 5 months ago

I apologize for not providing detailed information. I have 679,602 loci, and after calculating the p-values with pcadapt, I found that there are a large number of p-values that were not calculated. plot_m_clean<- na.omit(plot_m) removed 299,224 rows with NA values. With the most sincere thanks and the code I used

setwd("D:/data/10_pcadapt/all")
library(pcadapt)
library(data.table)
path_to_file <- "3group.bed"
filename <- read.pcadapt(path_to_file, type = "bed")
poplist.names <- c(rep("1", 104),rep("2", 100),rep("3", 31))
print(poplist.names)
y <- pcadapt(input = filename, K = 20) 
plot(y, option = "scores",i=1,j=2,pop = poplist.names)
plot(y, option = "screeplot")

y <- pcadapt(input = filename, K = 2) 
plot(y, option = "scores", pop = poplist.names)
plot(y,option = "manhattan")
par(mfrow = c(2, 2))
for (i in 1:2)
  plot(y$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
library(qvalue)
y$qvalue <- qvalue(y$pvalues)
hist(y$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "#FFD369")
plot(y, option = "qqplot")
plot(y, option = "stat.distribution")

library(qvalue)
qval <- qvalue(y$pvalues)$qvalues
alpha <- 0.01
outliers <- which(qval < alpha)
length(outliers)
significant_markers <- logical(length(pca$pvalues))
significant_markers[outliers] <- TRUE
y$significant_markers <- significant_markers

pos <- fread("3group.bim", header = F)
plot_m <- data.frame("SNP" = pos$V2, "CHR" = pos$V1, "BP" = pos$V4, "pvalues" = y$pvalues)
n_before <- nrow(plot_m)
plot_m_clean <- na.omit(plot_m)
n_after <- nrow(plot_m_clean)
cat("Removed", n_before - n_after, "rows with NA values\n")
write.table(plot_m_clean, file = "3group.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

privefl commented 5 months ago

The ones for which you do not get p-values should be the ones with a MAF that is too low.

xiangping2023 commented 5 months ago

The ones for which you do not get p-values should be the ones with a MAF that is too low.

After we use the MAF>0.05 standard for filtering, the SNP of 46W still has 11W of NA, how can we solve this situation?

privefl commented 5 months ago

Could you send me the resulting pcadapt object (as an RDS file).

Lin012345 commented 5 months ago

Could you send me the resulting pcadapt object (as an RDS file).

Thank you very much for your reply. I have discovered that I cannot upload an RDS file here. so I've compressed it instead, but I am not sure if that is what you want. If there is anything else you need, please let me know. Again, thank you very much for your patience. group_maf.zip

privefl commented 5 months ago

Can't download it for some reason. Send me the RDS file to florian.prive.21@gmail.com.

privefl commented 5 months ago

> res <- readRDS("tmp-data/group_maf.rds")
> str(res)
List of 11
 $ scores         : num [1:235, 1:2] -0.00335 -0.04187 -0.04297 -0.03627 -0.03708 ...
 $ singular.values: num [1:2] 0.1026 0.0817
 $ loadings       : num [1:676053, 1:2] 0.00155 -0.000789 NA NA 0.001033 ...
 $ zscores        : num [1:676053, 1:2] 1.569 -0.75 NA NA 0.993 ...
 $ af             : num [1:676053] 0.94 0.756 0.972 1 0.915 ...
 $ maf            : num [1:676053] 0.0596 0.2436 0.0277 0 0.0851 ...
 $ chi2.stat      : num [1:676053] 4.468 0.989 NA NA 1.569 ...
 $ stat           : num [1:676053] 6.06 1.34 NA NA 2.13 ...
 $ gif            : num 1.36
 $ pvalues        : num [1:676053] 0.107 0.61 NA NA 0.456 ...
 $ pass           : int [1:379265] 1 2 5 8 12 15 16 20 21 22 ...
 - attr(*, "K")= num 2
 - attr(*, "method")= chr "mahalanobis"
 - attr(*, "min.maf")= num 0.05
 - attr(*, "class")= chr "pcadapt"

From what I can see, e.g. variants 3 & 4 have indeed been removed due to the MAF threshold.

Lin012345 commented 5 months ago

> res <- readRDS("tmp-data/group_maf.rds")
> str(res)
List of 11
 $ scores         : num [1:235, 1:2] -0.00335 -0.04187 -0.04297 -0.03627 -0.03708 ...
 $ singular.values: num [1:2] 0.1026 0.0817
 $ loadings       : num [1:676053, 1:2] 0.00155 -0.000789 NA NA 0.001033 ...
 $ zscores        : num [1:676053, 1:2] 1.569 -0.75 NA NA 0.993 ...
 $ af             : num [1:676053] 0.94 0.756 0.972 1 0.915 ...
 $ maf            : num [1:676053] 0.0596 0.2436 0.0277 0 0.0851 ...
 $ chi2.stat      : num [1:676053] 4.468 0.989 NA NA 1.569 ...
 $ stat           : num [1:676053] 6.06 1.34 NA NA 2.13 ...
 $ gif            : num 1.36
 $ pvalues        : num [1:676053] 0.107 0.61 NA NA 0.456 ...
 $ pass           : int [1:379265] 1 2 5 8 12 15 16 20 21 22 ...
 - attr(*, "K")= num 2
 - attr(*, "method")= chr "mahalanobis"
 - attr(*, "min.maf")= num 0.05
 - attr(*, "class")= chr "pcadapt"

From what I can see, e.g. variants 3 & 4 have indeed been removed due to the MAF threshold.

I understand, your suggestions are very helpful, thank you very much.