Ella-Bowles commented 6 years ago

Hi Michael,

In reference to issue #23, I realized that I need to use the vcf because I need to get SNP IDs. Since I was having similar matrix issues, I modified my script to the following. It seems to work, but since I don't require markers to be present in all populations (the filter is something like 80% of pops), my scree plot is not nearly as clear. Thus, I just want to confirm that you agree with my PC choices/code here.

MY2 <- pcadapt::read.pcadapt("radiator_vcf_file_20181011@1107.vcf", type = "vcf")

x <- pcadapt(MY2,K=20)

plot(x,option="screeplot")

Scree_20181011.pdf

x<-pcadapt(MY2_pcadapt,K=2)

finding outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha)

Print the list of outlier SNPs require("vcfR") obj.vcfR <- vcfR::read.vcfR("radiator_vcf_file_20181011@1107.vcf") ID <- vcfR::getID(obj.vcfR) print(ID[outliers])

[1] "101998" "10372" "104168" "106023" "106272" "107429" "111635" "112013" "112424" "113626" "114268" "114340" [13] "114803" "115050" "117073" "117247" "11779" "118299" "120094" "121204" "121785" "122909" "123193" "124158" [25] "124450" "125202" "129890" "131385" "133410" "134685" "135443" "137676" "137816" "137971" "138181" "140914" [37] "142392" "144086" "14501" "146164" "147642" "147680" "148071" "148502" "149142" "149162" "151322" "151884" [49] "151943" "152766" "154957" "158934" "159555" "159758" "161121" "161423" "161849" "163537" "164746" "165396" [61] "165988" "16658" "167505" "18110" "18427" "18513" "20697" "21207" "21273" "21975" "22870" "23419" [73] "23573" "23922" "24507" "24660" "24810" "25537" "25728" "25845" "26214" "26374" "26613" "28130" [85] "28315" "28467" "29803" "30525" "30634" "30666" "30867" "30986" "31087" "31460" "31490" "31568" [97] "31592" "31676" "31698" "32077" "32413" "32591" "32696" "33963" "34116" "35039" "35647" "36093" [109] "36149" "36527" "37091" "37462" "37646" "38996" "39821" "40023" "40036" "41100" "41343" "42592" [121] "43104" "43316" "43646" "44098" "44248" "44591" "45024" "45363" "45561" "45604" "45741" "45745" [133] "45779" "45849" "46012" "47033" "47117" "48101" "48149" "49150" "49741" "49858" "50250" "51210" [145] "51734" "51920" "52095" "52374" "52616" "53129" "53654" "53939" "54255" "56406" "58169" "58317" [157] "58447" "59112" "59320" "59344" "60108" "60473" "60743" "61663" "62749" "62765" "62890" "63757" [169] "64077" "64764" "65083" "65101" "65299" "65549" "65838" "66417" "66451" "67065" "67467" "68652" [181] "68730" "68921" "69744" "7016" "70438" "70574" "7207" "73493" "73868" "73947" "74213" "74384" [193] "74583" "75382" "75485" "75662" "76075" "77337" "78455" "78758" "78934" "79000" "79028" "79431" [205] "79668" "79872" "80724" "81243" "81320" "81347" "82638" "82639" "82647" "83070" "84023" "85420" [217] "85559" "85692" "85792" "86890" "87509" "87846" "88060" "88509" "88638" "88941" "89130" "89812" [229] "90010" "90238" "90639" "91521" "92032" "92057" "92568" "92678" "92686" "93097" "93851" "94138" [241] "94139" "94734" "94856" "94924" "95202" "95389" "95466" "95754" "95765" "96590" "96921" "96942" [253] "96985" "97044" "97763" "97915"

VCF used is attached here radiator_vcf_file_20181011@1107.zip

With thanks,

mblumuga commented 6 years ago

Based on the scree plot, I would rather use K=8. However, if based on the score plot, only 2 axes are meaningful for the axes of differentiation you are interested in, it makes sense to look at K=2. BTW, you should have a look to the histogram of P-values to assess if calibration is OK.

privefl commented 6 years ago

Even K=9? You should also look at the loadings to see if they are normally distributed (without peaks).

Ella-Bowles commented 6 years ago

Thank you @mblumuga and @privelf. I looked at the scores plot, and see that up to 14 PCs contributes to variation between populations. However, using 8, 9 and 14 PCs give the same number of outliers. Strangely though, I find 820 outliers. This seems insanely high. When I look at the p-values, I see peaks at 0 (normal), but also at 1. Does this mean that we have not filtered the data correctly? Both the QQ and p-values plots are attached. MY_QQ_plot.pdf MY_pvalues.pdf

My code is as follows, in case you want it

require(pcadapt)

MY2 <- pcadapt::read.pcadapt("radiator_vcf_file_20181011@1107.vcf", type = "vcf")

Choosing the number of principal components.

x <- pcadapt(MY2,K=20)

plot(x,option="screeplot")

poplist.names <- c(rep("Blackfly",19),rep("BobsCove",29),rep("Cripple",28),rep("Ditchy",18),rep("Freshwater",29),rep("Hermitage",22),rep("LowerCoquita",22),rep("LowerOBeck",19),rep("OBeck",20),rep("Perdition",14),rep("STBC",30),rep("UCoquita",18),rep("Watern",31),rep("WhaleCove",25))

plot(x, option = "scores", pop = poplist.names)

plot(x, option = "scores", i = 3, j = 4, pop = poplist.names)

plot(x, option = "scores", i = 5, j = 6, pop = poplist.names)

plot(x, option = "scores", i = 7, j = 8, pop = poplist.names)

plot(x, option = "scores", i = 9, j = 10, pop = poplist.names)

plot(x, option = "scores", i = 11, j = 12, pop = poplist.names)

plot(x, option = "scores", i = 13, j = 14, pop = poplist.names)

plot(x, option = "scores", i = 15, j = 16, pop = poplist.names)

plot(x, option = "scores", i = 17, j = 18, pop = poplist.names)

plot(x, option = "scores", i = 19, j = 20, pop = poplist.names)

$K=14$ is the good option.

Without removal of outliers --just showing code with k=8 here, but did with k = 9 and 14 as well

x2<-pcadapt(MY2,K=8)

plot(x2, option="manhattan")

plot(x2, option = "qqplot", threshold = 0.1)

hist(x2$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")

plot(x2, option = "stat.distribution")

finding outliers

http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("qvalue")

require(qvalue) qval2 <- qvalue(x2$pvalues)$qvalues alpha1 <- 0.1 outliers2 <- which(qval<alpha)

Print the list of outlier SNPs require("vcfR") obj.vcfR2 <- vcfR::read.vcfR("radiator_vcf_file_20181011@1107.vcf") ID2 <- vcfR::getID(obj.vcfR) print(ID[outliers2])

With thanks,

mblumuga commented 6 years ago

When I look at the p-values, I see peaks at 0 (normal), but also at 1

Nothing to worry, it might be SNPs with low maf or uniformly distributed across the landscape.

I cannot say if 820 is large or small but be aware that genome scan approaches are known to have a large proportion of False Positives. If you have well defined populations, I suggest you run FLK in addition to pcadapt. Then you have 2 strategies, you can take either

the intersection of outliers to be conservative or
combine pvalues using Fisher's methods and recalibrate p-values using genomic inflation factors Fisher_stat <- (-2)*(log(pval_pcadapt) + log(pval_flk)) gif <- median(Fisher_stat)/qchisq(0.5,df=4) pval_Fisher <- pchisq(Fisher_stat/gif,df=4,lower.tail=F)

Hope it helps

Ella-Bowles commented 6 years ago

Thanks very much @mblumuga!

privefl commented 6 years ago

I don't think there is anything wrong with your results. It seems that you have a lot of structure, defined by a lot of SNPs. The over-representation of 1 in the histogram is normal. If you have many non-null SNPs, the genomic inflation factor will be inflated. See e.g. this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137506/. So, I think the results are conservative here.

The question is: is it normal to have so much structure in your data?

privefl commented 6 years ago

For example, try adding 50,000 null SNPs:

mat <- pcadapt::bed2matrix(MY2)
# Verif
hist(colMeans(mat, na.rm = TRUE))
hist(colMeans(is.na(mat)))
hist(rowMeans(is.na(mat)))

# adding some null SNPs
N <- nrow(mat)
M <- 50e3
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {
  rbinom(N, size = 2, prob = af)
})
mat2 <- cbind(mat, mat_null)
MY2 <- read.pcadapt(mat2, type = "lfmm")

You'll get a more uniform p-value histogram. (It's even better with 200,000)

bcm-uga / pcadapt

confirming that result is sound #24

finding outliers

Choosing the number of principal components.

Without removal of outliers --just showing code with k=8 here, but did with k = 9 and 14 as well

finding outliers