bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

follow-up to issues #23 and #25, I can't visualize score plots with and without outliers #27

Closed Ella-Bowles closed 5 years ago

Ella-Bowles commented 5 years ago

This issue is I think the last piece of issues #23 and #25 (both closed issues) that I have been discussing here.

The following code works fine for getting outliers:

require(pcadapt)

path_to_file <- "D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf" Chali_matrix <- read.pcadapt(path_to_file,type="vcf")

Choosing the number of principal components.

x <- pcadapt(Chali_matrix,K=20) plot(x,option="screeplot")

K=2 is the good option.

poplist.names <- c(rep("Cha03",30), rep("Cha15",49)) x<-pcadapt(x,K=2) require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha)

plot(x,option="scores",pop=poplist.names)

Print the list of outlier SNPs

require("vcfR") obj.vcfR <- vcfR::read.vcfR("ChaliNoOddsWithOuts.vcf") ID <- vcfR::getID(obj.vcfR) print(ID[outliers])

However, I need to be able to visualize the score plots with and without outliers, which I used to be able to do using the following code that MBlum wrote. This code is now broken (Issue #25), which happened after some update I believe, though not sure if it was R, Rstudio, etc... I note below where the errors occur. require(pcadapt)

path_to_file <- "D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf" Chali_matrix <- read.pcadapt(path_to_file,type="vcf")

Choosing the number of principal components.

x1 <- pcadapt(Chali_matrix,K=20) plot(x,option="screeplot")

K=2 is the good option.

Load the data matrix in R (useful to remove outliers)

data <- as.matrix(read.table("ChaliNoOddsWithOuts.pcadapt"))

Compare pop. structure with and without outliers

poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x<-pcadapt(data, K=2)

Error in UseMethod("pcadapt") : no applicable method for 'pcadapt' applied to an object of class "pcadapt"

With removal of outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) xr<-pcadapt(data[-outliers,],K=2)

Error in UseMethod("pcadapt") : no applicable method for 'pcadapt' applied to an object of class "c('matrix', 'integer', 'numeric')"

did not run rest of this

par(mfrow=c(1,2)) plot(x,option="scores",pop=poplist.names) plot(xr,option="scores",pop=poplist.names)

Print the list of outlier SNPs

require("vcfR") obj.vcfR <- vcfR::read.vcfR("ChaliNoOddsWithOuts.vcf") ID <- vcfR::getID(obj.vcfR) print(ID[outliers])

I have tried to implement the following solution, proposed in issue #25 , which is now closed. But it doesn’t work, I still get the following error message. Note that I use the temporary PCAdapt file which is created now in my C drive, instead of the PCAdapt file that used to be created and put in the same directory that I am working in.

Will only include relevant sections of code.

data <- as.matrix(read.table("file2c1c47966ff6.pcadapt")) toto <- read.pcadapt(data,type="pcadapt")

Compare pop. structure with and without outliers

poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x<-pcadapt(toto,K=2)

With removal of outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) xr<-pcadapt(toto[-outliers,],K=2)

Error in UseMethod("pcadapt") : no applicable method for 'pcadapt' applied to an object of class "c('matrix', 'integer', 'numeric')"

Any thoughts?

With many thanks, Ella

privefl commented 5 years ago

Please format your code by surrounding it with three backticks. And try to provide a minimal reproducible example (less code, with the data).

I think this is the same issue as in the other issues. You need to use read.pcadapt() before using pcadapt().

Ella-Bowles commented 5 years ago

I did use read.pcadapt before using pcadapt, but still got the errors. From the last part of my message, here is the code from after I selected the number of PCs to retain. I also attach the PCAdapt file as well as the input vcf. ChaliNoOddsWithOuts.zip

'''data <- as.matrix(read.table("file2c1c47966ff6.pcadapt"))''' '''toto <- read.pcadapt(data,type="pcadapt")'''

Compare pop. structure with and without outliers

'''poplist.names <- c(rep("Cha03",30), rep("Cha15",49))'''

Without removal of outliers

'''x<-pcadapt(toto,K=2)'''

With removal of outliers

'''require(qvalue)''' '''qval <- qvalue(x$pvalues)$qvalues''' '''alpha <- 0.1''' '''outliers <- which(qval<alpha)''' '''xr<-pcadapt(toto[-outliers,],K=2)'''

Error in UseMethod("pcadapt") : no applicable method for 'pcadapt' applied to an object of class "c('matrix', 'integer', 'numeric')"

privefl commented 5 years ago

Basically, when you're subsetting the pcadapt_matrix using toto[-outliers,], it becomes a simple matrix again. Because the subsetting is defined only for a matrix, so you get a matrix back.

data <- matrix(1:6, 2)
class(data)        ## matrix
toto <- pcadapt::read.pcadapt(data)
class(toto)        ## pcadapt_matrix
class(toto[-1, ])  ## matrix
Ella-Bowles commented 5 years ago

Am I meant to use the following then? It still gives me the error. .... toto <- pcadapt::read.pcadapt(data) xr<-pcadapt(toto[-outliers,],K=2) ...

privefl commented 5 years ago

You need to use

toto_subset <- read.pcadapt(toto[-outliers, ], type = "lfmm")
xr <- pcadapt(toto_subset, K = 2)

Type "lfmm" means that rows of your matrix are observations and columns are SNPs. By default, it's type "pcadapt" which is the transpose of "lfmm".

Ella-Bowles commented 5 years ago

It still isn't working. I really hope this isn't just me.

Given that I have already converted the vcf to a pcadapt file and figured out the number of outliers to retain, I run `data <- as.matrix(read.table("file2c1c47966ff6.pcadapt")) toto <- read.pcadapt(data,type="pcadapt")

poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x<-pcadapt(toto,K=2)

With removal of outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) toto_subset <- read.pcadapt(toto[-outliers, ], type = "lfmm") xr <- pcadapt(toto_subset, K = 2)

par(mfrow=c(1,2)) plot(x,option="scores",pop=poplist.names) plot(xr,option="scores",pop=poplist.names)`

And I get

Error in $<-.data.frame(*tmp*, "Pop", value = c("Cha03", "Cha03", : replacement has 79 rows, data has 62

And the plot that I get out of plot(x,option="scores",pop=poplist.names) is very strange. I attach it here x plot with outlier included_weird

Something seems to be weird with the data <- as.matrix(read.table("file2c1c47966ff6.pcadapt")) step.

I say this, because if I run the following code (skipping the step to get the num of PCs to retain, b/c I've already done it a million times) path_to_file <- "D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf" Chali_matrix <- read.pcadapt(path_to_file,type="vcf") `poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x<-pcadapt(Chali_matrix,K=2) require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) par(mfrow=c(1,2)) plot(x,option="scores",pop=poplist.names)` I get a very normal looking plot with outliers included, attached x with outliers included--normal

But using this last way, I still can't get the xr step to work.

Ack....

Ella-Bowles commented 5 years ago

I'll attach my vcf file here in case you want to just try to run it ChaliNoOddsWithOuts (2).zip

privefl commented 5 years ago

1/ You need to use toto[, -outliers]. SNPs (variables) are columns. Because it's more efficient, and it's the convention for many fields of data science.

2/ There is clearly something wrong with a plot where PC1 takes only positive values. The problem is that you have missing values encoded as 9 in your data. Instead of reading the pcadapt file yourself, please use bed2matrix() instead.

Ella-Bowles commented 5 years ago

Thank you, and trying your solutions.

Given the path where my pcadapt files are written, which is below the code that I use to read in and convert the file below `> path_to_file <- "D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf"

Chali_matrix <- read.pcadapt(path_to_file,type="vcf")` No variant got discarded. Summary:

- input file:               D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf
- output file:              C:\Users\Admin\AppData\Local\Temp\Rtmp6rrGdS\file3a8c8bc661.pcadapt

- number of individuals detected:   79
- number of loci detected:      8513

8513 lines detected. 79 columns detected.

For bed2matrix I first try: try <- bed2matrix("C:/Users/Admin/AppDataLocal/Temp/Rtmp6rrGdS/file3a8c8bc661.pcadapt.bed")

And get "Error in file2other(input, type, match.arg(type.out), match.arg(allele.sep)) : File C:/Users/Admin/AppDataLocal/Temp/Rtmp6rrGdS/file3a8c8bc661.pcadapt.bed does not exist."

So, I go to where my pcadapt file is output and copy the bed file into my current directory. I then do the following: data <- bed2matrix(file3a8c8bc661.pcadapt.bed)

And get Error in read.pcadapt(bedfile, type = "bed") : object 'file3a8c8bc661.pcadapt.bed' not found What am I doing wrong?

Even though dir() clearly shows that the file is in my working directory.

Thoughts?

privefl commented 5 years ago

Simply try bed2matrix(path_to_file).

Ella-Bowles commented 5 years ago

If you mean for me to use this on my initial read in, so path_to_file <- "D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf"

And then bed2matrix(path_to_file)

My initial file is not a binary PED file, and so I get an error telling me this.

If you mean for me to use bed2matrix("C:/Users/Admin/AppDataLocal/Temp/Rtmp6rrGdS/file3a8c8bc661.pcadapt.bed")

I did this trying both using the quotes and not using the quotes. Using the quotes, which I think is the correct way to input a filepath, I get Error in data.table::fread(sub("\.bed$", ".bim", input)) : File 'C:/Users/Admin/AppData/Local/Temp/Rtmp4GhfHO/file187ce73303a.pcadapt.bim' does not exist or is non-readable. getwd()=='D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls'

I also tried path_to_file2 <- "C:/Users/Admin/AppData/Local/Temp/Rtmp4GhfHO/file187ce73303a.pcadapt.bed" bed2matrix(path_to_file2)

And got Error in data.table::fread(sub("\.bed$", ".bim", input)) : File 'C:/Users/Admin/AppData/Local/Temp/Rtmp4GhfHO/file187ce73303a.pcadapt.bim' does not exist or is non-readable. getwd()=='D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls'

I feel a bit like we're chasing something that just isn't working. Is there any chance that you could try to work the following on my vcf file to see if it works on your end? This is the original script that was working this summer. ChaliNoOddsWithOuts.zip

`require(pcadapt)

path_to_file <- "D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls/ChaliNoOddsWithOuts.vcf" Chali_matrix <- read.pcadapt(path_to_file,type="vcf") No variant got discarded. Summary:

Choosing the number of principal components.

x <- pcadapt(Chali_matrix,K=20) plot(x,option="screeplot")

$K=2$ is the good option.

Load the data matrix in R (useful to remove outliers)

data <- as.matrix(read.table("ChaliNoOddsWithOuts.pcadapt"))

Compare pop. structure with and without outliers

poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x<-pcadapt(data,K=2)

With removal of outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) xr<-pcadapt(data[-outliers,],K=2) par(mfrow=c(1,2)) plot(x,option="scores",pop=poplist.names) plot(xr,option="scores",pop=poplist.names)

Conclusion: population structure changes when you remove outliers.

Print the list of outlier SNPs

require("vcfR") obj.vcfR <- vcfR::read.vcfR("ChaliNoOddsWithOuts.vcf") ID <- vcfR::getID(obj.vcfR) print(ID[outliers])`

privefl commented 5 years ago

If you're not happy with the new version, feel free to install an older version from either GitHub or CRAN.

mblumuga commented 5 years ago

@Ella-Bowles Did you try filename <- read.pcadapt(path_to_file, type = "vcf") Otherwise what about converting your vcf to a ped or a bed using vcftools and then using pcadapt?

privefl commented 5 years ago

With your data:

vcf <- "tmp-data/ChaliNoOddsWithOuts.vcf"
bed <- pcadapt::read.pcadapt(vcf, type = "vcf")
mat <- pcadapt::bed2matrix(bed)
dim(mat)  
# 79 x 8513
table(mat, exclude = NULL)
#      0      1      2   <NA> 
# 457085 144253  31061  40128 
mblumuga commented 5 years ago

@Ella-Bowles : Did you manage to use the script written by @privefl ?

Ella-Bowles commented 5 years ago

Ack, I tried, but still get errors.

When I try `> vcf <- "ChaliNoOddsWithOuts.vcf"

bed <- pcadapt::read.pcadapt(vcf, type = "vcf") mat <-pcadapt::bed2matrix(bed)`

I get

From the read.pcadapt step No variant got discarded. Summary:

- input file:               ChaliNoOddsWithOuts.vcf
- output file:              C:\Users\Admin\AppData\Local\Temp\Rtmp2VKEtd\file2c8810883b77.pcadapt

- number of individuals detected:   79
- number of loci detected:      8513

8513 lines detected. 79 columns detected.

and then, from the bed2matrix step Error in read.pcadapt(bedfile, type = "bed") : Input should be a file path or a matrix-like object.

As I would expect, given where the file is being outputted.

When I try > mat <- pcadapt::bed2matrix("C:/Users/Admin/AppData/Local/Temp/Rtmp2VKEtd/file2c8856dd5add.pcadapt.bed")

I get Error in data.table::fread(sub("\.bed$", ".bim", input)) : File 'C:/Users/Admin/AppData/Local/Temp/Rtmp2VKEtd/file2c8856dd5add.pcadapt.bim' does not exist or is non-readable. getwd()=='D:/Documents/PostDoc/Walleye/stacksOutptAndAnalys/stacks_r12/sta_r12_pairwise_WithoutOddBalls'

Do you think I should try some different installation of R studio or PCAdapt, or can you see a different solution? Currently I am using PCAdapt v4.0.3 and R v3.55.1.

My apologies for all the back-and-forth. And many thanks for your help.

Ella

mblumuga commented 5 years ago

Can you send me a link to the vcf file or to a vcf that contains a small subset of the data?

Ella-Bowles commented 5 years ago

Sure, here you go ChaliNoOddsWithOuts (2).zip

mblumuga commented 5 years ago
require(pcadapt)
data <- read.pcadapt("ChaliNoOddsWithOuts.vcf",type="vcf",type.out="matrix")
dim(data)
[1]   79 8513
#remove loci 100 to 110
data <- data[,-c(100:110)]
dim(data)
[1]   79 8502
#Because individuals are in lines and SNPs in columns use type="lfmm"
genotypes <- read.pcadapt(data,type="lfmm")
res <- pcadapt(genotypes,K=20) 
Ella-Bowles commented 5 years ago

Thank you, and I think we're almost there. I got what you have given me to work, using the reduced dataset that you generate here, and I think I just need to sort out the structure of one more command. Okay, so I run: `require(pcadapt) data <- read.pcadapt("ChaliNoOddsWithOuts.vcf",type="vcf",type.out="matrix") dim(data)

remove loci 100 to 110

data <- data[,-c(100:110)] dim(data)

Because individuals are in lines and SNPs in columns use type="lfmm"

genotypes <- read.pcadapt(data,type="lfmm")

res <- pcadapt(genotypes,K=20) plot(res, option ="screeplot")

k=2 is the good option

Compare pop. structure with and without outliers

poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x <- pcadapt(genotypes,K=2)

With removal of outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) xr<-pcadapt(genotypes[,-outliers],K=2)`

And all is well until this last command. Here I get the error Error in UseMethod("pcadapt") : no applicable method for 'pcadapt' applied to an object of class "c('matrix', 'integer', 'numeric')"

I think the [,-outliers] should be correct here, given that the structure of the genotypes matrix should be individuals in rows and SNPs in columns. Thoughts?

privefl commented 5 years ago

I've updated the package so that the error you get is more explicit.

Ella-Bowles commented 5 years ago

AMAZING, IT WORKS!!

Just to confirm, the following looks good to you?! `require(pcadapt) data <- read.pcadapt("ChaliNoOddsWithOuts.vcf",type="vcf",type.out="matrix")

Because individuals are in lines and SNPs in columns use type="lfmm"

genotypes <- read.pcadapt(data,type="lfmm")

res <- pcadapt(genotypes,K=20) plot(res, option ="screeplot")

Compare pop. structure with and without outliers

poplist.names <- c(rep("Cha03",30), rep("Cha15",49))

Without removal of outliers

x <- pcadapt(genotypes,K=2)

With removal of outliers

require(qvalue) qval <- qvalue(x$pvalues)$qvalues alpha <- 0.1 outliers <- which(qval<alpha) x1<-read.pcadapt(data[,-outliers],type="lfmm") xr<-pcadapt(x1,K=2)`

mblumuga commented 5 years ago

Glad that you managed to make it work.