bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

errors when I read a pcadapt file, and also when I try to convert to a matrix #23

Closed Ella-Bowles closed 5 years ago

Ella-Bowles commented 5 years ago

Hi Michael,

I'm running PCAdapt on a colleague's data, and I am getting a couple of errors that I am wondering if you have seen before. The data are GBS dtata from 12 different populations. Lots and lots of missing data in this dataset, with ~1500 shared markers between populations. As for processing, post-stacks processing was done in radiator, and I generated a PCAdapt file using that package. write_pcadapt("brook_char_tidy_maf.tsv", pop.select = c("1","2","3","4","5","6","7","8","9","10","11","14","16","17"), snp.ld = NULL, maf.thresholds = NULL, filename = NULL, parallel.core = parallel::detectCores() - 1)

pop.select: Using markers common in all populations: Number of markers before/blacklisted/after:4614/3067/1547 Scanning for monomorphic markers... Number of markers before/blacklisted/after: 1547/0/1547 writing pcadapt file with: Number of populations: 14 Number of individuals: 324 Number of markers: 1547

I attach all the files, including the tidy, and PCAdapt files.

However, when I try to read in to PCAdapt, I get the following error: MY <- pcadapt::read.pcadapt("radiator_pcadapt_20181002@1301.txt", type = "pcadapt")

1547 lines detected. 324 columns detected. Warning message: Only one 'return' characters detected, yet Windows

However, I checked to make sure that the output file is in unix format, and it reads as only ASCII characters. I am running R and PCAdapt on a PC however. I also tried uninstalling and re-installing PCAdapt, and my result hasn't changed.

I can still do a bunch of stuff with the files generated, including looking at the scree plot

And I was able to see that outlier loci exist using qvalue. However, when i execute the following, so that I can do the full outlier analysis data <- as.matrix(read.table("radiator_pcadapt_20181002@1301.txt.bed"))

I get: Warning messages: 1: In read.table("radiator_pcadapt_20181002@1301.txt.bed") : line 1 appears to contain embedded nulls 2: In read.table("radiator_pcadapt_20181002@1301.txt.bed") : line 2 appears to contain embedded nulls 3: In read.table("radiator_pcadapt_20181002@1301.txt.bed") : line 3 appears to contain embedded nulls 4: In read.table("radiator_pcadapt_20181002@1301.txt.bed") : incomplete final line found by readTableHeader on 'radiator_pcadapt_20181002@1301.txt.bed' 5: In read.table("radiator_pcadapt_20181002@1301.txt.bed") : incomplete final line found by readTableHeader on 'radiator_pcadapt_20181002@1301.txt.bed'

And so there seems to be something wrong with the files.

I'm attaching the same files here as I attached to the email. brook_char.zip

Any thoughts?

With thanks, Ella

PS Also, if am assessing outliers among a number of populations (say i use 14 pops), is it important to use only loci that present in all the populations? My understanding is that it may depend on ones question, but that it is best if the majority of markers are in all populations so that the PCA can place populations correctly.

privefl commented 5 years ago

The main problem is that you're trying to read the .bed file (binary PLINK file generated by pcadapt in new versions) rather than the .txt file.

Concerning missing values, I don't think pcadapt could detect outlier SNPs that have data only for one pop. But I could be wrong. Any thought @mblumuga? One sure thing is that you should filter when there are too many missing values (e.g. > 50% or something).

mblumuga commented 5 years ago

Since it is not a large data file, you do not have to use bed file.

I suggest you read the file as a matrix MY <- read.table("radiator_pcadapt_20181002@1301.txt") then you use the read.pcadapt function MY_pcadapt <- read.pcadapt(MY, type = "pcadapt") Doing that, you can run the "full outlier analysis" (I do not what it is) using MY which is a standard R data frame.

Ella-Bowles commented 5 years ago

Thank you!