bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

read.pcadapt function broken from Github version #8

Closed devonorourke closed 6 years ago

devonorourke commented 6 years ago

Was working with CRAN version initially and tried switching to Github version to see if there was any difference. Got the following error when trying to run through the vignette:

> path_to_file <- system.file("extdata", "geno3pops.lfmm", package = "pcadapt")
> filename <- read.pcadapt(path_to_file, type = "lfmm")
Error in .Call("_pcadapt_lfmm2pcadapt", PACKAGE = "pcadapt", input, output) : 
  "_pcadapt_lfmm2pcadapt" not available for .Call() for package "pcadapt"

Notably this only occurs with the Github version, not the CRAN package.

This similar error occurs when I try to import a test .vcf data set of my own which worked on the CRAN version.

Thanks

keurcien commented 6 years ago

Hi @devonorourke,

I'm not sure about what's going wrong here, as it works on my laptop with the GitHub version, with the same commands you provided. Can you tell me what OS you're running and what version of R you have ?

Anyway, we're actually trying to make a brand new version, which will be using .bed files, .pcadapt files or R matrices as input, avoiding us to implement these conversion functions which are less efficient than PLINK. Will keep you updated, that should come up pretty soon hopefully, by the end of 2017.

What's your thought on that by the way ? Are you familiar with PLINK and .bed files ? I find them very convenient as they are memory efficient (8 times smaller than a pcadapt file) and faster to access (binary vs ASCII). Not sure however about how popular PLINK is.

devonorourke commented 6 years ago

Tried this yesterday using the server on campus; thought that the compute cluster was possibly sending R library packages to my personal directory and perhaps splitting some dependencies in a shared R package folder for all users. This morning I tried installing on my laptop (local machine only, not connected to any server). Got the exact same error message with Github install.

I'm running

R version 3.4.2 (2017-09-28) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0.0 (64-bit)

And installing:

> devtools::install_github("bcm-uga/pcadapt")
Downloading GitHub repo bcm-uga/pcadapt@master
from URL https://api.github.com/repos/bcm-uga/pcadapt/zipball/master
Installing pcadapt
trying URL 'https://cran.rstudio.com/src/contrib/plotly_4.7.1.tar.gz'
Content type 'application/x-gzip' length 1034951 bytes (1010 KB)

PLINK files are fine and certainly clarify the separation of family/pedigree lists from the .vcf file type input. Nevertheless, I actually liked the ability to import directly from a .vcf because I was thinking of the various tools whereby I can filter and/or add annotations to the INFO field, with hopes that I could then use those annotations within your program to make more decisions about SNP outliers.

As one example, perhaps I wanted to see how quality filtering would impact my outlier detection within pcadapt - I could use snpSift (or vcftools, or bcftools, etc.) to do some filtering on the file, then re run the test within pcadapt. I'd be interested to know how sensitive the outlier detection is based on how strict my filtering criteria is (whatever it might be). If you move to a PLINK style input, I'm restricted to programs that filter a file based on that format or will have to recreate another PLINK file each time. Not a big deal, but one reason to keep an option in the .vcf file import available.

Another example would be to add annotations directly to the .vcf file and analyze the SNP outliers from pcdadapt using a program like snpEff. In this case I'd be interested in possible (a) filtering out, or (b) counting/sorting/tabling how many instances these outliers fall in things like:

zkamvar commented 6 years ago

Just as a second data point, I am also running OSX (but haven't upgraded to 10.13 yet) and I was unable to reproduce this error.

library("pcadapt")
path_to_file <- system.file("extdata", "geno3pops.lfmm", package = "pcadapt")
filename <- read.pcadapt(path_to_file, type = "lfmm")
#> Summary:
#> 
#>  - input file:               /Users/zhian/R/pcadapt/extdata/geno3pops.lfmm
#>  - output file:              /Users/zhian/R/pcadapt/extdata/geno3pops.pcadapt
#> 
#>  - number of individuals detected:   150
#>  - number of loci detected:      1500
Session info ``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R version 3.4.2 (2017-09-28) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> tz America/Chicago #> date 2017-11-21 #> Packages ----------------------------------------------------------------- #> package * version date source #> ape 5.0 2017-10-30 CRAN (R 3.4.2) #> assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0) #> backports 1.1.1 2017-09-25 CRAN (R 3.4.2) #> base * 3.4.2 2017-10-04 local #> bindr 0.1 2016-11-13 CRAN (R 3.4.0) #> bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0) #> cluster 2.0.6 2017-03-16 CRAN (R 3.4.0) #> colorspace 1.3-3 2017-08-16 R-Forge (R 3.4.1) #> compiler 3.4.2 2017-10-04 local #> data.table 1.10.4-3 2017-10-27 CRAN (R 3.4.2) #> datasets * 3.4.2 2017-10-04 local #> devtools 1.13.3 2017-08-02 CRAN (R 3.4.1) #> digest 0.6.12 2017-01-27 CRAN (R 3.4.0) #> dplyr 0.7.4 2017-09-28 CRAN (R 3.4.1) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1) #> ggplot2 2.2.1 2016-12-30 CRAN (R 3.4.0) #> glue 1.2.0 2017-10-29 CRAN (R 3.4.2) #> graphics * 3.4.2 2017-10-04 local #> grDevices * 3.4.2 2017-10-04 local #> grid 3.4.2 2017-10-04 local #> gtable 0.2.0 2016-02-26 CRAN (R 3.4.0) #> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0) #> htmlwidgets 0.9 2017-07-10 cran (@0.9) #> httr 1.3.1 2017-08-20 cran (@1.3.1) #> jsonlite 1.5 2017-06-01 CRAN (R 3.4.0) #> knitr 1.17 2017-08-10 cran (@1.17) #> lattice 0.20-35 2017-03-25 CRAN (R 3.4.0) #> lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2) #> magrittr 1.5 2014-11-22 CRAN (R 3.4.0) #> MASS 7.3-47 2017-04-21 CRAN (R 3.4.0) #> Matrix 1.2-11 2017-08-16 CRAN (R 3.4.1) #> memoise 1.1.0 2017-04-21 CRAN (R 3.4.0) #> methods * 3.4.2 2017-10-04 local #> mgcv 1.8-22 2017-09-19 CRAN (R 3.4.2) #> munsell 0.4.3 2016-02-13 CRAN (R 3.4.0) #> nlme 3.1-131 2017-02-06 CRAN (R 3.4.0) #> parallel 3.4.2 2017-10-04 local #> pcadapt * 3.1.0 2017-11-21 Github (bcm-uga/pcadapt@a64b2af) #> permute 0.9-4 2016-09-09 CRAN (R 3.4.0) #> pinfsc50 1.1.0 2016-12-02 CRAN (R 3.4.0) #> pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0) #> plotly 4.7.1 2017-07-29 CRAN (R 3.4.1) #> plyr 1.8.4 2016-06-08 CRAN (R 3.4.0) #> purrr 0.2.4 2017-10-18 cran (@0.2.4) #> R6 2.2.2 2017-06-17 cran (@2.2.2) #> Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2) #> RcppRoll 0.2.2 2015-04-05 CRAN (R 3.4.0) #> rlang 0.1.4 2017-11-05 CRAN (R 3.4.2) #> rmarkdown 1.7 2017-11-10 cran (@1.7) #> rprojroot 1.2 2017-01-16 CRAN (R 3.4.0) #> RSpectra 0.12-0 2016-06-12 cran (@0.12-0) #> scales 0.5.0.9000 2017-08-28 Github (hadley/scales@d767915) #> stats * 3.4.2 2017-10-04 local #> stringi 1.1.5 2017-04-07 CRAN (R 3.4.0) #> stringr 1.2.0 2017-02-18 CRAN (R 3.4.0) #> tibble 1.3.4 2017-08-22 cran (@1.3.4) #> tidyr 0.7.2 2017-10-16 CRAN (R 3.4.2) #> tools 3.4.2 2017-10-04 local #> utils * 3.4.2 2017-10-04 local #> vcfR 1.5.0 2017-05-18 CRAN (R 3.4.0) #> vegan 2.4-4 2017-08-24 cran (@2.4-4) #> viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.0) #> withr 2.1.0 2017-11-01 CRAN (R 3.4.2) #> yaml 2.1.14 2016-11-12 CRAN (R 3.4.0) ```
devonorourke commented 6 years ago

Really odd... Uninstalled everything, then:

  1. Installed from CRAN. I can run read.pcadapt by specifying my input .vcf file as type = "vcfR", but get an error if I specify type = "vcf". I can use that tmp.pcadapt file with pcadapt(input = ... K = ...) and generate an object with a List of 9. Interestingly, no positions.txt file is produced.

  2. Installed from Github. I can run read.pcadapt by specifying my input .vcf as type = "vcf", but get an error if I specify type = "vcfR". I can import the tmp.pcadapt file with pcadapt(input = ... K = ...) and generate an object with a List of 10. In this case, a positions.txt file is produced.

Not clear what's going on between versions here.

Question about the "positions.txt" file - is there a way to include position and chromosome/scaffold/contig if you input a .vcf file with more than one chromosome/scaffold/contig? It's find if you load data and don't get any errors/sites discarded, as the positions.txt file is just the order of the lines in a vcf file (which a little grep and cut pattern will sufficiently pull out). However, if you do get one or more sites discarded you can't necessarily determine what was thrown out so easily. It would be extremely helpful if the positions.txt file generated both of those bits of information (chromosome and base pair), rather than just base pairs.

Thanks!

keurcien commented 6 years ago

1 & 2. The CRAN version uses an old converter written in C that was faster...but not consistent and would raise errors quite often, because the parsing was mainly coded for a certain dataset (I don't remember which one it was, but probably 1000Genome). It was hard coded so some VCFs would not meet the requirements hard coded in the function, and that's why I completely gave up on this converter, and switched to vcfR which is now the current default (and only) option for VCF files. That's why you get an error with the GitHub version with option type = "vcfR" as it vcfR is called with type = "vcf". Though vcfR has quite a hard time with large files, when you have a VCF of size 2GB, I think it needs 4GB of RAM, so it's not really scalable. And I haven't found any satisfying alternative to it except using PLINK and basing the code on .bed files. But I will keep the dependency on vcfR as you suggested.

About the positions.txt file, I think that's a good idea, but will probably not output it as a file (trying to avoid writing too many files on disk), and store it in an object.

Anyway, thank you very much, I appreciate the time you spent for writing the suggestions to improve the package. I hope the new version will be more user friendly, by taking these into account.