NESCent / r-popgen-hackathon

Population Genetics Hackathon, to be held at NESCent on March 16-20, 2015

better support for VCF files #14

Open emmanuelparadis opened 9 years ago

emmanuelparadis commented 9 years ago

I already mentioned this in the wiki, but I can develop it a bit more here. It seems that VCF will be the format of the future for pop gen. Several packages offer support for VCF files but, IMHO, they are limited and not well integrated with other packages (I exclude the BioConductor package for reading variant annotations, which has a different objective). pegas has read.vcf, but its performance is limited. I propose to improve the situation. These last days, I've played with the 1000 Genomes data for chr. Y (62,042 loci). I tried to write code that scans the whole file and finds which loci are "real" SNPs (i.e., loci with only 2 alleles differing by a single substitution): this takes ~3 sec. It would then be possible to read some loci selectively (read.vcf can do that already). The approach is to use only R code to avoid relying on external libraries (it will probably be necessary to write some code in C at some stage). For instance, the code to find SNPs is ~20 lines.
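The kind of scan described above can be sketched in a few lines of base R. This is a hypothetical illustration, not the actual ~20-line implementation; the helper name `find_snps` and the tiny in-memory VCF are assumptions made so the example is self-contained.

```r
# Sketch of a pure-R scan for "real" SNPs: exactly two alleles,
# each a single base. A tiny VCF is written to a temp file so the
# example runs on its own.
vcf_lines <- c(
  "##fileformat=VCFv4.1",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "Y\t100\t.\tA\tG\t50\tPASS\t.",   # biallelic SNP
  "Y\t200\t.\tAT\tA\t50\tPASS\t.",  # indel: not a SNP
  "Y\t300\t.\tC\tT,G\t50\tPASS\t."  # triallelic: not a "real" SNP
)
tf <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, tf)

find_snps <- function(file) {
  x <- readLines(file)
  x <- x[!startsWith(x, "#")]               # drop meta-info and header lines
  fields <- strsplit(x, "\t", fixed = TRUE)
  ref <- vapply(fields, `[`, "", 4)
  alt <- vapply(fields, `[`, "", 5)
  # a "real" SNP: REF and ALT are both a single base (so only 2 alleles)
  nchar(ref) == 1 & nchar(alt) == 1 & alt != "."
}
find_snps(tf)  # TRUE FALSE FALSE
```

The logical vector returned could then be used to read only the selected loci.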

An issue is the file size we should expect. Are VCF files from pop-gen studies as big as those output by the 1000 Genomes project? Currently, handling files with tens of thousands of loci is not a problem, and I believe it should be relatively easy to reach hundreds of thousands. Should we expect to handle files/data with 10^6 loci?

jgx65 commented 9 years ago

Agreed with you, Emmanuel: VCF is the way to go, but we will need parsers from and to other formats, including allele frequencies. I think we should work on being able to handle 10^6 loci; 10^4 is routine and 10^5 already quite common. Perhaps playing with chr1 of the 1000 Genomes project rather than chrY would put us in the right ballpark?
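One of the conversions mentioned above, genotypes to allele frequencies, is simple once the data sit in a matrix of allele dosages. A minimal base-R sketch (the object names are illustrative, not from any package):

```r
# Allele frequencies from a matrix of ALT-allele dosages (0/1/2 for
# diploids), individuals in rows and SNPs in columns.
geno <- matrix(c(0, 1, 2,
                 1, 1, 0,
                 2, 2, 2), nrow = 3, byrow = TRUE,
               dimnames = list(paste0("ind", 1:3), paste0("snp", 1:3)))
# each diploid individual carries 2 alleles, hence the division by 2
alt_freq <- colMeans(geno, na.rm = TRUE) / 2
alt_freq  # snp1 = 0.5, snp2 and snp3 ~ 0.667
```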

thibautjombart commented 9 years ago

Sounds exciting! I agree VCF seems to be the way to go. 10^5 SNPs x 100s of individuals also sounds more like the kind of dataset I am used to. I think we can assume that anything substantially bigger, e.g. large collections of full eukaryote genomes, will be handled on computer clusters and poses a different problem. Cheers Thibaut


grunwald commented 9 years ago

I am totally on board. We need good vcf tools for larger GBS data sets that handle haploid, diploid, or polyploid SNPs, the ability to do genome scans with varying window sizes for Fst, pi, linkage, etc., and gene content obtained from GFF. I like the idea of using moderate data matrices rather than whole genomes. Will this be based on the genlight object? The ability to parse to loci and genind objects is important.

One issue we are struggling with is that, since each individual is likely to be a new multilocus genotype (MLG), we need tools for defining MLG boundaries. How many different SNPs does it take to be a different genotype? Once we scale to 1,000s of SNPs, an MLG is very different from what we are used to in traditional data such as SSRs. This is important for my group, as we need to understand whether there are clones: how many SNP differences make it another clone, etc.
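The threshold question above can be sketched as collapsing individuals whose pairwise SNP distance falls below a cutoff. This is only an illustration of the idea (real implementations exist elsewhere); the data and the single-linkage choice are assumptions.

```r
# Sketch: define MLGs by cutting a single-linkage tree at a chosen
# number of SNP differences. Rows are individuals, columns are SNP
# dosages; Manhattan distance counts dosage differences.
geno <- matrix(c(0, 1, 1, 0,
                 0, 1, 1, 0,   # identical to ind1: same clone
                 0, 1, 0, 0,   # 1 difference from ind1/ind2
                 2, 2, 2, 2),  # clearly distinct
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("ind", 1:4), NULL))
d <- dist(geno, method = "manhattan")
threshold <- 1  # "how many SNP differences make another clone?"
mlg <- cutree(hclust(d, method = "single"), h = threshold)
mlg  # ind1..ind3 share one MLG; ind4 is its own
```

Raising or lowering `threshold` directly answers the "how many SNP differences" question for a given dataset.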

I like the idea of starting with 10^6 loci using chr1 of 1000 genomes as a starting point.

Cheers, Nik


emmanuelparadis commented 9 years ago

I switched to chr22 of the 1000 Genomes data: it already has 1,103,537 loci and is 214 Mb compressed (10.5 Gb uncompressed, but there is no need to uncompress it to read it, thanks to R's binary connections... nice stuff, btw). It is possible to read it in chunks of 1 Gb, scan the chunks for the loci positions, and output the positions of all 1,103,537 loci. This takes ~1 min (and < 8 Gb of RAM).
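The chunked approach can be sketched with a compressed connection. This simplified version uses `readLines` on a tiny gzipped file with a tiny chunk size; the actual approach described above works on raw 1 Gb binary chunks, but the idea, streaming through the compressed file and collecting POS, is the same.

```r
# Write a small gzipped VCF so the example is self-contained.
tf <- tempfile(fileext = ".vcf.gz")
con <- gzfile(tf, "w")
writeLines(c("##fileformat=VCFv4.1",
             "#CHROM\tPOS\tID\tREF\tALT",
             sprintf("22\t%d\t.\tA\tG", c(100, 250, 900))), con)
close(con)

# Stream through the compressed file in chunks, collecting POS.
pos <- integer(0)
con <- gzfile(tf, "r")            # no need to uncompress on disk
repeat {
  chunk <- readLines(con, n = 2)  # in practice the chunk is much larger
  if (length(chunk) == 0) break
  chunk <- chunk[!startsWith(chunk, "#")]
  if (length(chunk))
    pos <- c(pos, as.integer(sub("^\\S+\t(\\d+)\t.*", "\\1", chunk)))
}
close(con)
pos  # 100 250 900
```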

The idea is to provide functions to get this kind of information on the loci (their positions, their alleles, whether they are SNPs, ...) so that the user may select which ones to read.

pegas::read.vcf returns a "loci" object. Currently, it can quite easily read a few thousand loci.

zkamvar commented 9 years ago

One question I have is: are we considering VCF files pre- or post-filtering? I believe it would be worthwhile to consider pre-filtering data, as it would get us away from the black-box filtering methods available now.

emmanuelparadis commented 9 years ago

Other fields in the VCF file can be scanned in the same way as POS. read.vcf already returns QUAL and FILTER as additional attributes.

knausb commented 9 years ago

I'm not attending this event, but colleagues in my lab who are planning to attend brought this issue to my attention. I've been working on an R package to visualize and filter VCF files. Originally I wrote it all in R, then realized its performance wouldn't scale well. I've subsequently been learning Rcpp and have been working on improving its performance. The package has a vignette with a full working example. It can be installed with:

devtools::install_github(repo="knausb/vcfR", build_vignettes=TRUE)

And you can browse its vignette with:

browseVignettes(package="vcfR")

It's under active development, so things may change in the future. I'd be interested in any feedback people may have.