bigdatagenomics / avocado

A Variant Caller, Distributed. Apache 2 licensed.
http://bdgenomics.org/projects/avocado/

Calling with Avocado using the "hive" range partitioned data #283

Open jpdna opened 6 years ago

jpdna commented 6 years ago

Hi @fnothaft - I'd like to demonstrate joint calling of genotypes using Avocado for specific genomic regions, using the binned "hive-style" partitioned data. Inputs: 1) gVCF files for 10+ to 100s of samples, saved as bin range-partitioned ADAM Parquet datasets; 2) BAM files saved as ADAM bin-partitioned datasets.

The application I have in mind is on-the-fly recalling of specific regions: when new samples are added, a set of candidate regions needs to be re-examined in near real time. This would include a feature allowing the user to provide a BED file of regions to call, as GenotypeGVCFs allows in GATK/HaplotypeCaller.

My plan is to enable Avocado to load partitioned data from my ADAM "hive" binned dataset branch. With that in place I think it will just work, and I'll measure performance. Let me know if you have suggestions or comments about the usefulness of this.
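
To make the region-restricted loading idea concrete, here is a minimal sketch in plain Python (no Spark or ADAM dependency) of how BED intervals could be mapped to the bins that would need to be read. Everything specific in it is an assumption for illustration: the 1,000,000 bp bin width, the `contigName` and `positionBin` partition keys, the directory layout, and the helper names `parse_bed`, `bins_for_regions`, and `partition_paths`. Only the `key=value` directory naming is the general hive-style convention.

```python
from collections import defaultdict

# Assumed bin width in base pairs; the real dataset may use a different value.
BIN_SIZE = 1_000_000


def parse_bed(lines):
    """Parse BED lines into (contig, start, end) tuples.

    BED coordinates are 0-based and half-open: [start, end).
    """
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        regions.append((contig, int(start), int(end)))
    return regions


def bins_for_regions(regions, bin_size=BIN_SIZE):
    """Map each contig to the sorted bins overlapped by at least one region."""
    bins = defaultdict(set)
    for contig, start, end in regions:
        if end <= start:
            continue  # ignore empty or inverted intervals
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size  # end is exclusive
        bins[contig].update(range(first_bin, last_bin + 1))
    return {contig: sorted(b) for contig, b in bins.items()}


def partition_paths(root, bins_by_contig):
    """Build hive-style partition directories (hypothetical key names)."""
    return [
        f"{root}/contigName={contig}/positionBin={b}"
        for contig in sorted(bins_by_contig)
        for b in bins_by_contig[contig]
    ]
```

For example, a region spanning a bin boundary, such as `chr1 999990 1000010`, would touch bins 0 and 1 on `chr1`, so only those two partition directories would need to be scanned rather than the whole dataset. Records read from those bins would still need a final overlap filter against the exact BED intervals.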