broadinstitute / gamgee

A C++14 library for NGS data formats
http://broadinstitute.github.io/gamgee/
MIT License

Account for the several ways to read variant files in htslib #222

Open jmthibault79 opened 10 years ago

jmthibault79 commented 10 years ago

There are at least three ways to read variant files in htslib: indexed, unindexed, and synced. Each has advantages and drawbacks. Enable all of them through the Variant Reader/Iterators, and document when each one is appropriate.

Where multiple options are available, run benchmarks to determine which is best.
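
A minimal sketch of how such a benchmark could be structured, assuming each reading strategy can be wrapped in a callable that iterates over the whole file and returns the number of records it saw. The names `BenchResult` and `time_strategy` are hypothetical and are not part of gamgee or htslib.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Outcome of timing one reading strategy (hypothetical helper type).
struct BenchResult {
  std::string name;      // label of the strategy, e.g. "unindexed"
  double best_seconds;   // fastest wall-clock time over the repeated runs
  std::size_t records;   // record count reported by the last run
};

// Runs `read_all` `repeats` times and keeps the fastest wall-clock time.
// `read_all` stands in for "open the file with one reader, iterate to the
// end, return how many records were read". With repeats <= 0 nothing is
// run and best_seconds stays at -1.
inline BenchResult time_strategy(const std::string& name,
                                 const std::function<std::size_t()>& read_all,
                                 int repeats = 3) {
  using clock = std::chrono::steady_clock;
  double best = -1.0;
  std::size_t records = 0;
  for (int i = 0; i < repeats; ++i) {
    const auto start = clock::now();
    records = read_all();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    if (best < 0.0 || elapsed.count() < best) best = elapsed.count();
  }
  return BenchResult{name, best, records};
}
```

Each strategy would be passed in as a lambda that does the actual reading. Comparing `records` across strategies is a cheap sanity check that they all visited the same data before their timings are compared.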

jmthibault79 commented 10 years ago

Also make sure there are tests for all of the valid combinations.

jmthibault79 commented 10 years ago
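| | VariantReader | IndexedVR | MultipleVR | SyncedVR |
|---|---|---|---|---|
| VCF | YES | no | YES | no |
| VCF GZ | YES | no | YES | YES |
| BCF | YES | YES | YES | YES |
| Single | YES | YES | YES | YES |
| Multiple | no | no | YES | YES |
| Index | no | YES | no | YES |
| Interval | no | YES | no | YES |
| Requires Index | no | YES | no | YES |
| Requires Interval | no | no | no | no (#236) |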
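
For the tests and the documentation, the table above could be encoded as data and queried. The following is only a sketch: the `Format`, `ReaderCaps`, and `candidate_readers` names are hypothetical, the reader labels are copied from the table header, and the rows are transcribed from the table as it stands in this comment.

```cpp
#include <string>
#include <vector>

// File formats from the rows of the table (hypothetical enum).
enum class Format { VCF, VCF_GZ, BCF };

// One column of the table, transcribed as a capability record.
struct ReaderCaps {
  const char* name;
  bool vcf, vcf_gz, bcf;   // supported formats
  bool multiple_files;     // "Multiple" row (every reader handles "Single")
  bool intervals;          // "Interval" row
  bool requires_index;     // "Requires Index" row
};

static const ReaderCaps kReaders[] = {
  // name             vcf    vcf_gz bcf   multi  intvl  needs index
  {"VariantReader",   true,  true,  true, false, false, false},
  {"IndexedVR",       false, false, true, false, true,  true},
  {"MultipleVR",      true,  true,  true, true,  false, false},
  {"SyncedVR",        false, true,  true, true,  true,  true},
};

// Returns the names of the readers that, according to the table, can
// serve a request with the given format and requirements.
inline std::vector<std::string> candidate_readers(Format format,
                                                  bool multiple_files,
                                                  bool need_intervals,
                                                  bool have_index) {
  std::vector<std::string> out;
  for (const ReaderCaps& r : kReaders) {
    const bool format_ok = (format == Format::VCF    && r.vcf) ||
                           (format == Format::VCF_GZ && r.vcf_gz) ||
                           (format == Format::BCF    && r.bcf);
    if (!format_ok) continue;
    if (multiple_files && !r.multiple_files) continue;
    if (need_intervals && !r.intervals) continue;
    if (r.requires_index && !have_index) continue;
    out.push_back(r.name);
  }
  return out;
}
```

Looping over every combination of the four arguments would enumerate the valid cases to test. Combinations that return an empty list, such as intervals on a plain VCF, show where functionality is missing.
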
jmthibault79 commented 10 years ago

Missing functionality:

Unknown/untested:

jmthibault79 commented 10 years ago

SyncedVariantReader works with no intervals after #236

MauricioCarneiro commented 10 years ago

Added ticket #237 for the single indexed BCF question.
Added ticket #239 for intervals with VCF files.
Added ticket #240 for intervals with unindexed files (related because htslib doesn't handle VCF indices).

I would ignore BCF GZ; that's an abomination, because BCFs can be intrinsically gzipped.