biocpp / biocpp-io

BioC++ Input/Output library
https://biocpp.github.io
BSD 3-Clause "New" or "Revised" License

WIP tabix #49

Closed h-2 closed 2 years ago

h-2 commented 2 years ago

This PR adds Tabix support and indexed VCF reading.

All Tabix code currently lives in detail and will probably stay there for now.
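For readers unfamiliar with how a tabix index maps a genomic region to candidate file chunks, here is a minimal sketch of the binning function given in the SAM and tabix specifications (the default scheme with a 14-bit minimum shift and 5 levels). It is not taken from this PR; the name `reg2bin` follows the specification, and the choice of integer types is an assumption made for illustration.

```cpp
#include <cstdint>

// Illustrative sketch, not the PR's code: returns the smallest bin that fully
// contains the zero-based, half-open interval [beg, end), using the binning
// scheme from the SAM/tabix specifications (16 kbp bins at the deepest level).
constexpr std::uint32_t reg2bin(std::int64_t beg, std::int64_t end)
{
    --end; // make the end inclusive so both endpoints can be compared per level

    if (beg >> 14 == end >> 14) // both ends in the same 16 kbp bin
        return ((1u << 15) - 1) / 7 + static_cast<std::uint32_t>(beg >> 14);
    if (beg >> 17 == end >> 17) // same 128 kbp bin
        return ((1u << 12) - 1) / 7 + static_cast<std::uint32_t>(beg >> 17);
    if (beg >> 20 == end >> 20) // same 1 Mbp bin
        return ((1u << 9) - 1) / 7 + static_cast<std::uint32_t>(beg >> 20);
    if (beg >> 23 == end >> 23) // same 8 Mbp bin
        return ((1u << 6) - 1) / 7 + static_cast<std::uint32_t>(beg >> 23);
    if (beg >> 26 == end >> 26) // same 64 Mbp bin
        return ((1u << 3) - 1) / 7 + static_cast<std::uint32_t>(beg >> 26);

    return 0; // the root bin spans the whole 512 Mbp coordinate space
}
```

A region query then only needs to inspect the chunks listed for the bins that can overlap the region, instead of decompressing the file from the start.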

Some preliminary "benchmarks":

| | bio w/o index | bio w/ index | bcftools w/ index |
|---|---|---|---|
| filesystem inputs | 628,748 | 18,557 | 13,051 |
| time | 53.1s | 1.7s | 3.2s |

Due to architectural problems, I don't think we can ever get the number of I/O operations (filesystem inputs) as low as with htslib. Please see my comments in the PR. In practice, the results still seem OK, but this is just one example, in which I queried a region very close to the end of a 300MB compressed VCF. We definitely need to do more testing.
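As background on why an index cuts the number of reads so sharply for a region near the end of the file: a tabix index stores BGZF virtual file offsets, which combine the byte position of a compressed block with a position inside that block once decompressed. A reader can therefore seek straight to the relevant block. The sketch below shows the encoding as defined in the SAM specification; the struct and function names are hypothetical and not part of this PR or of htslib.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration only.
struct virtual_offset_parts
{
    std::uint64_t block_start;  // byte offset of the BGZF block in the compressed file
    std::uint16_t within_block; // byte offset inside the uncompressed block data
};

// Pack the two parts: the upper 48 bits hold the block start,
// the lower 16 bits hold the offset within the uncompressed block.
constexpr std::uint64_t make_virtual_offset(std::uint64_t block_start, std::uint16_t within_block)
{
    return (block_start << 16) | within_block;
}

// Unpack a virtual offset into the seek position and the in-block position.
constexpr virtual_offset_parts split_virtual_offset(std::uint64_t virtual_offset)
{
    return {virtual_offset >> 16, static_cast<std::uint16_t>(virtual_offset & 0xFFFF)};
}
```

With such an offset, a reader seeks to `block_start` in the compressed file, decompresses that single block, and skips `within_block` bytes before parsing records.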

TODO

Irallia commented 2 years ago

Do you know about the BAMIntervalTree created by @joshuak94 for indexing BAM files in another way?

h-2 commented 2 years ago

> Do you know about the BAMIntervalTree created by @joshuak94 for indexing BAM files in another way?

Thanks for the pointer! I talked to him about it. But even if we want to support that, we also need to be able to handle the regular indexes.

h-2 commented 2 years ago

It doesn't yet contain all the tests I would have liked, but this is as much as I can currently do for this feature.