tl;dr: instead of tabixing rows from annotation files, go through them row by row and pick the variants that are in our df.
Slow-ish, since it takes about 1-20 mins per annotation file, but nowhere near the 12+ hrs for tabixing a file millions of times.
I could perhaps add a threshold for switching between the two methods (tabix for small inputs, full annotation traversal for large ones), but that would complicate things even more. As it is, this makes runtimes more deterministic than the tabix method: with tabix, most reports finished in a decent amount of time, under an hour, but some ran for almost 24 hours and got pre-empted a few times. Running the autoreporting pipeline was like playing the lottery: you could win the jackpot and everything would go smoothly, but more often than not things were slow.
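A minimal sketch of the row-by-row filtering idea described above. The function name, column layout, and variant-id format here are assumptions for illustration; the real annotation files would be opened with e.g. `gzip.open` and the variant key built from whichever columns the pipeline actually uses:

```python
import io

def filter_annotations(annotation_handle, wanted_variants, sep="\t"):
    """Stream an annotation file once, keeping only rows whose variant id
    (assumed here to be the first column) is in wanted_variants.

    wanted_variants should be a set for O(1) membership checks, built from
    the variants in our df. This is one linear pass over the file, instead
    of one tabix query per variant.
    """
    header = next(annotation_handle).rstrip("\n")
    kept = [header]
    for line in annotation_handle:
        variant = line.split(sep, 1)[0]
        if variant in wanted_variants:
            kept.append(line.rstrip("\n"))
    return kept

# Usage with a small in-memory example (hypothetical column layout):
data = "variant\tconsequence\n1:100:A:T\tmissense\n1:200:G:C\tsynonymous\n"
wanted = {"1:100:A:T"}
rows = filter_annotations(io.StringIO(data), wanted)
```

Since membership testing against a set is constant-time, the cost is dominated by reading the file once, which matches the roughly fixed 1-20 min per-file runtime regardless of how many variants are requested.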
Fixes #129