FINNGEN / autoreporting

MIT License

Change annotation to a method that is deterministic in runtime, but slower for small inputs #162

Closed Lipastomies closed 3 years ago

Lipastomies commented 3 years ago

tl;dr: instead of querying the annotation files with tabix once per variant, go through each file row by row and pick the variants that are in our df. This is slow-ish, about 1-20 minutes per annotation file, but nowhere near the 12+ hours it takes to tabix a file millions of times.
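For reviewers, here is a minimal sketch of the row-by-row idea described above. It is not the code in this PR. It assumes a tab-delimited (optionally gzip/bgzip-compressed) annotation file with a header row, and the function name `iter_matching_rows` and the column names `chrom`, `pos`, `ref`, `alt` are hypothetical placeholders.

```python
import csv
import gzip
from typing import Dict, Iterable, Iterator, Set, Tuple


def iter_matching_rows(
    annotation_path: str,
    wanted: Set[Tuple[str, ...]],
    key_columns: Iterable[str] = ("chrom", "pos", "ref", "alt"),  # assumed column names
) -> Iterator[Dict[str, str]]:
    """Read the annotation file once and yield only the rows whose key is in `wanted`.

    `wanted` holds tuples of strings, because csv returns every field as a string.
    """
    key_columns = tuple(key_columns)
    # bgzip output is valid gzip, so gzip.open can stream it sequentially.
    opener = gzip.open if annotation_path.endswith((".gz", ".bgz")) else open
    with opener(annotation_path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            key = tuple(row[column] for column in key_columns)
            # Set membership is O(1) on average, so the total cost is one pass
            # over the file regardless of how many variants we are looking for.
            if key in wanted:
                yield row
```

The cost is roughly one full read of the file no matter how many variants are in the df, which is why the runtime is predictable for large inputs and wasteful for small ones.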

I could perhaps add a threshold for switching between the two methods (tabix for small inputs, full annotation traversal for large inputs), but that would complicate things even more. As it is now, this makes runtime more predictable than the tabix method. With tabix, most of the reports finished in a decent amount of time (under an hour), but some went on for almost 24 hours and got pre-empted a few times. Running the autoreporting pipeline was like playing the lottery: you could win the jackpot and everything would go smoothly, but more often than not things would be slow.

Fixes #129

Lipastomies commented 3 years ago

This is how it was run for R6. Merging this, since it at least works.