fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License

Speed up input BCF validation #43

Open fedarko opened 2 years ago

fedarko commented 2 years ago

Adding the "is BCF simple" checks increased the time to verify the BCF to ~15.32 seconds on Bloom (edit: ok, around 12-16 seconds, maybe), up from ~0.24 seconds a few days ago, before I added all those checks.

In my view, this tradeoff is 100% worth it -- better slow and correct than fast and wrong. But it'd be nice to speed things up.

Some ideas:

  1. If the input BCF was produced by strainFlye, just take a leap of faith and assume it's OK (only use strict validation on outside inputs)
  2. Depending on how many contigs there are in the dataset, parallelize the checks across contigs
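
For idea 2, here is a rough sketch of what per-contig parallelization could look like. It is not strainFlye's actual validation code: the record layout `(contig, pos, ref, alts)`, the function names `check_contig` and `validate_parallel`, and the two "simple" criteria (exactly one alternate allele, single-nucleotide ref and alt) are all assumptions made for illustration. Records are plain tuples here so the example runs without reading a real BCF.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def check_contig(item):
    """Check all records of one contig; return a list of error strings.

    ASSUMPTION: "simple" means one alt allele and a single-nucleotide
    substitution. The real checks may differ.
    """
    contig, records = item
    errors = []
    for pos, ref, alts in records:
        if len(alts) != 1:
            errors.append(f"{contig}:{pos} has {len(alts)} alternate alleles")
        elif len(ref) != 1 or len(alts[0]) != 1:
            errors.append(
                f"{contig}:{pos} is not a single-nucleotide substitution"
            )
    return errors


def validate_parallel(records, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Group records by contig, then check each contig in its own task.

    executor_cls is a hypothetical knob: processes for CPU-bound checks,
    threads for quick tests.
    """
    by_contig = defaultdict(list)
    for contig, pos, ref, alts in records:
        by_contig[contig].append((pos, ref, alts))

    with executor_cls(max_workers=max_workers) as executor:
        # map() yields results in the same order as the contigs were submitted
        results = executor.map(check_contig, by_contig.items())
        return [err for contig_errors in results for err in contig_errors]


if __name__ == "__main__":
    demo = [
        ("c1", 5, "A", ("T",)),
        ("c1", 9, "AC", ("A",)),
        ("c2", 3, "G", ("A", "C")),
    ]
    for problem in validate_parallel(demo):
        print(problem)
```

With a real BCF, each worker would probably need to open the file itself and read only its own contig, which generally requires an index. And if there are only a handful of contigs, the process startup overhead might outweigh the gain, so this would need benchmarking.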

... and there are probably other ideas that would also work.