fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License
8 stars 1 forks source link

Sort out whether or not we support arbitrary BCF files for downstream steps #30

Closed fedarko closed 2 years ago

fedarko commented 2 years ago

Currently, the "downstream" analyses (e.g. hotspot detection) use bcf_utils.parse_bcf(), which will fail if the input BCF file wasn't produced by strainFlye call (because parse_bcf() raises an error if it doesn't see strainFlye-specific info in the BCF file).

If desired, we could try to get the downstream analyses to make use of arbitrary BCF files. Some challenges with this, however:

  1. We'd need to raise an error upon seeing unsupported types of mutations (multi-allelic mutations [#28], indels, ...) That, or just avoid these mutations for various purposes.
  2. The BCF files output by strainFlye include clear "custom" info fields on information on mismatch-based coverage and alternate nucleotide coverage. If we don't have these available, this may cause problems with some things (not sure at this point of any concrete examples outside of FDR estimation/fixing tho).

This is definitely possible, but to keep the scope of this reasonable I think it makes sense to stay strict for now. Better inconvenient and correct than convenient and wrong.

In any case, the tutorial and docs should be updated to clarify this situation.

Analyses that should be kept to strainFlye-created BCF files

Analyses that could be generalized to arbitrary BCF files

fedarko commented 2 years ago

The easiest way to handle arbitrary BCF files is adding a separate parsing function analogous to parse_bcf(), intended for use with analyses "after" FDR estimation/fixing (hotspot, coldspot, smooth, link graph, covskew, matrix).

This new function should take in the BCF and somehow limit our focus within it to single-nucleotide, single-allelic mutations, so that this BCF can be used as a drop-in replacement in the downstream steps. Maybe this means filtering out indels / other stuff; maybe it just means raising an error if these "complex" types of mutations exist. Or maybe we add a utility function to perform a one-time conversion of arbitrary BCF files to those accepted by strainFlye (filtering "complex" mutations), and then ask users to supply the resulting BCF for downstream analyses.

In any case, this should hopefully be a one-time cost, so that I won't have to add a custom check for each downstream analysis. There may be parts of the "filter" or check that aren't perfect, so adjust docs to be clear that at the moment we only support a subset of types of mutations.

fedarko commented 2 years ago

OK, split parse_bcf() into parse_sf_bcf() and parse_arbitrary_bcf(). Need to just add various sanity checks to parse_arbitrary_bcf(), and test, and then we should be good to go.