Sort out whether or not we support arbitrary BCF files for downstream steps

fedarko commented 2 years ago

Currently, the "downstream" analyses (e.g. hotspot detection) use bcf_utils.parse_bcf(), which will fail if the input BCF file wasn't produced by strainFlye call (because parse_bcf() raises an error if it doesn't see strainFlye-specific info in the BCF file).

If desired, we could try to get the downstream analyses to make use of arbitrary BCF files. Some challenges with this, however:

We'd need to raise an error upon seeing unsupported types of mutations (multi-allelic mutations [#28], indels, ...) That, or just avoid these mutations for various purposes.
The BCF files output by strainFlye include clear "custom" info fields on information on mismatch-based coverage and alternate nucleotide coverage. If we don't have these available, this may cause problems with some things (not sure at this point of any concrete examples outside of FDR estimation/fixing tho).

This is definitely possible, but to keep the scope of this reasonable I think it makes sense to stay strict for now. Better inconvenient and correct than convenient and wrong.

In any case, the tutorial and docs should be updated to clarify this situation.

Analyses that should be kept to strainFlye-created BCF files

fdr estimate
- (Requires that the BCF uses p- or r-mutations)

Analyses that could be generalized to arbitrary BCF files

spot hot-features
- As long as each mutation takes up a single position (and as long as there aren't multi-allelic mutations), this should be fine.
spot cold-gaps
- As long as each mutation takes up a single position (and as long as there aren't multi-allelic mutations), this should be fine.
smooth apply
- How do we decide what linear alignments to ignore? For strainFlye-generated mutations, we know that the mutation at a position will always be thought of as from the most common nucleotide into the second most common nucleotide. So we can ignore alignments at which a mutated position has the third or fourth most common nucleotide.
- But for arbitrary mutations, we don't have this guarantee. So we need to rely on the REF and ALT stuff. Actually... yeah, this works perfectly, even for unreasonable positions where the contig nt at a position != consensus, since strainflye will set REF to the consensus. nice
link-graph compute

fedarko commented 2 years ago

The easiest way to handle arbitrary BCF files is adding a separate parsing function analogous to parse_bcf(), intended for use with analyses "after" FDR estimation/fixing (hotspot, coldspot, smooth, link graph, covskew, matrix).

This new function should take in the BCF and somehow limit our focus within it to single-nucleotide, single-allelic mutations, so that this BCF can be used as a drop-in replacement in the downstream steps. Maybe this means filtering out indels / other stuff; maybe it just means raising an error if these "complex" types of mutations exist. Or maybe we add a utility function to perform a one-time conversion of arbitrary BCF files to those accepted by strainFlye (filtering "complex" mutations), and then ask users to supply the resulting BCF for downstream analyses.

In any case, this should hopefully be a one-time cost, so that I won't have to add a custom check for each downstream analysis. There may be parts of the "filter" or check that aren't perfect, so adjust docs to be clear that at the moment we only support a subset of types of mutations.

fedarko commented 2 years ago

OK, split parse_bcf() into parse_sf_bcf() and parse_arbitrary_bcf(). Need to just add various sanity checks to parse_arbitrary_bcf(), and test, and then we should be good to go.

fedarko / strainFlye

Sort out whether or not we support arbitrary BCF files for downstream steps #30

Analyses that should be kept to strainFlye-created BCF files

Analyses that could be generalized to arbitrary BCF files