Closed fedarko closed 2 years ago
The easiest way to handle arbitrary BCF files is adding a separate parsing function analogous to parse_bcf(), intended for use with analyses "after" FDR estimation/fixing (hotspot, coldspot, smooth, link graph, covskew, matrix).
This new function should take in the BCF and somehow limit our focus within it to single-nucleotide, single-allelic mutations, so that this BCF can be used as a drop-in replacement in the downstream steps. Maybe this means filtering out indels / other stuff; maybe it just means raising an error if these "complex" types of mutations exist. Or maybe we add a utility function to perform a one-time conversion of arbitrary BCF files to those accepted by strainFlye (filtering "complex" mutations), and then ask users to supply the resulting BCF for downstream analyses.
In any case, this should hopefully be a one-time cost, so that I won't have to add a custom check for each downstream analysis. There may be parts of the "filter" or check that aren't perfect, so adjust docs to be clear that at the moment we only support a subset of types of mutations.
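A minimal sketch of the "filter or raise an error" idea above, not the actual strainFlye implementation. The names Variant, is_simple_snv() and keep_simple_snvs() are hypothetical, and the record fields (contig, pos, ref, alts) are assumed to mirror what a VCF/BCF reader typically exposes. Here "simple" is taken to mean one single-nucleotide REF allele and exactly one single-nucleotide ALT allele.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

NUCLEOTIDES = frozenset("ACGT")


class Variant(NamedTuple):
    """Hypothetical stand-in for one BCF record."""
    contig: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def is_simple_snv(v: Variant) -> bool:
    """True for single-nucleotide, single-allelic mutations only."""
    if len(v.alts) != 1:
        # Zero or multiple ALT alleles: not single-allelic.
        return False
    ref, alt = v.ref.upper(), v.alts[0].upper()
    # Multi-character alleles (indels, symbolic alleles like <DEL>)
    # are not members of NUCLEOTIDES, so they fail this check.
    return ref in NUCLEOTIDES and alt in NUCLEOTIDES and ref != alt


def keep_simple_snvs(
    variants: Iterable[Variant], strict: bool = False
) -> Iterator[Variant]:
    """Yield only simple SNVs.

    With strict=False, "complex" mutations are silently filtered out.
    With strict=True, the first "complex" mutation raises a ValueError.
    """
    for v in variants:
        if is_simple_snv(v):
            yield v
        elif strict:
            raise ValueError(
                f"Unsupported mutation at {v.contig}:{v.pos} "
                f"(REF={v.ref!r}, ALT={v.alts!r})"
            )


if __name__ == "__main__":
    records = [
        Variant("contig1", 10, "A", ("G",)),      # simple SNV
        Variant("contig1", 20, "AT", ("A",)),     # deletion
        Variant("contig1", 30, "C", ("T", "G")),  # multi-allelic
    ]
    for v in keep_simple_snvs(records):
        print(v.contig, v.pos, v.ref, v.alts[0])
```

Running the check once at parse time (or in a one-time conversion utility) is what keeps it a one-time cost: the downstream analyses can then assume every record is a simple SNV.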
OK, split parse_bcf() into parse_sf_bcf() and parse_arbitrary_bcf(). Need to just add various sanity checks to parse_arbitrary_bcf(), and test, and then we should be good to go.
Currently, the "downstream" analyses (e.g. hotspot detection) use bcf_utils.parse_bcf(), which will fail if the input BCF file wasn't produced by strainFlye call (because parse_bcf() raises an error if it doesn't see strainFlye-specific info in the BCF file).
If desired, we could try to get the downstream analyses to make use of arbitrary BCF files. Some challenges with this, however:
This is definitely possible, but to keep the scope of this reasonable I think it makes sense to stay strict for now. Better inconvenient and correct than convenient and wrong.
In any case, the tutorial and docs should be updated to clarify this situation.
Analyses that should be kept to strainFlye-created BCF files
fdr estimate
Analyses that could be generalized to arbitrary BCF files
spot hot-features
spot cold-gaps
smooth apply
link-graph compute