broadinstitute / viral-pipelines

viral-ngs: complete pipelines

scaffolding and reference selection based on ANI #528

Closed dpark01 closed 5 months ago

dpark01 commented 5 months ago

This PR introduces ANI-based mechanisms for reference selection using skani.

This updated version of scaffold_and_refine_multitaxa should be far more efficient at metagenomic reference selection, produce less noisy outputs (fewer secondary taxon hits), still allow for diverse coinfections of unrelated taxa, and perform well with a very large reference genome database as input.
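As a rough illustration of how ANI-based selection can reduce secondary hits while still permitting unrelated coinfections, here is a minimal sketch. It is not the code in this PR: the function `select_references`, the `cluster_ani` threshold of 90, and the example identifiers and ANI values are all hypothetical, and the ANI values are assumed to have been computed beforehand (for example by skani).

```python
from typing import Dict, List, Tuple


def select_references(
    hits: Dict[str, float],
    ref_ani: Dict[Tuple[str, str], float],
    cluster_ani: float = 90.0,  # hypothetical threshold, not taken from the PR
) -> List[str]:
    """Greedily keep the best-matching reference per group of near-identical references.

    hits:    reference name -> ANI between the sample's contigs and that reference
    ref_ani: unordered pair of reference names -> ANI between the two references
             (pairs that are absent are treated as unrelated, ANI 0)
    """

    def ani(a: str, b: str) -> float:
        return ref_ani.get((a, b), ref_ani.get((b, a), 0.0))

    chosen: List[str] = []
    # Visit references from best to worst hit; ties are broken by name for determinism.
    for ref in sorted(hits, key=lambda r: (-hits[r], r)):
        # Skip a reference that is too similar to one already chosen
        # (a likely secondary hit to the same taxon).
        if all(ani(ref, kept) < cluster_ani for kept in chosen):
            chosen.append(ref)
    return chosen


# Made-up example: two closely related references plus one unrelated reference.
hits = {"rhinoC_1": 97.5, "rhinoC_2": 95.0, "sars_cov_2": 99.8}
ref_ani = {("rhinoC_1", "rhinoC_2"): 92.0}
selected = select_references(hits, ref_ani)
print(selected)
```

In this toy input the two related references collapse to the better-scoring one, while the unrelated reference is kept, which is the behavior described above for coinfections of unrelated taxa.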

This PR also includes an unrelated change: the terra_tsv_to_table workflow now accepts and concatenates multiple input TSV files.
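For readers unfamiliar with what concatenating several TSV inputs involves, the sketch below shows one way to do it in plain Python. It is only an assumption about reasonable behavior, not the workflow's actual implementation: the helper `concat_tsvs`, the choice to take the union of columns, and the example column names and values are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def concat_tsvs(texts: Iterable[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Concatenate TSV documents, taking the union of their columns.

    Columns keep the order in which they are first seen; cells that a given
    file does not provide are filled with the empty string.
    """
    header: List[str] = []
    raw_rows: List[Dict[str, str]] = []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for col in reader.fieldnames or []:
            if col not in header:
                header.append(col)
        raw_rows.extend(reader)
    rows = [{col: (row.get(col) or "") for col in header} for row in raw_rows]
    return header, rows


# Made-up inputs with partially overlapping columns.
tsv_a = "sample_id\treads\ns1\treads_s1.bam\n"
tsv_b = "sample_id\tassembly\ns2\tassembly_s2.fasta\n"
header, rows = concat_tsvs([tsv_a, tsv_b])
print(header)
print(rows)
```

Taking the union of columns is a design choice made here for illustration; a stricter variant could instead reject inputs whose headers differ.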

dpark01 commented 5 months ago

After a few runs on production data, the general methodological changes here look good. Some parameters probably still need tuning to improve the clustering/matching among the various rhino C species, but that can be explored separately and introduced in a later PR.