Closed dpark01 closed 5 months ago
After a few runs on production data, the general methodological changes here look good. Some parameter tuning probably remains in order to improve the clustering/matching between various rhino C species, but that can be experimented with separately and introduced in a later PR.
This PR introduces ANI-based mechanisms for reference selection using `skani`. This:

- adds `select_references` to tasks_assembly.wdl, which selects a subset of reference genomes based on ANI similarity to a set of given contigs/MAGs. It also clusters similar reference genomes to each other based on ANI and chooses only the top reference per cluster.
- changes `scaffold`'s behavior in tasks_assembly.wdl to choose a reference genome based on ANI when provided multiple reference genomes.
- updates `scaffold_and_refine_multitaxa`.

This updated version of `scaffold_and_refine_multitaxa` should be far more efficient at metagenomic reference selection, produce less noisy outputs (fewer secondary taxon hits), still allow for diverse coinfections of unrelated taxa, and perform well with a very large reference genome database as input.

This PR also introduces an unrelated change that modifies the workflow `terra_tsv_to_table` to accept and concatenate multiple input tsv files.
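
For readers unfamiliar with the "cluster by ANI, keep the top reference per cluster" idea described for `select_references`, here is a minimal Python sketch of one way such a selection could work. It is not the code in this PR and does not call `skani`. The function name `pick_cluster_representatives`, the greedy clustering strategy, and both thresholds (`min_contig_ani`, `cluster_ani`) are assumptions made for illustration, and the ANI values are made up.

```python
from typing import Dict, FrozenSet, List


def pick_cluster_representatives(
    contig_ani: Dict[str, float],
    ref_pair_ani: Dict[FrozenSet[str], float],
    min_contig_ani: float = 80.0,   # hypothetical threshold, not from the PR
    cluster_ani: float = 95.0,      # hypothetical threshold, not from the PR
) -> List[List[str]]:
    """Greedily cluster references; the first member of each cluster is its top hit.

    contig_ani:   reference name -> ANI (percent) against the contigs/MAGs
    ref_pair_ani: frozenset({ref_a, ref_b}) -> ANI (percent) between two references
    """
    # Drop references too dissimilar to the contigs, then rank best-first
    # (ties broken by name so the result is deterministic).
    candidates = sorted(
        (ref for ref, ani in contig_ani.items() if ani >= min_contig_ani),
        key=lambda ref: (-contig_ani[ref], ref),
    )

    clusters: List[List[str]] = []
    for ref in candidates:
        for cluster in clusters:
            representative = cluster[0]
            pair = frozenset((ref, representative))
            # Missing pairs are treated as unrelated (ANI of 0).
            if ref_pair_ani.get(pair, 0.0) >= cluster_ani:
                cluster.append(ref)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append([ref])
    return clusters


# Illustrative, made-up values.
contig_ani = {"refA": 98.2, "refB": 97.5, "refC": 91.0, "refD": 60.0}
ref_pair_ani = {
    frozenset(("refA", "refB")): 99.0,
    frozenset(("refA", "refC")): 82.0,
    frozenset(("refB", "refC")): 81.5,
}
clusters = pick_cluster_representatives(contig_ani, ref_pair_ani)
top_references = [cluster[0] for cluster in clusters]
print(top_references)
```

In this toy input, `refD` falls below the contig threshold and is dropped, `refB` joins the cluster led by `refA`, and `refC` starts its own cluster, so two unrelated taxa would each keep one reference. That is the coinfection case the description mentions.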