The current implementation of select_assembly.nf, which runs bin/select_assembly.py, does the following:
Identifies the dominant species: reads a Kraken2 report (kraken2.out.kraken2_screen) to determine the most dominant species in a sample.
Retrieves genome size: looks up the genome size and chromosome number for the identified species from an NCBI lookup file (get_ncbi.out.ncbi_lookup).
Selects chromosomal contigs: identifies chromosomal contigs in the Flye assembly (not the Unicycler assembly) based on the expected genome size.
Compares contigs: checks if these contigs match those in a reconciled clusters directory and decides which set to use.
Organises output: sorts the contigs into final and discarded folders, separates chromosomal and non-chromosomal contigs, and filters contig information.
Produces final output: creates a flag file indicating whether the consensus or Flye-only contigs were used.
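As a rough illustration of the first step above, here is a minimal sketch of picking the dominant species from a Kraken2 report. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, taxon reads, rank code, taxid, indented name) and that "dominant" means the species-rank row with the highest clade percentage. The function name dominant_species is hypothetical and this is not necessarily how bin/select_assembly.py does it.

```python
def dominant_species(report_lines):
    """Return (name, percent) for the species-rank row with the highest
    clade percentage, or None if the report has no species-rank rows.

    Assumes the standard six-column Kraken2 report layout.
    """
    best = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent = float(fields[0])
        rank_code = fields[3]
        name = fields[5].strip()  # names are indented by taxonomic depth
        if rank_code != "S":
            continue  # only consider species-level rows
        if best is None or percent > best[1]:
            best = (name, percent)
    return best


# Small made-up report, for illustration only
example_report = [
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t0\tR\t1\troot",
    " 60.00\t600\t600\tS\t562\t          Escherichia coli",
    " 30.00\t300\t300\tS\t573\t          Klebsiella pneumoniae",
]
print(dominant_species(example_report))
```

In practice the lines would come from the file emitted as kraken2.out.kraken2_screen, e.g. by iterating over the open file handle.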
This method is biased toward published reference assemblies available on NCBI. After user feedback, we feel we should explore whether we need to clean the assemblies so heavily that only contigs matching the reference are selected.
Noting some changes/ideas that should be shipped with this:
Refactor this so that only a single "best" assembly is passed to the downstream steps (e.g. find amr, annotation etc.), instead of having separate processes defined for e.g. either Flye or consensus.
Assess all assemblies; the current implementation does not consider the Unicycler assembly.
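To make the single "best" assembly idea concrete, here is a small sketch of ranking every candidate (Flye, Unicycler, consensus) with one function and returning one winner. The ranking criteria used here (fewest contigs, ties broken by highest N50) are placeholders chosen only because they need no reference. The Assembly class and the pick_best function are hypothetical, and the real criteria still need to be agreed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Assembly:
    name: str                  # e.g. "flye", "unicycler", "consensus"
    contig_lengths: List[int]  # length of each contig in bp


def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def pick_best(assemblies):
    """Return the single assembly to pass downstream.

    Placeholder ranking: fewest contigs first, then highest N50.
    Empty assemblies are ignored.
    """
    candidates = [a for a in assemblies if a.contig_lengths]
    if not candidates:
        raise ValueError("no non-empty assemblies to choose from")
    return min(
        candidates,
        key=lambda a: (len(a.contig_lengths), -n50(a.contig_lengths)),
    )


# Made-up contig lengths, for illustration only
candidates = [
    Assembly("flye", [4_000_000, 1_100_000]),
    Assembly("unicycler", [3_000_000, 2_000_000, 100_000]),
    Assembly("consensus", [5_000_000, 100_000]),
]
best = pick_best(candidates)
print(best.name)
```

With one selection step like this, downstream processes would only need a single input channel rather than one per assembler.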