Arcadia-Science / seqqc

A Nextflow pipeline to identify quality control issues with new sequencing data.
MIT License

Including sourmash taxonomy annotate in seqqc #18

Open taylorreiter opened 1 year ago

taylorreiter commented 1 year ago

sourmash taxonomy annotate adds taxonomic lineages to sourmash gather results. Adding a taxonomy process to the seqqc pipeline would allow the multiqc outputs to be summarized at each level of the taxonomic lineage. This is useful because the "Top 5" category in the multiqc report also aggregates across levels: while only ~5% of a sample might be classifiable to the top 5 genomes, ~50% could be classifiable to the top 5 phyla (illustrative numbers). This would give insight into a larger fraction of the classifiable sample (see the kraken multiqc report linked below).
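To make the aggregation concrete, here is a minimal Python sketch of rolling per-genome fractions up to a chosen rank. It assumes the annotated gather CSV has a semicolon-separated `lineage` column and a per-genome fraction column named `f_unique_weighted`; both column names should be checked against the actual output. The function names, the rank list, and the toy rows are made up for illustration and are not part of seqqc or sourmash.

```python
import csv
from collections import defaultdict

# Assumed rank order of the semicolon-separated lineage string.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def read_annotated(path):
    """Load an annotated gather CSV into a list of dict rows."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def summarize_at_rank(rows, rank, top_n=5, fraction_col="f_unique_weighted"):
    """Sum per-genome fractions by lineage prefix ending at `rank`.

    Returns the `top_n` (lineage_prefix, summed_fraction) pairs,
    largest fraction first.
    """
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for row in rows:
        taxa = row["lineage"].split(";")
        if len(taxa) < depth:
            # Lineage is not resolved down to this rank; skip it.
            continue
        totals[";".join(taxa[:depth])] += float(row[fraction_col])
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy rows (invented values) standing in for an annotated gather result.
example_rows = [
    {"lineage": "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;"
                "Enterobacteriaceae;Escherichia;Escherichia coli",
     "f_unique_weighted": "0.02"},
    {"lineage": "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;"
                "Enterobacteriaceae;Salmonella;Salmonella enterica",
     "f_unique_weighted": "0.03"},
    {"lineage": "Bacteria;Firmicutes;Bacilli;Bacillales;"
                "Bacillaceae;Bacillus;Bacillus subtilis",
     "f_unique_weighted": "0.01"},
]

by_phylum = summarize_at_rank(example_rows, "phylum")
by_species = summarize_at_rank(example_rows, "species")
```

In the toy data, the two Proteobacteria genomes collapse into a single phylum-level entry whose fraction is larger than either genome alone, which is the effect described above. A multiqc module could call something like `summarize_at_rank` once per rank to build the per-level "Top 5" bars.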

Adding this process is an enhancement; the pipeline is useful without it. It will require the sourmash taxonomy annotate nf-core module (see the PR linked below) and a new multiqc module. The multiqc code can be modeled on the kraken module (see the link to the kraken report below) and the sourmash gather module.

PR for nf-core module for sourmash taxonomy: https://github.com/nf-core/modules/pull/2422
Taxonomy sheet for sourmash gather contamination database: https://osf.io/jpdte
Link to kraken report that summarizes up taxonomic levels: https://telatin.github.io/microbiome-bioinformatics/data/multiqc/#kraken