Arcadia-Science / metagenomics

A Nextflow workflow for QC, evaluation, and profiling of metagenomic samples using short- and long-read technologies
MIT License

Sourmash gather implementation #49

Closed elizabethmcd closed 1 year ago

elizabethmcd commented 1 year ago

This pull request adds support for sourmash gather by requiring an input CSV that lists the paths to the gather databases and their corresponding lineage CSV files. I have not added full support for taxannotate yet. The step is in the code but commented out, because I still need to find a test gather database that actually matches my test samples so I can confirm it works properly.
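
For readers unfamiliar with this kind of input sheet, here is a minimal Python sketch of reading a CSV of database and lineage paths. It is only an illustration, not the Nextflow code in this PR; the column names `database_path` and `lineage_path` and the example file paths are assumptions made up for the sketch.

```python
import csv
import io

# Hypothetical sheet layout: the column names and paths are assumptions,
# not the schema actually used by the workflow.
EXAMPLE_CSV = """database_path,lineage_path
db/viral-k31.zip,db/viral.lineages.csv
db/bacteria-k31.zip,db/bacteria.lineages.csv
"""


def read_database_sheet(handle):
    """Return a list of (database, lineage) path pairs from an open CSV handle."""
    pairs = []
    for row in csv.DictReader(handle):
        database = row["database_path"].strip()
        lineage = row["lineage_path"].strip()
        # Each database needs a lineage file for taxonomy annotation later on.
        if not database or not lineage:
            raise ValueError(f"incomplete row: {row}")
        pairs.append((database, lineage))
    return pairs


database_pairs = read_database_sheet(io.StringIO(EXAMPLE_CSV))
```

Keeping each database next to its lineage file in one row means the pairing cannot drift out of sync when databases are added or removed.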

It sets up the gather channels so that each sample-database pair runs separately, even though sourmash gather can take a list of all the databases at once, which could make the output easier to deal with. I couldn't figure out how to cleanly handle this as input while tracking which databases were run, so I went with the CSV input and splitting it into separate channels.
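
To make the per-pair behavior concrete, here is a small Python sketch of the idea: every sample is crossed with every database, so each job knows exactly which database it ran against. This is an analogy for the channel setup, not the workflow code, and the sample and database names are hypothetical.

```python
from itertools import product

# Hypothetical sample and database names, for illustration only.
samples = ["sampleA", "sampleB"]
databases = ["viral-k31.zip", "bacteria-k31.zip"]

# One job per (sample, database) pair, so the database is tracked per job.
per_pair_jobs = [
    {"sample": sample, "database": database}
    for sample, database in product(samples, databases)
]
```

With two samples and two databases this gives four jobs and four separate outputs, which is the trade-off described above: tracking is easy, but results are spread across more files.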

The test CI workflows should work because I put the sourmash k31 viral database in the Arcadia-Science/test-datasets repo, as this was the smallest thing I could think of to work with. The other option would be the contamination database we use in the seqqc workflow, but I'm not sure if there's a lineage CSV for that. This is all TBD, and I would appreciate any and all feedback, as I had to do quite a bit of trial and error to figure out how best to configure this.

elizabethmcd commented 1 year ago

I have fixed the channel input for sourmash gather and sourmash taxonomy annotate so that all databases listed in the input CSV are run at the same time for each sample. The first part (needing test databases that match the samples) is addressed by updating the test-datasets repo to include sourmash databases that are small and do hit against the samples, in this PR: https://github.com/Arcadia-Science/test-datasets/pull/21. In some cases they hit against the reads but not the assemblies, so I've also made changes to account for that. The second part (running all databases together) is addressed by reading in the CSV and using the .collect() and .toList() methods.
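
As a rough illustration of the collected approach, the Python sketch below gathers all database paths into one list and builds a single gather command per sample. It is not the Nextflow code from this PR. The file names and the `.sig` and `.gather.csv` naming are assumptions, and the command layout (query signature, then databases, then `-o` for the output CSV) is meant only as an example of passing several databases to one `sourmash gather` call.

```python
def build_gather_commands(samples, database_pairs):
    """Build one gather command per sample that includes every database."""
    # Collect all database paths into a single list, analogous to collecting
    # a channel into one value before combining it with each sample.
    databases = [database for database, _lineage in database_pairs]
    commands = {}
    for sample in samples:
        commands[sample] = [
            "sourmash", "gather",
            f"{sample}.sig",            # hypothetical query signature name
            *databases,                 # every database in one invocation
            "-o", f"{sample}.gather.csv",
        ]
    return commands


# Hypothetical (database, lineage) pairs, for illustration only.
example_pairs = [
    ("viral-k31.zip", "viral.lineages.csv"),
    ("bacteria-k31.zip", "bacteria.lineages.csv"),
]
gather_commands = build_gather_commands(["sampleA", "sampleB"], example_pairs)
```

This yields one command and one output file per sample regardless of how many databases are listed.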

elizabethmcd commented 1 year ago

I've added more comments to make it clear why I have things set up the way I do and that they work as intended.