Closed elizabethmcd closed 1 year ago
I have fixed the channel input for sourmash gather and sourmash taxonomy annotate so that all databases listed in the input CSV are run at the same time for each sample. The first part is addressed by fixing the test-datasets repo to include sourmash databases that are small and do hit against the samples (in some cases they hit against the reads but not the assemblies, so I've also made changes to control for that), in this PR: https://github.com/Arcadia-Science/test-datasets/pull/21. The second part is addressed by reading in the CSV and using the .collect() and .toList() methods.
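To illustrate the "collect everything, then run once per sample" idea outside of Nextflow, here is a minimal plain-Python sketch. It is not the workflow code from this PR: the CSV column names (`name`, `database`, `lineage`), the file names, and the helper functions are all hypothetical and only mimic what collecting a channel into a single list achieves.

```python
import csv
import io

# Hypothetical database sheet; column names and paths are assumptions,
# not the actual schema used by the workflow.
DATABASES_CSV = """name,database,lineage
viral,viral-k31.zip,viral.lineages.csv
bacterial,bacterial-k31.zip,bacterial.lineages.csv
"""


def collect_databases(csv_text):
    """Read the sheet and gather every database path into one list."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["database"] for row in rows]


def gather_inputs(samples, databases):
    """Build one (sample, all databases) input per sample."""
    return [(sample, list(databases)) for sample in samples]


inputs = gather_inputs(
    ["sample1.sig", "sample2.sig"],  # hypothetical sample signatures
    collect_databases(DATABASES_CSV),
)
for sample, dbs in inputs:
    print(sample, dbs)
```

Each sample ends up paired with the full database list, so there is one gather run per sample rather than one per sample-database pair.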
I've added more comments to make it clear why things are set up the way they are and that they work as intended.
This pull request adds support for sourmash gather by requiring an input CSV listing the paths to the gather databases and their corresponding lineage CSV files. I have not added full support for taxannotate just yet, although it is added and commented out, because I need to find a test gather database that has some matches to my test samples to ensure that it's working properly.

It sets up the gather channels so that each sample-database pair runs separately, although sourmash gather can take a list of all the databases, which could make the output easier to deal with. I couldn't figure out how to cleanly handle this as input while tracking which databases were run, so I went with the CSV input and splitting it into different channels.

The test CI workflows should work, as I put the sourmash k31 viral database in the Arcadia-Science/test-datasets repo; this was the smallest thing I could think of to work with. The other option would be the contamination database we use in the seqqc workflow, but I'm not sure if there's a lineage CSV for that. All TBD, and I would appreciate any and all feedback, as I had to do quite a bit of trial and error to figure out how best to configure this.
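For reviewers, the per-pair setup described above amounts to a cross product of samples and databases, with the database recorded on each job so the outputs can be traced back. A rough plain-Python sketch of that bookkeeping follows; the file names and the `pair_jobs` helper are hypothetical and this is not the Nextflow code in the PR.

```python
from itertools import product

# Hypothetical inputs, for illustration only.
samples = ["sample1.sig", "sample2.sig"]
databases = ["viral-k31.zip", "bacterial-k31.zip"]


def pair_jobs(samples, databases):
    """Create one job per sample-database pair, keeping track of which database was used."""
    return [{"sample": s, "database": d} for s, d in product(samples, databases)]


jobs = pair_jobs(samples, databases)
for job in jobs:
    print(job["sample"], "vs", job["database"])
```

With two samples and two databases this yields four separate jobs, which makes tracking straightforward at the cost of more output files to merge later.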