Closed charlesreid1 closed 6 years ago
The README instructions that should be turned into Snakemake files start here.
Some scripts useful for bootstrapping a fresh AWS node are at dahak-yeti (yeti is the name I'm using for beefy AWS nodes).
written and working:
Note: bash shell scripts linked above have all been converted over to Python, for eventual translation into Snakemake.
The scripts mentioned below are in the scripts directory of dahak-yeti. You can check out a copy of this repository on a fresh AWS node and run the entire workflow from start to finish.
Part 1 is the setup for the workflow:
Part 2 is the taxonomic classification workflow:
Note: the compare_components script has not been converted yet.
Good news: each step of the taxonomic classification workflow now runs as a Python script that shells out to the underlying tools (subprocess, glob, etc.)! These could use some cleanup, but each one has been tested.
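For readers who have not seen the pattern, here is a minimal sketch of what a "Python shell script" step of this kind might look like. It is not one of the actual dahak-yeti scripts: the function names, the `*.fq.gz` pattern, and the `echo` stand-in command are all assumptions made for illustration.

```python
import glob
import os
import subprocess


def build_commands(data_dir, pattern="*.fq.gz", tool=("echo", "processing")):
    """Return one command (as an argument list) per file matching pattern.

    `tool` is a stand-in for the real command line of a workflow step.
    """
    files = sorted(glob.glob(os.path.join(data_dir, pattern)))
    return [list(tool) + [path] for path in files]


def run_step(data_dir, pattern="*.fq.gz", tool=("echo", "processing"), dry_run=False):
    """Run the tool once per matching file and return the commands used."""
    commands = build_commands(data_dir, pattern, tool)
    for cmd in commands:
        if not dry_run:
            # check=True raises CalledProcessError if the tool exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Keeping the command construction separate from the execution is a design choice that should make the later translation easier: the glob pattern maps roughly onto a rule's input, and the argument list onto its shell block.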
Next step is to create a Snakefile from these Python scripts. We still have outstanding questions about Snakemake workflows and dependencies, but for now I am going to keep the taxonomic workflow Snakemake file standalone. It can be incorporated with other workflow Snakemake files at a later time.
Created a repository for deploying Snakemake workflows on a fresh AWS node. It's called dahak-flot and will contain a few Snakefiles:
(Note that dahak-flot is not the final version of any workflow; final versions of everything will be in the official dahak repos.)
As mentioned on Slack, we should take a look at the ymp project and take some inspiration from their Stage object.
A Stage is a directory that is a self-contained group of steps, but that occurs in a particular order. The Stage object is just a wrapper around normal Snakemake rules, but it provides some magic keywords, {:this:}, {:that:}, and {:prev:}, for referring to the current stage's directory, as well as the directory of prior stages.
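To make the idea concrete, here is a rough sketch of how keywords like these could be expanded into directory names. This is not ymp's implementation or API; the function, the stage names, and the path template are hypothetical, and {:that:} is left out because its exact meaning is not described in this thread.

```python
import re

# Matches only the {:this:} and {:prev:} keywords, leaving ordinary
# Snakemake-style wildcards such as {sample} untouched.
STAGE_KEYWORD = re.compile(r"\{:(this|prev):\}")


def expand_stage_keywords(template, stages, current):
    """Replace {:this:} and {:prev:} in template with stage directory names.

    stages  -- ordered list of stage directory names (hypothetical)
    current -- name of the stage that the template belongs to
    """
    index = stages.index(current)

    def substitute(match):
        if match.group(1) == "this":
            return stages[index]
        if index == 0:
            raise ValueError("the first stage has no previous stage")
        return stages[index - 1]

    return STAGE_KEYWORD.sub(substitute, template)


# Hypothetical stage order for illustration only.
stages = ["trim", "sourmash", "kaiju"]
input_path = expand_stage_keywords("{:prev:}/{sample}.fq.gz", stages, "sourmash")
output_path = expand_stage_keywords("{:this:}/{sample}.sig", stages, "sourmash")
```

The appeal for dahak is that a rule written this way never names its upstream directory directly, so stages could be reordered without editing paths.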
For an example of how this is used, see bbmap.rules line 44.
This concept and the examples in ymp will definitely be useful in developing dahak workflows.
Partially tested Snakefile for taxonomic classification has been added to the dahak-flot repo here. This implements one of many ways of organizing the Snakefile, and there are still further improvements to make...
One of the biggest challenges I've had with Snakemake is the fact that even though you have the {} syntax and mini-formatting language, a lot of the filenames end up being hard-coded because of the way {variable} in the inputs/outputs block becomes {wildcards.variable} in the shell/run block, or the variables need to be modified to change the output path or extension, etc.
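The {variable} versus {wildcards.variable} split can be imitated in plain Python, which may help show why the two blocks look different. The sketch below is a simplified emulation, not Snakemake's actual code: `match_wildcards`, `format_shell`, the file pattern, and the sample name are all made up for illustration, and it assumes each wildcard name appears only once in the pattern.

```python
import re
from types import SimpleNamespace


def match_wildcards(pattern, filename):
    """Extract wildcard values by matching filename against a {name} pattern."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text between wildcards must match exactly.
        regex += re.escape(pattern[pos:m.start()])
        # Each wildcard becomes a named group matching one or more characters.
        regex += "(?P<%s>.+)" % m.group(1)
        pos = m.end()
    regex += re.escape(pattern[pos:])
    match = re.fullmatch(regex, filename)
    if match is None:
        raise ValueError("%r does not match pattern %r" % (filename, pattern))
    return SimpleNamespace(**match.groupdict())


def format_shell(template, wildcards):
    """Fill a shell template that refers to values as {wildcards.name}."""
    return template.format(wildcards=wildcards)


# In the output block the pattern uses bare names ...
output_pattern = "{sample}_trim{qual}.fq.gz"
wildcards = match_wildcards(output_pattern, "sampleA_trim2.fq.gz")

# ... but in the shell block the same values live on the wildcards object.
command = format_shell(
    "echo sample={wildcards.sample} qual={wildcards.qual}", wildcards
)
```

In this emulation the bare names only exist while the pattern is being matched; afterwards the values are attributes of a single object, which is why the shell template has to spell out `wildcards.` each time.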
The current Snakefile is monolithic (every rule is in one big Snakefile). Next step is to break each rule into a separate file, and then run through the remaining tests. Once it's been verified, we'll close this issue and (finally!) have our taxonomic workflow Snakefile.
That would be a good opportunity to spend some time on the following:
I'll expand on these ideas in the Projects section and add issues as needed, just collecting these thoughts here.
+1
Completed Snakefile: Snakefile in dahak-flot repo
The instructions in the taxonomic classification README file should be converted to a Snakemake file.