dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

Create taxonomic classification snakemake file #45

Closed: charlesreid1 closed this issue 6 years ago

charlesreid1 commented 6 years ago

The instructions in the taxonomic classification README file should be converted to a Snakemake file.

charlesreid1 commented 6 years ago

The README instructions that should be turned into Snakemake files start here.

charlesreid1 commented 6 years ago

Some scripts useful for bootstrapping a fresh AWS node are at dahak-yeti (yeti is the name I'm using for beefy AWS nodes).

charlesreid1 commented 6 years ago

written and working:

charlesreid1 commented 6 years ago

Note: bash shell scripts linked above have all been converted over to Python, for eventual translation into Snakemake.

Scripts mentioned below are in the scripts directory of dahak-yeti. You can check out a copy of this repository on a fresh AWS node, and have the entire workflow run start to finish.

Part 1 is the setup for the workflow:

Part 2 is the taxonomic classification workflow:

Note: compare_components script still has not been converted.

charlesreid1 commented 6 years ago

Good news: each step of the taxonomic classification workflow now runs as a Python script that wraps the shell commands (subprocess, glob, etc.)! The scripts could use some cleanup, but each one has been tested.
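
For readers unfamiliar with this pattern, here is a minimal sketch of a Python script wrapping one shell step with subprocess and glob. It is not the actual dahak-yeti code: the function name run_step, the "{input}" placeholder convention, and the file pattern in the usage note are assumptions made for illustration.

```python
import glob
import subprocess


def run_step(pattern, command_template):
    """Run one command per file matching `pattern` and return the files processed.

    `command_template` is a list of arguments; any "{input}" in an argument
    is replaced by the matched filename (a hypothetical convention).
    """
    processed = []
    for path in sorted(glob.glob(pattern)):
        # Substitute the matched file into each argument of the command.
        args = [part.replace("{input}", path) for part in command_template]
        # check=True raises CalledProcessError if the command exits non-zero,
        # so a failed step stops the script instead of passing silently.
        subprocess.run(args, check=True)
        processed.append(path)
    return processed
```

A call such as run_step("data/*.fq.gz", ["gzip", "-t", "{input}"]) would test every matching archive in turn. Passing the command as a list rather than a single string avoids shell quoting problems with unusual filenames.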

Next step is to create a Snakefile from these Python scripts. We still have outstanding questions about Snakemake workflows and dependencies, but for now I am going to keep the taxonomic workflow Snakemake file standalone. It can be incorporated with other workflow Snakemake files at a later time.

charlesreid1 commented 6 years ago

Created a repository for deploying Snakemake workflows on a fresh AWS node. It's called dahak-flot and will contain a few Snakefiles:

(Note that dahak-flot is not the final version of any workflow; final versions of everything will be in the official dahak repos.)

charlesreid1 commented 6 years ago

As mentioned on Slack, we should take a look at the ymp project and take some inspiration from their Stage object.

A Stage is a directory holding a self-contained group of steps, and stages run in a particular order. The Stage object is just a wrapper around normal Snakemake rules, but it provides some magic keywords, {:this:}, {:that:}, and {:prev:}, for referring to the current stage's directory as well as the directories of prior stages.
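
To make the idea concrete, here is a toy sketch of how such placeholders could be expanded. This is not ymp's implementation: the function expand_stage_paths and the stage names are hypothetical, and {:that:} is left out because its exact meaning is not described here.

```python
def expand_stage_paths(template, stages, current):
    """Replace {:this:} and {:prev:} in `template` with stage directory names.

    `stages` is the ordered list of stage directories and `current` is the
    stage whose rule is being expanded. Ordinary {wildcard} fields are left
    untouched for a later formatting pass.
    """
    idx = stages.index(current)
    if "{:prev:}" in template and idx == 0:
        raise ValueError("the first stage has no previous stage")
    result = template.replace("{:this:}", stages[idx])
    if idx > 0:
        result = result.replace("{:prev:}", stages[idx - 1])
    return result
```

With stages ["raw", "trim", "classify"], the input template "{:prev:}/{sample}.fq.gz" expands to "raw/{sample}.fq.gz" when the current stage is "trim", so a rule never has to hard-code the directory it reads from.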

For an example of how this is used, see bbmap.rules line 44.

This concept and the examples in ymp will definitely be useful in developing dahak workflows.

charlesreid1 commented 6 years ago

Partially tested Snakefile for taxonomic classification has been added to the dahak-flot repo here. This implements one of many ways of organizing the Snakefile, and there are still further improvements to make...

One of the biggest challenges I've had with Snakemake is that, even though you have the {} syntax and mini-formatting language, a lot of the filenames end up being hard-coded. This happens for two reasons: {variable} in the input/output block becomes {wildcards.variable} in the shell/run block, and the variables often need to be modified to change the output path or extension.
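
The naming mismatch can be illustrated with plain Python string formatting. The sketch below only mimics the behavior and does not use Snakemake itself; render_rule, the sample name, and the echo command are assumptions for illustration.

```python
from types import SimpleNamespace


def render_rule(output_pattern, shell_template, **wildcard_values):
    """Mimic how one wildcard is spelled differently in two places.

    In the output pattern the wildcard is written {sample}; in the shell
    template the same value is reached as {wildcards.sample}.
    """
    wildcards = SimpleNamespace(**wildcard_values)
    # Output patterns use the bare wildcard name.
    output = output_pattern.format(**wildcard_values)
    # Shell templates use attribute access on a `wildcards` object.
    command = shell_template.format(wildcards=wildcards, output=output)
    return output, command
```

Calling render_rule("{sample}.report.txt", "echo {wildcards.sample} > {output}", sample="SRR0001") returns the output name "SRR0001.report.txt" and the command "echo SRR0001 > SRR0001.report.txt". Any change to the path or extension has to be made consistently in both spellings, which is where hard-coded filenames tend to creep in.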

charlesreid1 commented 6 years ago

The current Snakefile is monolithic (every rule is in one big Snakefile). Next step is to break each rule into a separate file, and then run through the remaining tests. Once it's been verified, we'll close this issue and (finally!) have our taxonomic workflow Snakefile.

That would be a good opportunity to spend some time on the following:




I'll expand on these ideas in the Projects section and add issues as needed; for now I'm just collecting these thoughts here.

ctb commented 6 years ago

+1

charlesreid1 commented 6 years ago

Completed Snakefile: Snakefile in dahak-flot repo