Create taxonomic classification snakemake file

charlesreid1 commented 6 years ago

The instructions in the taxonomic classification README file should be converted to a Snakemake file.

charlesreid1 commented 6 years ago

The README instructions that should be turned into Snakemake files start here.

charlesreid1 commented 6 years ago

Some scripts useful for bootstrapping a fresh AWS node are at dahak-yeti (yeti is the name I'm using for beefy AWS nodes).

charlesreid1 commented 6 years ago

written and working:

[x] script to install python/conda/snakemake
- install_pyenv.sh
- install_snakemake.sh
[x] script to install singularity
- install_singularity.sh
[x] script to pull biocontainers
- prepare_biocontainers.sh
[x] script to get trimmed data
- get_trimmed_data.py
[x] script to get sequence bloom trees
- get_sbt.sh
[x] script to calculate signatures
- calculate_signatures.sh
[x] script to compare signatures to database
- compare_components.sh
[x] script to download/unpack kaiju database
- unpack_kaiju.sh
[x] script to run kaiju
- run_kaiju.sh
[ ] compare_components script
- ~~waiting for compare_components.sh script to finish running.~~
- compare_components.sh script was running okay, but ended up running for over 48 hours, so I killed it.
[x] script to convert kaiju output to krona output
[x] script to filter reads
[x] script to pull krona container image
[x] script to generate krona output

charlesreid1 commented 6 years ago

Note: bash shell scripts linked above have all been converted over to Python, for eventual translation into Snakemake.

Scripts mentioned below are in the scripts directory of dahak-yeti. You can check out a copy of this repository on a fresh AWS node, and have the entire workflow run start to finish.

Part 1 is the setup for the workflow:

run_all_part1.sh runs three installation scripts for pyenv, snakemake, and singularity

Part 2 is the taxonomic classification workflow:

run_all_part2.sh runs the remaining scripts in the taxonomic classification workflow

Note: compare_components script still has not been converted.

charlesreid1 commented 6 years ago

Good news: each step of the taxonomic classification workflow is now running with Python shell scripts (subprocess, glob, etc.)! These can use some cleanup, but they have each been tested.
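
For anyone who hasn't looked at them, the scripts follow roughly the pattern below. This is a minimal sketch, not the actual dahak-yeti code: the function names, the `*.fq.gz` pattern, the `data` directory, and the `echo` command are all placeholders I'm assuming for illustration.

```python
import glob
import os
import subprocess


def build_commands(data_dir, pattern="*.fq.gz"):
    """Build one command per file in data_dir that matches pattern.

    `echo` is a stand-in (assumption) for the real tool invocation.
    """
    commands = []
    for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
        commands.append(["echo", "processing", path])
    return commands


def run_commands(commands):
    """Run each command in order; check=True raises on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "data" is a hypothetical directory name; no matches means nothing runs.
    run_commands(build_commands("data"))
```

Keeping the command construction separate from the execution makes it easier to move each step into a Snakemake rule later, since the rule only needs the command, not the loop.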

Next step is to create a Snakefile from these Python scripts. We still have outstanding questions about Snakemake workflows and dependencies, but for now I am going to keep the taxonomic workflow Snakemake file standalone. It can be incorporated with other workflow Snakemake files at a later time.

charlesreid1 commented 6 years ago

Created a repository for deploying Snakemake workflows on a fresh AWS node. It's called dahak-flot and will contain a few Snakefiles:

one or two illustrative Snakefiles demonstrating how to branch different workflows
taxonomic classification workflow Snakefile
other workflows (future work)

(Note that dahak-flot is not the final version of any workflow; final versions of everything will be in official dahak repos.)

charlesreid1 commented 6 years ago

As mentioned on Slack, we should take a look at the ymp project and take some inspiration from their Stage object.

A Stage is a directory holding a self-contained group of steps, and stages run in a particular order. The Stage object is just a wrapper around normal Snakemake rules, but it provides some magic keywords, {:this:}, {:that:}, and {:prev:}, for referring to the current stage's directory and to the directories of prior stages.

For an example of how this is used, see bbmap.rules line 44.

This concept and the examples in ymp will definitely be useful in developing dahak workflows.
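
To make the idea concrete, here is a toy sketch of how placeholders like these could be resolved against an ordered list of stages. This is not ymp's implementation: the `resolve` function and the stage names `trim` and `classify` are hypothetical, and I'm only handling {:this:} and {:prev:} here.

```python
def resolve(pattern, stages, current):
    """Replace {:this:} and {:prev:} in pattern with stage directory names.

    stages is an ordered list of stage names; current is the stage whose
    rules are being written. Toy illustration only, not ymp's code.
    """
    idx = stages.index(current)
    result = pattern.replace("{:this:}", stages[idx])
    if "{:prev:}" in result:
        if idx == 0:
            raise ValueError("the first stage has no previous stage")
        result = result.replace("{:prev:}", stages[idx - 1])
    return result


if __name__ == "__main__":
    stages = ["trim", "classify"]  # hypothetical stage names
    print(resolve("{:prev:}/{sample}.fq.gz", stages, "classify"))
    print(resolve("{:this:}/{sample}.out", stages, "classify"))
```

The ordinary {sample} wildcard is left untouched, so the resolved string can still be used as a normal Snakemake input or output pattern.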

charlesreid1 commented 6 years ago

Partially tested Snakefile for taxonomic classification has been added to the dahak-flot repo here. This implements one of many ways of organizing the Snakefile, and there are still further improvements to make...

One of the biggest challenges I've had with Snakemake: even though you have the {} syntax and mini-formatting language, a lot of the filenames end up being hard-coded. Partly that's because {variable} in the input/output blocks becomes {wildcards.variable} in the shell/run block, and partly because the variables often need to be modified to change the output path or extension.
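
The mismatch is easier to see side by side. The snippet below is plain Python that imitates the two spellings with `str.format`; it is not Snakemake code or Snakemake internals, and `some_tool`, the helper names, and the `reads_1` sample name are made up for the example.

```python
from types import SimpleNamespace

# Input/output blocks use the bare wildcard name.
output_pattern = "{sample}.sig"

# Shell blocks reach the same value through the wildcards object.
shell_pattern = "some_tool {wildcards.sample}.fq.gz > {output}"


def expand_output(pattern, **wildcards):
    """Fill an input/output style pattern, e.g. {sample}."""
    return pattern.format(**wildcards)


def expand_shell(pattern, output, **wildcards):
    """Fill a shell style pattern, e.g. {wildcards.sample} and {output}."""
    return pattern.format(wildcards=SimpleNamespace(**wildcards), output=output)


if __name__ == "__main__":
    out = expand_output(output_pattern, sample="reads_1")
    print(out)
    print(expand_shell(shell_pattern, out, sample="reads_1"))
```

Because the two patterns name the same value differently, a filename template can't simply be copied from the output block into the shell block, which is where a lot of the hard-coding creeps in.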

charlesreid1 commented 6 years ago

The current Snakefile is monolithic (every rule is in one big Snakefile). Next step is to break each rule into a separate file, and then run through the remaining tests. Once it's been verified, we'll close this issue and (finally!) have our taxonomic workflow Snakefile.

That would be a good opportunity to spend some time on the following:

Documenting the process
- Particularly, we should address the complication of proliferating versions of things
- Currently: a readme with step-by-step shell commands, a shell script, a Python script, a Snakefile, etc.
- What do we want to include in documentation?
- Starting Sphinx workflow for documentation (addressing #43)

Standardizing the process
- Snakefile "best practices": get everyone's buy-in before writing lots of different Snakefiles with totally different styles
- Settle on (first pass) method for testing workflows (schedule? AWS resources?)
- Scripts to deploy AWS infrastructure, passing init script to machines, copying in Snakefile scripts (poss. other data as well)

Benchmarking the process
- Using the spy server to profile and benchmark the workflows, particularly during testing
- Further benefits from automating AWS infrastructure

I'll expand on these ideas in the Projects section and add issues as needed, just collecting these thoughts here.

ctb commented 6 years ago

+1

charlesreid1 commented 6 years ago

Completed Snakefile: Snakefile in dahak-flot repo

dahak-metagenomics / dahak

Create taxonomic classification snakemake file #45