Structure of training module

PaulHancock commented 2 years ago

I suggest we work through the content in the following order (but not necessarily divided into these particular episodes).

Workflows / pipelines (without mentioning Nextflow)
- Talk about how much of our research work can be thought of as a workflow or pipeline, how this can be visualised as a flowchart, and how we break the work into blocks of tasks, the information/data passed between those blocks, and some optional flow control.
- Get people thinking about how data move through this workflow (a data-driven workflow).
- Do a small exercise where people take some of their own work and map out a basic workflow, defining the tasks and what data flow between them.
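
As a possible aid for this first episode, here is a minimal sketch of the "blocks of work plus data passed between them" idea in plain Python, with no workflow manager involved. The three task functions, the `WORKFLOW` list, and `run_workflow` are all hypothetical names made up for illustration.

```python
# Hypothetical three-step workflow: each task is a plain function, and the
# values passed between them are the data that flow along the flowchart arrows.

def load_samples(raw_text):
    """Task 1: split raw text into one record per non-empty line."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def measure(samples):
    """Task 2: compute a value (here, the length) for each sample."""
    return {sample: len(sample) for sample in samples}


def summarise(measurements):
    """Task 3: reduce the per-sample values to a single report line."""
    total = sum(measurements.values())
    return f"{len(measurements)} samples, {total} characters in total"


# The workflow is the ordered list of tasks: the output of one task is the
# input of the next.
WORKFLOW = [load_samples, measure, summarise]


def run_workflow(tasks, data):
    """Push the data through each task in order and return the final result."""
    for task in tasks:
        data = task(data)
    return data


if __name__ == "__main__":
    # prints "3 samples, 14 characters in total"
    print(run_workflow(WORKFLOW, "alpha\nbeta\n\ngamma\n"))
```

Learners could sketch their own work in the same shape (task names plus what each one consumes and produces) before drawing it as a flowchart.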
Intro to Nextflow (channels but no processes yet)
- Talk about all of the above again but in the context of Nextflow.
- Note that Nextflow is a data-driven workflow manager, so we spend a lot of time thinking about the links (channels) between the tasks (processes). The way that we define a process may need to change to accommodate the required input/output (or we can change the channels).
- Start with a workflow that has only channel manipulation and no processes.
- Show people how to run Nextflow and interpret the interface
- Do a few exercises that focus on channel manipulation:
  - split/gather/glob/csv/path/of
Nextflow workflows (working with processes)
- Once they are sick of channel manipulation, introduce a process which creates a file using an input filename and/or echoes something into a file.
- Show/explain how the work and results directories work, all the hidden .command.* files in the work directories, and how the CLI shows you the (start of the) work directory name for each process. Also show people the trace.txt file that contains the full set of process->workdir mappings for easier debugging.
- Give a fuller description of the parts of a process, including: inputs, outputs, script/shell/exec, publish, when.
- Introduce DSL2 and show how you can pipe the output -> input. Do an exercise where you have to manipulate the channel before passing it on: map/gather/transmute.
- Introduce the params namespace and show people how to set things in the workflow, in the config file, and from the command line
- Talk about restarting a workflow and demonstrate the use of cached results.
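
To motivate the caching discussion, a conceptual sketch in Python may help: a task is skipped when the same task has already been run with the same inputs. This is only an illustration of the idea, not how Nextflow implements resuming internally, and `task_key`, `run_cached`, and `count_words` are hypothetical names.

```python
import hashlib
import json

_cache = {}  # maps task key -> stored result (a stand-in for results kept on disk)
calls = []   # records which tasks were actually executed


def task_key(name, inputs):
    """Build a stable key from the task name and its inputs."""
    payload = json.dumps({"task": name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(name, func, inputs):
    """Run func(inputs) unless a result for the same key is already stored."""
    key = task_key(name, inputs)
    if key in _cache:
        return _cache[key]  # cache hit: the work is not repeated
    calls.append(name)
    result = func(inputs)
    _cache[key] = result
    return result


def count_words(text):
    """Example task: count whitespace-separated words."""
    return len(text.split())


first = run_cached("count_words", count_words, "a data driven workflow")
second = run_cached("count_words", count_words, "a data driven workflow")  # reused
third = run_cached("count_words", count_words, "a changed input")          # reruns
```

After these three calls, `count_words` has executed only twice: the second call reuses the stored result because neither the task name nor its input changed.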
Nextflow orchestration
- So far everything should have been running on the user's local machine.
- Discuss where Nextflow looks for scripts and how to activate environments per process/label
- Use profiles for differently labelled processes
- Intro to using Nextflow on a supercomputer: telling Nextflow about Slurm/PBS/other so that it doesn't run jobs on the head node. This is a good place to talk about tmux/screen/etc., and using the host name to set the profile.
- Telling Nextflow to use containers to run your code. Building containers is probably out of scope, though being able to pull them or at least reference Singularity images on Pawsey is in scope.
Containers and environments
- Telling Nextflow to use containers to run your code.
- Using containers from a Singularity image
- Pulling images from Docker Hub or quay.io
- Basics of building a container (use this if people can't install Docker on their own machines).
Nextflow best practices (and advanced things)
- Separating workflow (my_thing.nf) from the config (nextflow.config)
- A CLI using --help
- Comments, comments, documentation, and more comments.
- Modular workflows, reusing previous workflows.
- Error strategy, exit codes, flow control
- Embedding Nextflow version/invocation metadata into your outputs, even if it's just a file called metadata.txt in your results directory.

The above is mostly what we already have, but with a starting point prior to Nextflow, which should hopefully make the learning curve a little easier to deal with.

PaulHancock commented 2 years ago

Paul to do: 1, 5, 6
Nick to do: 2, 3, 4, 6