carpentries-incubator / Pipeline_Training_with_Nextflow

The SMART survey team requested training to use and maintain the existing Nextflow-based workflow that they run on Pawsey/Garrawarla.
https://carpentries-incubator.github.io/Pipeline_Training_with_Nextflow/

Structure of training module #1

Closed by PaulHancock 1 year ago

PaulHancock commented 2 years ago

I suggest we work through the content in the following order (but not necessarily divided into these particular episodes).

  1. Workflows / pipelines (without mentioning Nextflow)
    • Talk about how much of the work that we do for research can be thought of in terms of a workflow or pipeline, how this can be visualised as a flowchart, and how we break things into blocks of work to be done, information/data that is passed between the blocks, and some optional flow control.
    • Get people thinking about how data move through this workflow (a data-driven workflow).
    • Do a small exercise where people take some of their own work and map out a basic workflow, defining the tasks and what data flows between them.
  2. Intro to Nextflow (channels but no processes yet)
    • Talk about all of the above again but in the context of Nextflow.
    • Note that Nextflow is a data-driven workflow manager and thus we spend a lot of time thinking about the links (channels) between the tasks (processes). The way that we define a process may need to change to accommodate the required input/output (or we can change the channels).
    • Start with a workflow that has only channel manipulation and no processes.
    • Show people how to run Nextflow and interpret the interface
    • Do a few exercises that focus on channel manipulation:
      • split/gather/glob/csv/path/of
  3. Nextflow workflows (working with processes)
    • Once they are sick of channel manipulation, introduce a process which creates a file using an input filename and/or echoes something into a file.
    • Show/explain how the work and results directories work, including all the hidden .command.* files in the work directories and how the CLI shows you the start of the work directory name for each process. Also show people the trace.txt file, which contains the full set of process->workdir mappings, for easier debugging.
    • Give a fuller description of the parts of a process including: inputs, outputs, script/shell/exec, publish, when.
    • Introduce DSL2 and show how you can pipe the output -> input. Do an exercise where you have to manipulate the channel before passing it: map/gather/transmute.
    • Introduce the params namespace and show people how to set parameters in the workflow, in the config file, and from the command line.
    • Talk about restarting a workflow and demonstrate the use of cached results.
  4. Nextflow orchestration
    • So far everything should have been running on the user's local machine.
    • Discuss where Nextflow looks for scripts and how to activate environments per process/label
    • Use profiles for differently labelled processes
    • Intro to using Nextflow on a supercomputer: telling Nextflow about Slurm/PBS/other so that it doesn't run jobs on the head node. This is a good place to talk about tmux/screen/etc., and using the host name to set the profile.
    • Telling Nextflow to use containers to run your code. Building containers is probably out of scope, though being able to pull them or at least reference Singularity images on Pawsey is in scope.
  5. Containers and environments
    • Telling Nextflow to use containers to run your code.
    • Using containers from a Singularity image
    • Pulling images from Docker Hub or quay.io
    • Basics of building a container (use this if people can't install Docker on their own machines).
  6. Nextflow best practices (and adv. things)
    • Separating workflow (my_thing.nf) from the config (nextflow.config)
    • A CLI using --help
    • Comments, comments, documentation, and more comments.
    • Modular workflows, reusing previous workflows.
    • Error strategy, exit codes, flow control
    • Embedding Nextflow version/invocation metadata into your outputs, even if it's just a file called metadata.txt in your results directory.

The above is mostly what we have already, but with a starting point prior to Nextflow, which should hopefully make the learning curve a little easier to deal with.
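
As an illustration of item 1 in the outline above (workflows without mentioning Nextflow), here is a minimal Python sketch of a data-driven pipeline: each task is a plain function, and the data passed between tasks determines what runs next. The task names (`load`, `calibrate`, `summarise`) and the record layout are hypothetical and chosen only for this example; they are not taken from the SMART workflow.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical tasks: each one consumes data from the previous task
# and produces data for the next, like boxes joined by arrows in a flowchart.

def load(names: Iterable[str]) -> Iterator[Dict]:
    """Source task: emit one record per input name."""
    for name in names:
        yield {"name": name, "samples": [len(name), 2 * len(name)]}

def calibrate(records: Iterable[Dict]) -> Iterator[Dict]:
    """Per-item task: runs once for each record that arrives."""
    for rec in records:
        yield {**rec, "samples": [s * 0.5 for s in rec["samples"]]}

def summarise(records: Iterable[Dict]) -> Dict:
    """Gather task: waits for all records, then produces one result."""
    collected = list(records)
    return {
        "count": len(collected),
        "total": sum(sum(r["samples"]) for r in collected),
    }

def run_pipeline(names: Iterable[str]) -> Dict:
    # The workflow is just the wiring between tasks.
    return summarise(calibrate(load(names)))

result = run_pipeline(["obs_a", "obs_bb"])
print(result)
```

The point of the sketch is that the tasks know nothing about each other; only `run_pipeline` describes how data flows, which is the same separation the mapping exercise asks people to draw by hand.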

PaulHancock commented 2 years ago

Paul to do: 1, 5, 6. Nick to do: 2, 3, 4, 6.

NickSwainston commented 2 years ago

If we split this into users and developers, it will be worth putting tmux/screen in ep 3. I'll leave it in ep 3 for now, but it's easy to move.

NickSwainston commented 2 years ago

A checklist to keep track of what I've done so far, which also lets me rearrange items as I improve the presentation order:

  1. Workflows / pipelines (without mentioning Nextflow)
    • [x] Talk about how much of the work that we do for research can be thought of in terms of a workflow or pipeline, how this can be visualised as a flowchart, and how we break things into blocks of work to be done, information/data that is passed between the blocks, and some optional flow control.
    • [x] Get people thinking about how data move through this workflow (a data-driven workflow).
    • [x] Do a small exercise where people take some of their own work and map out a basic workflow, defining the tasks and what data flows between them.
  2. Intro to Nextflow (channels but no processes yet)
    • [x] Talk about all of the above again but in the context of Nextflow.
    • [x] Note that Nextflow is a data-driven workflow manager and thus we spend a lot of time thinking about the links (channels) between the tasks (processes). The way that we define a process may need to change to accommodate the required input/output (or we can change the channels).
    • [x] Use a simple example to introduce a process which creates a file using an input filename and/or echoes something into a file.
    • [x] Now that they have context, focus on channel manipulation with no processes.
    • [x] Do a few exercises that focus on channel manipulation:
      • split/gather/glob/csv/path/of
    • [x] Show people how to run Nextflow and interpret the interface
  3. Nextflow workflows (working with processes)
    • [x] Show/explain how the work and results directories work, including all the hidden .command.* files in the work directories and how the CLI shows you the start of the work directory name for each process.
    • [x] Also show people the trace.txt file, which contains the full set of process->workdir mappings, for easier debugging.
    • [x] Give a fuller description of the parts of a process including: inputs, outputs, script/shell/exec, publish, when.
    • [x] Introduce DSL2 and show how you can pipe the output -> input. Do an exercise where you have to manipulate the channel before passing it: map/gather/transmute.
    • [x] Introduce the params namespace and show people how to set parameters in the workflow, in the config file, and from the command line.
    • [x] Talk about restarting a workflow and demonstrate the use of cached results.
    • [x] This is a good place to talk about tmux/screen/etc., and using the host name to set the profile.
  4. Nextflow orchestration
    • [x] So far everything should have been running on the user's local machine.
    • [x] Discuss where Nextflow looks for scripts and how to activate environments per process/label
    • [x] Use profiles for differently labelled processes
    • [x] Intro to using Nextflow on a supercomputer: telling Nextflow about Slurm/PBS/other so that it doesn't run jobs on the head node.
    • [x] Telling Nextflow to use containers to run your code. Building containers is probably out of scope, though being able to pull them or at least reference Singularity images on Pawsey is in scope.
  5. Containers and environments
    • [x] Using containers from a Singularity image
    • [x] Pulling images from Docker Hub or quay.io
    • [x] Basics of building a container (use this if people can't install Docker on their own machines).
  6. Nextflow best practices (and adv. things)
    • [x] Separating workflow (my_thing.nf) from the config (nextflow.config)
    • [x] A CLI using --help
    • [x] Comments, comments, documentation, and more comments.
    • [x] Modular workflows, reusing previous workflows.
    • [x] Error strategy, exit codes, flow control
    • [x] Embedding Nextflow version/invocation metadata into your outputs, even if it's just a file called metadata.txt in your results directory.

The above is mostly what we have already, but with a starting point prior to Nextflow, which should hopefully make the learning curve a little easier to deal with.
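
For the "restarting a workflow and cached results" item in the checklist above, the general idea can be sketched in Python before showing it in Nextflow: a task is re-run only when its name or its inputs change. This is a simplified in-memory model for teaching purposes, not a description of how Nextflow implements resuming; `task_key` and `CachedRunner` are hypothetical names invented for this example.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

def task_key(task_name: str, inputs: Dict[str, Any]) -> str:
    """Build a stable key from the task name and its inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CachedRunner:
    """Runs tasks, reusing a stored result when the key has been seen before."""

    def __init__(self) -> None:
        self.cache: Dict[str, Any] = {}
        self.executed: List[str] = []  # names of tasks that actually ran

    def run(self, task_name: str, func: Callable[..., Any], inputs: Dict[str, Any]) -> Any:
        key = task_key(task_name, inputs)
        if key in self.cache:
            # Same task, same inputs: reuse the earlier result.
            return self.cache[key]
        result = func(**inputs)
        self.cache[key] = result
        self.executed.append(task_name)
        return result

def double(value: int) -> int:
    return 2 * value

runner = CachedRunner()
first = runner.run("double", double, {"value": 21})   # runs the task
second = runner.run("double", double, {"value": 21})  # cache hit, task is skipped
third = runner.run("double", double, {"value": 5})    # new input, runs again
print(first, second, third, runner.executed)
```

A demonstration along these lines makes it easier to explain why changing one input file only re-runs the tasks downstream of it.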

NickSwainston commented 2 years ago

Since I use containers in 4, it is probably best to talk about 5 before 4.

NickSwainston commented 2 years ago

We shouldn't count on the Play with Docker tutorial: I just tried to use it and it was out of capacity.

NickSwainston commented 2 years ago

Add best practices, like giving multiple channel outputs names, and briefly talk about nf-core.