
# Earth Biogenome Project - Pilot Workflow

The primary assembly and annotation workflow for the Earth Biogenome Project (https://www.earthbiogenome.org/) pilot at NBIS.

## Workflow overview

General aim:

```mermaid
flowchart LR
    hifi[/ HiFi reads /] --> data_inspection
    ont[/ ONT reads /] -->  data_inspection
    hic[/ Hi-C reads /] --> data_inspection
    data_inspection[[ Data inspection ]] --> preprocessing
    preprocessing[[ Preprocessing ]] --> assemble
    assemble[[ Assemble ]] --> validation
    validation[[ Assembly validation ]] --> curation
    curation[[ Assembly curation ]] --> validation
```

Current implementation:

```mermaid
flowchart TD
    input[/ Input file/] --> hifi
    input --> hic
    input --> taxonkit[[ TaxonKit name2taxid/reformat ]]
    taxonkit --> goat_taxon[[ GOAT taxon search ]]
    goat_taxon --> busco
    goat_taxon --> dtol[[ DToL lookup ]]
    hifi --> samtools_fa[[ Samtools fasta ]]
    samtools_fa --> fastk_hifi
    samtools_fa --> mash_screen
    hifi[/ HiFi reads /] --> fastk_hifi[[ FastK - HiFi ]]
    hifi --> meryl_hifi[[ Meryl - HiFi ]]
    hic[/ Hi-C reads /] --> fastk_hic[[ FastK - Hi-C ]]
    hic --> meryl_hic[[ Meryl - Hi-C ]]
    assembly[/ Assembly /] --> quast[[ Quast ]]
    fastk_hifi --> histex[[ Histex ]]
    histex --> genescopefk[[ GeneScopeFK ]]
    fastk_hifi --> ploidyplot[[ PloidyPlot ]]
    fastk_hifi --> katgc[[ KatGC ]]
    fastk_hifi --> merquryfk[[ MerquryFK ]]
    assembly --> merquryfk
    meryl_hifi --> merqury[[ Merqury ]]
    assembly --> merqury
    fastk_hifi --> katcomp[[ KatComp ]]
    fastk_hic --> katcomp
    assembly --> busco[[ Busco ]]
    refseq_sketch[( RefSeq sketch )] --> mash_screen[[ Mash Screen ]]
    hifi --> mash_screen
    fastk_hifi --> hifiasm[[ HiFiasm ]]
    hifiasm --> assembly
    assembly --> purgedups[[ Purgedups ]]
    input --> mitoref[[ Mitohifi - Find reference ]]
    assembly --> mitohifi[[ Mitohifi ]]
    assembly --> fcsgx[[ FCS GX ]]
    fcs_fetchdb[( FCS fetchdb )] --> fcsgx
    mitoref --> mitohifi
    genescopefk --> quarto[[ Quarto ]]
    goat_taxon --> multiqc[[ MultiQC ]]
    quarto --> multiqc
    dtol --> multiqc
    katgc --> multiqc
    ploidyplot --> multiqc
    busco --> multiqc
    quast --> multiqc
```

## Usage

```bash
nextflow run -params-file <params.yml> \
    [ -c <custom.config> ] \
    [ -profile <profile> ] \
    NBISweden/Earth-Biogenome-Project-pilot
```

where:

- `-params-file <params.yml>` supplies the workflow parameters from a YAML (or JSON) file (see below).
- `-c <custom.config>` optionally supplies additional Nextflow configuration, such as resource overrides (see the sketch below).
- `-profile <profile>` optionally selects one or more configuration profiles, e.g. `uppmax` or `dardel`.
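
Where the defaults are insufficient, a small `<custom.config>` can override process settings. The following is a minimal sketch; the process selector `HIFIASM` is a hypothetical example name, not necessarily one defined by this workflow (the real process names live under `modules/`).

```nextflow
// Hypothetical custom.config sketch - not part of this repository.
// Overrides the resources of one process; the selector name is assumed.
process {
    withName: 'HIFIASM' {
        cpus   = 32
        memory = 256.GB
        time   = 48.h
    }
}
```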

## Workflow parameter inputs

Mandatory:

Optional:

Software-specific:

Tool-specific settings are provided by supplying values to specific keys, or an array of settings, under a tool name. The input to `-params-file` would look like this:

```yaml
input: assembly.yml
outdir: results
fastk:
  kmer_size: 31
genescopefk:
  kmer_size: 31
hifiasm:
  opts01: "--opts A"
  opts02: "--opts B"
busco:
  lineages: 'auto'
```

UPPMAX and PDC cluster-specific:

## Workflow outputs

All results are published to the path assigned to the workflow parameter `outdir`.

TODO: List the folder contents of the results directory.

## Customization for Uppmax

A custom profile named `uppmax` is available to run this workflow specifically on UPPMAX clusters. The process executor is `slurm`, so jobs are submitted to the Slurm queue manager. All jobs submitted to Slurm must have a project allocation, which is automatically added to the `clusterOptions` in the `uppmax` profile. All UPPMAX clusters have node-local disk space for computation, which avoids heavy input/output over the network (something that slows down the cluster for everyone). The path to this disk space is provided by the `$SNIC_TMP` variable and is used by the `process.scratch` directive in the `uppmax` profile. Lastly, the profile enables Singularity, so all processes are executed within Singularity containers. See `nextflow.config` for the profile specification.
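
For illustration, a profile with these properties might look like the sketch below; the authoritative definition lives in this repository's `nextflow.config` and may differ in detail.

```nextflow
// Sketch only: an uppmax-style profile as described above.
profiles {
    uppmax {
        process {
            executor       = 'slurm'
            clusterOptions = { "-A $params.project" }  // project allocation
            scratch        = '$SNIC_TMP'               // node-local disk space
        }
        singularity.enabled = true
    }
}
```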

The profile is enabled using the `-profile` option to Nextflow:

```bash
nextflow run -profile uppmax <nextflow_script>
```

A NAISS compute allocation should also be supplied using the `--project` parameter.

## Customization for PDC

A custom profile named `dardel` is available to run this workflow specifically on the PDC cluster Dardel. The process executor is `slurm`, so jobs are submitted to the Slurm queue manager. All jobs submitted to Slurm must have a project allocation, which is automatically added to the `clusterOptions` in the `dardel` profile. Calculations are performed in the scratch space provided by the `$PDC_TMP` variable, which resides on the Lustre file system rather than on node-local storage, and is used by the `process.scratch` directive in the `dardel` profile. Lastly, the profile enables Singularity, so all processes are executed within Singularity containers. See `nextflow.config` for the profile specification.
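
The corresponding sketch for Dardel follows the same pattern, with the same caveat that `nextflow.config` holds the real definition:

```nextflow
// Sketch only: same pattern as the uppmax profile, but scratch points at
// the Lustre-backed $PDC_TMP rather than node-local disk.
profiles {
    dardel {
        process {
            executor       = 'slurm'
            clusterOptions = { "-A $params.project" }
            scratch        = '$PDC_TMP'
        }
        singularity.enabled = true
    }
}
```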

The profile is enabled using the `-profile` option to Nextflow:

```bash
nextflow run -profile dardel <nextflow_script>
```

A NAISS compute allocation should also be supplied using the `--project` parameter.

## Workflow organization

The workflows in this repository manage the execution of your analyses from beginning to end:

```
workflow/
 | - .github/                        GitHub data such as actions to run
 | - assets/                         Workflow assets such as test samplesheets
 | - bin/                            Custom workflow scripts
 | - configs/                        Configuration files that govern workflow execution
 | - dockerfiles/                    Custom container definition files
 | - docs/                           Workflow usage and interpretation information
 | - modules/                        Process definitions for tools used in the workflow
 | - subworkflows/                   Custom workflows for different stages of the main analysis
 | - tests/                          Workflow tests
 | - main.nf                         The primary analysis script
 | - nextflow.config                 General Nextflow configuration
 \ - modules.json                    nf-core file which tracks modules/subworkflows from nf-core
```