This pipeline is used by the Canadian Biogenome project (http://earthbiogenome.ca) to generate genome assemblies from a variety of species.
The pipeline is built using Nextflow (https://www.nextflow.io/).
In short, each step of the pipeline is included in a module. Most modules use a single container, which makes it much easier to maintain and update software dependencies. Some modules rely on locally installed tools. Future updates of the pipeline may improve portability.
Many of the modules available in this pipeline were developed by members of the nf-core/genomeassembler group. If you want to participate, feel free to join the community.
The pipeline was developed to take as input PacBio files (bam, from Sequel II or Revio machines) and Hi-C files (fastq.gz). It also supports the inclusion of nanopore data and short reads for polishing.
The pipeline also requires the species' NCBI Taxonomy ID, which can be found on GoaT (https://goat.genomehubs.org) or on NCBI.
The pipeline generates many files, including intermediate files; most are self-explanatory.
An overview of the pipeline is shown in the subway map below. Some parts of the pipeline may have been commented out in this version because they relied on locally installed software. The code is still available in case you want to install the software locally and try it out.
By default, the pipeline uses hifiasm with PacBio data for the assembly and, if Hi-C data is available, YAHS for the scaffolding. Other assemblers and scaffolders are available within the pipeline; to switch to one of them, edit the nextflow.config file.
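One quick way to locate the assembler and scaffolder settings is to search nextflow.config for the default tool names. This is only a sketch: the helper name find_tool_settings is hypothetical, and it assumes the defaults are spelled "hifiasm" and "yahs" in the file (matched case-insensitively). The actual parameter names are whatever the config file defines.

```shell
# Hypothetical helper: print the numbered lines of a config file that
# mention the default assembler (hifiasm) or scaffolder (YAHS).
# Assumes the tool names appear literally in the file (any letter case).
find_tool_settings() {
    grep -n -i -E 'hifiasm|yahs' "${1:-nextflow.config}"
}

# Run from the repository root; skipped when no config file is present.
if [ -f nextflow.config ]; then
    find_tool_settings nextflow.config || echo "no hifiasm/YAHS entries found"
fi
```

The line numbers printed by grep indicate where to look when editing the file.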
Software used that would require local installation:
Software that relies on locally downloaded files / databases:
Figure: Overview of the Canadian Biogenome project assembly pipeline
To run this pipeline, you need nextflow and conda or singularity installed on your system.
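Before launching, it can help to confirm that these tools are on your PATH. The sketch below uses a hypothetical helper named have_tool and only checks that the commands exist. It does not check versions.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Nextflow is always needed.
if have_tool nextflow; then
    echo "nextflow: found"
else
    echo "nextflow: missing"
fi

# One of conda or singularity is enough.
if have_tool conda || have_tool singularity; then
    echo "conda or singularity: found"
else
    echo "conda or singularity: missing"
fi
```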
A set of test data is available in this repo so that you can test the pipeline with a single command:
nextflow run bcgsc/Canadian_Biogenome_Project -latest -r V2 -profile conda
The outputs are organized into several self-explanatory subfolders.
Clone the repository in your local environment:
git clone https://github.com/bcgsc/Canadian_Biogenome_Project.git
cd Canadian_Biogenome_Project
Modify the nextflow.config file:
Indicate the location of the input file
Indicate the required information
Launch the pipeline
nextflow run main.nf -profile singularity
The pipeline was originally written by @scorreard with help and input from:
Members of the Jones lab (Canada's Michael Smith Genome Sciences Centre, Vancouver, Canada).
Members of the Earth Biogenome Project and other affiliated projects.
Members of the nf-core / nextflow community.
The PacBio data is a subset of COVID sequences obtained with these commands:
wget https://downloads.pacbcloud.com/public/dataset/HiFiViral/Jan_2022/m64187e_211217_130958.hifi_reads.bam
samtools view -b -s 123.001 m64187e_211217_130958.hifi_reads.bam > subset_covid_hifi.bam
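In samtools view, the -s value packs two settings into one number: the integer part seeds the random number generator, and the part after the decimal point is the fraction of reads to keep. So 123.001 means seed 123 and roughly 0.1% of the reads. A small sketch that splits such a value, using a hypothetical helper named describe_subsample:

```shell
# Hypothetical helper: split a samtools "-s SEED.FRACTION" value into its parts.
describe_subsample() {
    seed=${1%%.*}          # text before the first dot: RNG seed
    fraction="0.${1#*.}"   # text after the first dot: fraction of reads kept
    echo "seed=$seed fraction=$fraction"
}

describe_subsample 123.001   # prints: seed=123 fraction=0.001

# Optional check: count the reads in the subset when samtools and the file are available.
if command -v samtools >/dev/null 2>&1 && [ -f subset_covid_hifi.bam ]; then
    samtools view -c subset_covid_hifi.bam
fi
```

Reusing the same seed on the same input reproduces the same subset.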
The Hi-C data was downloaded from one of the nf-core test datasets:
wget -O test_1.fastq.gz "https://github.com/nf-core/test-datasets/blob/modules/data/genomics/sarscov2/illumina/fastq/test_1.fastq.gz?raw=true"
wget -O test_2.fastq.gz "https://github.com/nf-core/test-datasets/blob/modules/data/genomics/sarscov2/illumina/fastq/test_2.fastq.gz?raw=true"
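After downloading, a quick integrity check confirms that the files really are gzip data rather than, say, a saved HTML page. This is a minimal sketch: the helper name is_valid_gzip is hypothetical, and the loop assumes the files sit in the current directory under names starting with test_1.fastq.gz and test_2.fastq.gz.

```shell
# Hypothetical helper: succeed if the file is a readable gzip stream.
# Reading from stdin avoids any dependence on the file's suffix.
is_valid_gzip() {
    gzip -t < "$1" 2>/dev/null
}

# The trailing * also matches names that kept a "?raw=true" suffix.
for f in test_1.fastq.gz* test_2.fastq.gz*; do
    [ -e "$f" ] || continue
    if is_valid_gzip "$f"; then
        echo "ok: $f"
    else
        echo "not a valid gzip file: $f"
    fi
done
```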