A Snakemake pipeline to generate and quality-control (QC) genome (co)assemblies from single-cell (e.g., G&T-Seq) or metagenome data.
THIS PIPELINE DOES NOT TRIM READS. IT EXPECTS THE INPUT READS TO HAVE ALREADY BEEN ADAPTER-TRIMMED.
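For example, paired-end libraries could be prepared beforehand with a trimmer such as cutadapt (a sketch only; the adapter sequence and file names are placeholders, and any equivalent adapter trimmer works):

cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o Sample1_gDNA_R1.fq.gz -p Sample1_gDNA_R2.fq.gz \
    raw_R1.fq.gz raw_R2.fq.gz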
The input to this pipeline is a comma-separated file. The first line specifies the name of the coassembly; each following line contains four comma-separated values describing one library: sample name, library type (gDNA or cDNA), R1 reads file, and R2 reads file.
Sample names must not contain periods.
Example input file: input.csv
COASSEMBLY_NAME
Sample1,gDNA,Sample1_gDNA_R1.fq.gz,Sample1_gDNA_R2.fq.gz
Sample1,cDNA,Sample1_cDNA_R1.fq.gz,Sample1_cDNA_R2.fq.gz
Sample2,gDNA,Sample2_gDNA_R1.fq.gz,Sample2_gDNA_R2.fq.gz
Sample2,cDNA,Sample2_cDNA_R1.fq.gz,Sample2_cDNA_R2.fq.gz
Sample3,gDNA,Sample3_gDNA_R1.fq.gz,Sample3_gDNA_R2.fq.gz
Sample4,cDNA,Sample4_cDNA_R1.fq.gz,Sample4_cDNA_R2.fq.gz
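For reference, a minimal Python sketch of how such a file can be parsed (illustrative only, not the pipeline's actual parsing code):

import csv

def parse_input(path):
    """Parse the input CSV: the first line holds the coassembly name,
    each remaining line holds (sample, library_type, R1, R2)."""
    with open(path) as fh:
        reader = csv.reader(fh)
        coassembly = next(reader)[0]
        libraries = []
        for row in reader:
            if not row:
                continue
            sample, lib_type, r1, r2 = row
            assert "." not in sample, "sample names must not contain periods"
            assert lib_type in ("gDNA", "cDNA"), "library type must be gDNA or cDNA"
            libraries.append({"sample": sample, "type": lib_type, "r1": r1, "r2": r2})
    return coassembly, libraries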
Example config file: config.yaml
# File listing the input reads; better to set this at the command line with "--config input=XXX.csv"
# Should use absolute paths
input:
# Mode to run SPAdes in: "sc" for single-cell or "meta" for metagenome
assembly_type: sc # or "meta"
# Optional tools to run
run_checkm: true
run_busco: true
# SPAdes parameters
kmers: "21,33,55,77"
min_scaffold_length: 1000 # for bbtools reformat
# BUSCO parameters
busco_version: 3
busco_database: XXXXX
# MetaBat2 parameters
metabat2_min_contig: 1500
# Tiara parameters
tiara_min_length: 1000
# EukRep parameters
eukrep_min_length: 1000
# Blobtools parameters
diamond_database: XXXXX
megablast_database: XXXXX
nodesdb: XXXXX
n_chunks: 4 # split the FASTA into N chunks for the megablast search; sequences are assigned to chunks randomly to balance their sizes, since the input FASTA is sorted by sequence length
# CAT parameters
cat_database: XXXXX
taxonomy_dir: XXXXX
diamond_path: XXXXX
# pr2 database
pr2_database: XXXXX
# cleanup options
cleanup_spades: true
cleanup_cat: true
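To illustrate how these config values are consumed, below is a simplified sketch of a SPAdes assembly rule; the rule name, file paths, and thread count are illustrative assumptions, not the pipeline's actual rule:

rule spades:
    input:
        r1="reads/{coassembly}_R1.fq.gz",
        r2="reads/{coassembly}_R2.fq.gz",
    output:
        "spades/{coassembly}/scaffolds.fasta"
    params:
        # select single-cell or metagenome mode from the config
        mode="--sc" if config["assembly_type"] == "sc" else "--meta",
        kmers=config["kmers"],
    threads: 16
    shell:
        "spades.py {params.mode} -k {params.kmers} "
        "-1 {input.r1} -2 {input.r2} -t {threads} "
        "-o spades/{wildcards.coassembly}"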
Example command to launch snakemake:
snakemake -p --config input=input.csv -j 20 --retries 1 --latency-wait 60 --snakefile CoassemblyPipeline.smk --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} --mem={cluster.memory} --job-name={cluster.J} --time={cluster.time} --exclude={cluster.exclude} -o slurm_logs/slurm.{cluster.J}.%j.out"
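The --cluster-config file referenced in the command supplies the per-rule SLURM resources used in the sbatch template ({cluster.partition}, {cluster.c}, etc.); a minimal sketch of what cluster.json could look like, where the partition, rule names, and resource values are placeholder assumptions to adapt:

{
    "__default__": {
        "partition": "general",
        "c": "4",
        "memory": "16G",
        "J": "coassembly",
        "time": "24:00:00",
        "exclude": ""
    },
    "spades": {
        "c": "16",
        "memory": "128G",
        "J": "spades"
    }
}

Rule-specific entries (here, "spades") override the "__default__" values for that rule's jobs.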
This pipeline runs SPAdes, BBTools reformat, optionally CheckM and BUSCO, MetaBAT2, Tiara, EukRep, BlobTools (with DIAMOND and megablast searches), CAT, and a search against the PR2 database (see the parameters in the example config above).
The Singularity definition file Singularity.def contains most of the required software. Additional software that is not included in the definition file (due to compatibility issues):
Example DAG