A bioinformatics package that includes workflows to generate BAMs and a VCF file from multiplexed GBS paired-end fastq files.
This tool utilizes the workflow management system Snakemake and lets you run the complete pipeline with just a few command-line calls, even on a computing cluster.
An Example Workflow:
ATTENTION: Already have demultiplexed fastqs? Try using the Demux Pipeline.
This quick-use guide assumes you have completed the installation. If not, proceed first to the Installation section.
Once your environment is set up you can start running the pipeline!
Start at the beginning with Generating BAMs, or if you already have BAMs you want to process, move on to Generating VCFs!
Generating BAMs:
Take paired-end fastq files (R1 and R2) and produce sorted BAM files and BAM index files.
Create a samplesheet and save it on the cluster; it is generally helpful to put it in the same directory as the fastq files. For the format, see the samplesheet section of the Supporting Files ReadMe.
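For illustration only, a samplesheet could be created directly on the cluster as sketched below; the column layout here is hypothetical, and the authoritative format is documented in the Supporting Files ReadMe (see workflow/resources/example_samplesheet.csv).
# hypothetical columns for illustration; check the Supporting Files ReadMe for the real layout
$ cat > samplesheet.csv <<'EOF'
sample,barcode
sample_01,AACGTT
sample_02,GGTCAA
EOF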
Change directory to where the fastq files are located:
$ cd directory/to/samples.fastq.gz
Update the config/config.yaml file. In the terminal, input the following command:
$ nano ~/gbs/GBS_snakemake_pipeline/config/config.yaml
Manually change the fields for your project (note that the raw fastq files need to be named _R1.fastq.gz and _R2.fastq.gz):
sample_read1: "/input/absolute/path/to/sample_R1.fastq.gz"
samplesheet: "/input/absolute/path/to/samplesheet.txt or samplesheet.csv"
barcodefile: "input/path/to/workflow/resources/barcodes/barcodes.txt"
Press ctrl-o to save (press enter to confirm), then ctrl-x to exit the nano editor.
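Before submitting, an optional sanity check is to confirm that every path you entered in the config actually exists. A minimal shell loop, using the placeholder paths above (substitute your own values):
$ for f in /input/absolute/path/to/sample_R1.fastq.gz \
      /input/absolute/path/to/samplesheet.txt \
      /input/path/to/workflow/resources/barcodes/barcodes.txt; do
      [ -e "$f" ] && echo "OK: $f" || echo "MISSING: $f"
  done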
Run the pipeline by submitting the process via shell script to the GridEngine queue in the terminal.
$ qsub ~/gbs/GBS_snakemake_pipeline/workflow/gbs.sh
Done!
Notes:
You can check that the job is running by entering qstat -f in the terminal. Look for gbs.sh in the list. If it is not there, check the gbs.sh.e(jobnumber) log file in the directory you submitted from to see why.

Generating VCFs:
Take a list of BAM directories, and produce a single VCF file or multiple VCF files from all the BAM files included in those directories.
Update the config/config_vcf.yaml file for the 4 fields.
$ nano ~/gbs/GBS_snakemake_pipeline/config/config_vcf.yaml
sample_directories:
- "/absolute/path/to/some/sorted_bams"
- "/absolute/path/to/some/more/sorted_bams"
- "/absolute/path/to/even/more/sorted_bams"
# produce a single whole genome VCF (False) or split into 1 VCF for each chromosome (True) -- case sensitive
split: True
outfile: "/absolute/path/to/output/vcf/directory"
reference: "/absolute/path/to/reference_genome.fasta"
Press ctrl-o to save, then ctrl-x to exit the nano editor.
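Optionally, you can confirm that each directory listed under sample_directories contains intact BAM files before submitting. samtools (installed with the gbs environment) provides a quickcheck subcommand for this; the path below is a placeholder for one of your own directories:
$ samtools quickcheck -v /absolute/path/to/some/sorted_bams/*.bam && echo "all BAMs intact"
quickcheck prints the name of any BAM that is truncated or malformed, so no output plus the success message means the directory is ready.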
(optional) Create a directory for your VCF output (the same directory name as the config_vcf.yaml output path) and cd into that directory so your snakemake output log lives with the VCF file. The name of the VCF will be whatever you name the output folder!
$ mkdir vcf_output/ && cd vcf_output
Run the pipeline by submitting the process via shell script to the GridEngine queue in the terminal.
$ qsub ~/gbs/GBS_snakemake_pipeline/workflow/vcf.sh
Done!
(optional) Process the VCF file: check out the VCF_Analysis_tools ReadMe for a couple of tools to process and visualize your VCF.
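As a quick first look, independent of those tools, bcftools (installed with the gbs environment) can summarize the output VCF; the filename below is a placeholder for whatever your output folder was named:
$ bcftools stats vcf_output.vcf > vcf_output.stats.txt
$ grep "number of SNPs:" vcf_output.stats.txt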
Notes:
You can check that the job is running by entering qstat in the terminal. Look for vcf.sh in the list. If it is not there, check the vcf.sh.e(jobnumber) log file to see why.

Installation:
In your home directory on the cluster/local computer, create a directory called gbs for the pipeline to live in.
$ mkdir gbs
$ cd gbs/
Clone the repository into that gbs folder you created on the cluster. If you don't have git, install it with conda install -c anaconda git on your cluster account.
$ git clone https://github.com/elderberry-smells/GBS_snakemake_pipeline.git
Press enter, and your local clone should be created with the following pathways for the pipeline:
~/gbs/GBS_snakemake_pipeline/snakefiles
~/gbs/GBS_snakemake_pipeline/config/
~/gbs/GBS_snakemake_pipeline/workflow/envs/
~/gbs/GBS_snakemake_pipeline/workflow/rules/
~/gbs/GBS_snakemake_pipeline/workflow/scripts/
~/gbs/GBS_snakemake_pipeline/workflow/resources/
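To confirm the clone landed where expected, one quick check is to list the workflow directory:
$ ls ~/gbs/GBS_snakemake_pipeline/workflow
You should see the envs, rules, scripts, and resources directories from the list above.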
Install the GBS snakemake environment on your computer using the workflow/envs/gbs-spec-file.txt file from the repo you downloaded, then activate it.
$ conda create --name gbs --file ~/gbs/GBS_snakemake_pipeline/workflow/envs/gbs-spec-file.txt
$ conda activate gbs
You should now see (gbs) on the left side of your command line.
4a. Install Novocraft 3.10 if you don't have it (https://www.novocraft.com/support/download/). Move the novocraft directory into the gbs environment's bin and add it to your PATH:
$ mv ~/gbs/GBS_snakemake_pipeline/workflow/resources/novocraft ~/miniconda3/envs/gbs/bin
$ nano ~/.bashrc
Add the following line to your .bashrc, then save and exit:
export PATH="~/miniconda3/envs/gbs/bin/novocraft":$PATH
Make the novosort file executable by changing its permissions:
$ chmod 777 ~/miniconda3/envs/gbs/bin/novocraft/novosort
Everything should now be installed to run the pipeline. All of these dependencies should be installed and callable by typing their names in the terminal:
python=3.6
snakemake=4.0
trimmomatic=0.39
bwa=0.7.17
samtools=1.9
bcftools=1.8
Novosort=2.00.00
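As a quick sanity check (not part of the pipeline itself), you can confirm each one is on your PATH with a short shell loop:
$ for tool in python snakemake trimmomatic bwa samtools bcftools novosort; do
      command -v "$tool" >/dev/null && echo "$tool: found" || echo "$tool: MISSING"
  done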
If you have data and a samplesheet to run, test to see if the snakemake process is working
Update the config.yaml file to direct the pipeline to the resources it needs.
$ nano ~/gbs/GBS_snakemake_pipeline/config/config.yaml
Invoke the snakemake process in test mode to see if outputs are being called correctly.
$ snakemake -s ~/gbs/GBS_snakemake_pipeline/Snakefile -np
The snakemake process should show you everything it intends to do with the samples listed in the samplesheet, and how many processes will run for each module/rule.
example:
Job counts:
count jobs
1 all
184 bwa_map
1 demultiplex
184 novosort
184 trimmomatic
1 fastqc
1 multiqc
1 zipup_demux
1 zipup_unmatched
557
A list of user inputs to run the program can be found in the Supporting Files ReadMe.
This includes the config files, samplesheets, and barcode files.
Refer to the docs for formatting of the files.
Examples for samplesheets can be found in workflow/resources/example_samplesheet.csv
Brief documentation on each step in this pipeline can be found in the Features ReadMe
The protocol for library construction is also included in the resources folder
© His Majesty the King in Right of Canada, represented by the Minister of Agriculture and Agri-Food Canada 2020