fabfabBesnard / Parallel-Variant-Calling

0 stars 1 forks source link

Parallel genomic variant calling and prediction of functional impacts

Improve and speed-up your forward genetics by listing all mutations in several samples at a time !

Content

  1. Introduction
  2. Pipeline overview
  3. Requirements
  4. General Usage
  5. Input files and parameters
  6. Example

Introduction

This pipeline is designed to automatically provide the most exhaustive and accurate possible list of genes affected by genomic variations (e.g. natural polymorphisms, mutations) specific to a sample's DNA, using Illumina paired-end sequencing data.

The output is a user-friendly tsv table that can be parsed and filter with classical spreadsheet software (LibreOffice, Excel, ...). This table is sorted by gene and predicts functional impact(s) on the gene of the identified genomic variation(s) to help biologists find the best candidates genes modified in the sample(s) provided.

The two main strengths of this pipeline are:

The pipeline is implemented in Nextflow: it's very easy to install and allows to monitor the completion of all processes of the pipeline, can be deployed in clusters/clouds for parallel computing, it ensures reproducible analysis (simple configuration, supports Docker technology, keeps track of command lines and parameters), promotes efficient re-run and debugs, generates reports.

In theory, it can be used on every organisms for which a reference genome and annotation files are available (flexible input data provided by the user). Organisms in which the pipeline has been tested: Arabidopsis thaliana, Physcomitrium patens.

Pipeline overview (technical description)

This pipeline has 3 major parts:

Flowchart

Installation

Just get the present git repository and execute the VariantCaller.nf script using Nextflow with the options and paramaters you need. Check the General Usage section to see how to execute correctly the script : https://github.com/Romumrn/Pipeline_variant_RDP#general-usage.

Requirements

NextFlow

The pipeline script can be executed using nextflow. You can find all the information you need on the links below : https://www.nextflow.io/docs/latest/getstarted.html

note

As explained in the link, we recommand to place the nextflow binary executable file at a location accessible to your $PATH. Alternatively, you can permanently edit your $PATH. For example, in a bash shell, execute: echo 'export PATH=$PATH:/path/to/nextflow' >> ~/.bash_profile (in a sh and ksh shell, echo 'export PATH=$PATH:/path/to/nextflow' >> ~/.profile)

Conda

In order to let NextFlow use the virtual environnement for the pipeline with the docker_conda profile, you need to have anaconda. You can find all the information you need on the links below : https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation

Linux

https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Windows

https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html

macOS

https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html

note

Be sur that the NextFlow script is executed in a conda environment : The prompt is supposed to changed to (base) or the conda command is supposed to be recognize by your sheel.

Docker

The second tool you need is Docker. You can find all the information you need on the links below : https://docs.docker.com/engine/install/

Input data

The sequencing data are typically Illumina paired-end sequencing fastq files generated from the genomic DNA of a unique strain. Retained genomic variations are only homozygous positions. Both haploid and diploid organisms can be studied.

The sequence of the reference genome of the organism is required at the fasta format, annotation file is required at the gff format.

General Usage

So far, we have implemented a single workflow called VariantCaller.nf. This pipeline can be called using the following command:

nextflow run -profile [psmn,singularity,docker_conda] VariantCaller.nf \
--reads "reads/*_{1,2}.fq.gz" \
--genomefasta genome.fa \
--annotation genome_annotation.gff \
--outdir My_analyse \

Note

if nextflow is not contained in a folder of your $PATH, modify the start of the above command line with ./path/to/nextflow

We recommand to run nextflow using a screen or tmux terminal to avoid killing the workflow if the terminal closes by accident.

Example with tmux :

# Open a new terminal
tmux
# run nextflow
nextflow run -c -profile
# "Detach" terminal (the terminal vanishes from the screen but do not stop, so the running process is not killed)
ctrl+B (simultaneous keys) then type 'd'
# "Attach" terminal (re-open the remote terminal session which is still active)
tmux attach

Input files and parameters

Configuration file

By default, the script use nextflow.config as confifuration file . This file needs to be in the same folder as VariantCaller.nf in order to work correctly.

note

Currently, if you execute the pipeline in local mode, use the docker_conda profile using the -profile option. This one use Conda and Docker to execute the pipeline in corrects environnements when it calls the differents softwares.

Main parameters

or

Important note :

Some databases are already included in our snpEff docker image : Arabidopsis thaliana, _Physcomitrellapatens, Caenorhabditis elegans, Caenorhabditis briggsae, Populus trichocarpa, Zea mays, Drosophila melanogaster, Saccharomyces cerevisiae and _Schizosaccharomycespombe. For these species you do not need to provide an annotation file: instead, use the option --annotationname followed by the organism's name recognized by snpEff:

  • --annotationname : + name (string). Name of the organism (with a underscore instead of space) either Arabidopsis_thaliana, Physcomitrella_patens, Caenorhabditis_elegans, Caenorhabditis_briggsae, Populus_trichocarpa, Saccharomyces_cerevisiae, Zea_mays, Drosophila_melanogaster, Schizosaccharomyces_pombe

If your model organism is not already included, you can provide an annotation file. Warning ! It can be risky because annotation files can be not well formatted and not supported by the snpEff program.

Optional parameters:

Example

Command line :

./nextflow run script/VariantCaller.nf -profile docker_conda \
--reads "/home/rmarin/Mydata/V30001743*{1,2}.fq.gz" \
--genomefasta Mydata/Arabidopsis_thaliana.TAIR10.31.dna.toplevel.fa \
--vqsrfile Mydata/1001genomes_snp-short-indel_only_ACGTN.vcf.gz \
--annotationname 'Arabidopsis_thaliana' \
--outdir My_analyse

with this command line, the beginning of the configuration file specifying the main parameters should look like:

Configuration file:

// Configuration file 

params {
  // Path to directory who contain script. For example /Pipeline_variant_RDP/script.nf'
  scriptdir = 'script' 

  // Full path to file of reference genome (.fa or .fasta or .gz)
  genomefasta = false //given in the command line

  //Full path to directory and name of reads in fastq.gz. You can use a pattern to select several files. Example : Sequence*_{1,2}.fastq
  reads = false //given in the command line

  //You can provide your own annotation file with `--annotationgff` or you can use snpeff database. 
  //Only for Physcomitrella patens or Arabidopsis thaliana ! In this case use `--annotationname`
  annotationgff = false // Full path to file of annotation genome (.gff)
  annotationname = false // [ 'Physcomitrella_patens' , 'Arabidopsis_thaliana', or false ]

  //Number of chromosome copy. Default: 1 (1 or 2)
  ploidy = 2 // 1 or 2, set to 2 for Arabidopsis thaliana here

  //You can provide a reference of good variant file (.vcf) in order to apply a variant recalibartion score. 
  //For exemple [Arabidopsis_thaliana]( https://1001genomes.org/data/GMI-MPI/releases/v3.1/)
  vqsrfile = false // given in command line
  vqsrrate = 99.0

  sampletable = false
  outdir = 'results' //overwritten by command line argument
  minglobalqual = 200 // per cohort
  mindepth = 4 // per sample
  sample = false

  help = false
}

Corresponding organization of input files, scripts, and nextflow executable:

.
├── Mydata
│   ├── 1001genomes_snp-short-indel_only_ACGTN.vcf
│   ├── Arabidopsis_thaliana.TAIR10.31.dna.toplevel.fa
│   ├── V300017433_L4_B5GARAwyvRAAAAABA-501_1.fq.gz
│   ├── V300017433_L4_B5GARAwyvRAAAAABA-501_2.fq.gz
│   ├── V300017433_L4_B5GARAwyvRAAABABA-502_1.fq.gz
│   └── V300017433_L4_B5GARAwyvRAAABABA-502_2.fq.gz
├── script
│   ├── breakdancer2vcf.py
│   ├── extract_specific.py
│   ├── extract_specific_SV.py
│   ├── extractSplitReads_BwaMem
│   ├── extractsvlen.py
│   ├── fastq_sample.py
│   ├── graph_qual_fromraw copy.py
│   ├── graph_qual_fromraw.py
│   ├── splitfa.py
│   ├── testmappingbowtie2.nf
│   ├── testmapping.nf
│   ├── VariantCaller.config
│   └── VariantCaller.nf
└── nextflow