A Nextflow pipeline for Variant Calling Analysis with NGS RNA-Seq data based on GATK best practices.
Install Nextflow by using the following command:
curl -s https://get.nextflow.io | bash
Download the Docker image with this command (optional):
docker pull cbcrg/callings-nf:gatk4
Launch the pipeline execution with the following command:
nextflow run CRG-CNAG/CalliNGS-NF -profile docker
Note: the Docker image contains all the required dependencies. Add -profile docker to the example command lines shown below to enable containerised execution.
RNA sequencing (RNA-seq) data, in addition to expression information, can be used to obtain somatic variants present in the genes of the analysed organism. The CalliNGS-NF pipeline processes RNAseq data to obtain small variants (SNVs), single nucleotide polymorphisms (SNPs) and small INDELs (insertions, deletions). The pipeline is an implementation of the GATK best practices for variant calling on RNAseq and includes all major steps of the analysis.
In addition to the GATK best practices, the pipeline includes steps to compare the obtained SNVs with known variants and to calculate allele-specific counts for the overlapping SNVs.
The CalliNGS-NF pipeline requires the following input files:
RNAseq reads, *.fastq
Genome assembly, *.fa
Known variants, *.vcf
Denylisted regions of the genome, *.bed
The RNAseq read file names should match the following naming convention: sampleID{1,2}_{1,2}.extension
where extension is the read file name extension, e.g. fq, fq.gz, fastq.gz, etc.
Example: ENCSR000COQ1_2.fastq.gz
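The naming convention can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: `parse_read_name` is a hypothetical helper, and it assumes that the digit before the underscore is a replicate number and the digit after it is the read-pair number. That reading is suggested by the default `rep1_{1,2}.fq.gz` file name but is an assumption here.

```bash
# Hypothetical helper: split a read file name of the form
# sampleID{1,2}_{1,2}.extension into its parts.
# Prints "sample replicate pair extension"; returns 1 if the name does not fit.
# Assumption: first digit = replicate, second digit = read pair.
parse_read_name() {
    name=$(basename "$1")
    case "$name" in
        *.*) ;;                 # an extension is required
        *) return 1 ;;
    esac
    stem=${name%%.*}            # text before the first dot, e.g. ENCSR000COQ1_2
    ext=${name#*.}              # text after the first dot, e.g. fastq.gz
    case "$stem" in
        *[12]_[12]) ;;          # must end with <1|2>_<1|2>
        *) return 1 ;;
    esac
    pair=${stem##*_}            # digit after the last underscore
    prefix=${stem%_*}           # everything before the last underscore
    sample=${prefix%?}          # prefix without its final digit
    replicate=${prefix#"$sample"}   # the final digit of the prefix
    [ -n "$sample" ] || return 1
    printf '%s %s %s %s\n' "$sample" "$replicate" "$pair" "$ext"
}

parse_read_name ENCSR000COQ1_2.fastq.gz   # prints: ENCSR000COQ 1 2 fastq.gz
```

The function uses only POSIX parameter expansion, so it should behave the same under sh and bash; its variables are global, which is acceptable for a one-off check.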
--reads
Location of the reads FASTQ file(s). Default: $baseDir/data/reads/rep1_{1,2}.fq.gz
Example:
$ nextflow run CRG-CNAG/CalliNGS-NF --reads '/home/dataset/*_{1,2}.fq.gz'
--genome
The genome file; it should end in .fa. Default: $baseDir/data/genome.fa
Example:
$ nextflow run CRG-CNAG/CalliNGS-NF --genome /home/user/my_genome/human.fa
--variants
The known variants file; it should end in .vcf or .vcf.gz. Default: $baseDir/data/known_variants.vcf.gz
Example:
$ nextflow run CRG-CNAG/CalliNGS-NF --variants /home/user/data/variants.vcf
--denylist (formerly --blacklist)
The denylisted regions file; it should end in .bed. Default: $baseDir/data/denylist.bed
Example:
$ nextflow run CRG-CNAG/CalliNGS-NF --denylist /home/user/data/denylisted_regions.bed
--results
The directory where the results are stored. Default: results
Example:
$ nextflow run CRG-CNAG/CalliNGS-NF --results /home/user/my_results
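The options above are shown one at a time, but they can be given together in a single command. The sketch below only prints such a command line so it can be inspected before running; `callings_cmd` is a hypothetical helper, the paths are the ones used in the examples above, and it assumes paths that contain no single quotes.

```bash
# Hypothetical helper: print a CalliNGS-NF command line that combines the
# options described above. Nothing is executed.
# Usage: callings_cmd READS GENOME VARIANTS DENYLIST RESULTS
callings_cmd() {
    reads=$1 genome=$2 variants=$3 denylist=$4 results=$5
    printf "nextflow run CRG-CNAG/CalliNGS-NF -profile docker"
    # Every value is wrapped in single quotes so that wildcards in the
    # --reads pattern are not expanded by the shell when the line is run.
    printf " --reads '%s' --genome '%s' --variants '%s' --denylist '%s' --results '%s'\n" \
        "$reads" "$genome" "$variants" "$denylist" "$results"
}

callings_cmd '/home/dataset/*_{1,2}.fq.gz' \
    /home/user/my_genome/human.fa \
    /home/user/data/variants.vcf \
    /home/user/data/denylisted_regions.bed \
    /home/user/my_results
```

Drop `-profile docker` from the printed line if the required tools are installed locally instead.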
For each sample the pipeline creates a folder named sampleID inside the directory specified by the --results command line option (default: results).
Here is a brief description of output files created for each sample:
| file | description |
|---|---|
| final.vcf | somatic SNVs called from the RNAseq data |
| diff.sites_in_files | comparison of the SNVs from RNAseq data with the set of known variants |
| known_snps.vcf | SNVs that are common between RNAseq calls and known variants |
| ASE.tsv | allele counts at the positions of SNVs (only for common SNVs) |
| AF.histogram.pdf | a histogram plot of allele frequency (only for common SNVs) |
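
After a run, the per-sample folder can be checked against the file list in the table above. This is a minimal sketch, not part of the pipeline: `missing_outputs` is a hypothetical helper, and `mySample` is a placeholder sample ID.

```bash
# Hypothetical helper: print the expected per-sample output files that are
# missing from RESULTS_DIR/SAMPLE_ID (prints nothing if all are present).
# Usage: missing_outputs RESULTS_DIR SAMPLE_ID
missing_outputs() {
    dir=$1/$2
    for f in final.vcf diff.sites_in_files known_snps.vcf ASE.tsv AF.histogram.pdf; do
        [ -e "$dir/$f" ] || echo "$f"
    done
}

# "results" is the default output directory; "mySample" is a placeholder.
missing_outputs results mySample
```

An empty output means all five files are present for that sample.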
Note: CalliNGS-NF can be used without a container engine by installing on your system all the required software components listed in the following section. See the included Dockerfile for the configuration details.
CalliNGS-NF uses the following software components and tools: