Clinical research pipeline for exploring variants in whole genome (WGS) and exome (WES) sequencing data
crg2 is a research pipeline aimed at discovering clinically relevant variants in whole genome and exome sequencing data. It aims to provide reproducible results, be computationally efficient, and be transparent in its workflow.
crg2 uses Snakemake and Conda to manage jobs and software dependencies.
A note about the config files: the values in config_hpf.yaml refer to tool/filepaths on SickKids' HPC4Health tenancy (hpf), while the values in config_cheo_ri.yaml refer to tool/filepaths on CHEO's HPC4Health tenancy. For simplicity, we refer below only to config_hpf.yaml; if you are running crg2 on CHEO's tenancy, use config_cheo_ri.yaml instead.
If you are running crg2 on the hpf, skip to the 'Running the pipeline' section.
1. Download and set up Anaconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
2. Install Snakemake 5.10.0 via Conda: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
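For example, a typical Conda installation into its own environment (the environment name is arbitrary):
```
conda create -n snakemake -c conda-forge -c bioconda snakemake=5.10.0
conda activate snakemake
snakemake -v   # should print 5.10.0
```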
3. Git clone this repo, crg, and cre. crg2 uses various scripts from the other two repos to generate the final reports.
4. Make a directory for Conda to install all its environments and executables in, for example:
```
mkdir ~/crg2-conda
```
5. Navigate to the crg2 directory and install all software dependencies:
```
cd crg2
snakemake --use-conda -s Snakefile --conda-prefix ~/crg2-conda --create-envs-only
snakemake --use-conda -s cre.Snakefile --conda-prefix ~/crg2-conda --create-envs-only
```
Make sure to replace ~/crg2-conda with the path made in step 4. This will take a while.
6. Install these plugins for VEP: LoF, MaxEntScan, SpliceRegion. Refer to this page for installation instructions: https://useast.ensembl.org/info/docs/tools/vep/script/vep_plugins.html. The INSTALL.pl script has been renamed to vep_install in VEP's Conda build; it is located in the Conda environment directory under share/ensembl-vep-99.2-0/vep_install. Your command should therefore be similar to:
```
fb5f2eb3/share/ensembl-vep-99.2-0/vep_install -a p --PLUGINS LoF,MaxEntScan,SpliceRegion
```
7. Git clone cre to a safe place:
```
git clone https://github.com/ccmbioinfo/cre
```
8. In crg2/config_hpf.yaml, replace the VEP paths with the VEP directory installed in step 6, and replace the cre path with the clone from step 7.
9. AnnotSV 2.1 is required for SV report generation. Download and unpack it, then point $ANNOTSV at the installation:
```
wget https://lbgi.fr/AnnotSV/Sources/AnnotSV_2.1.tar.gz
tar -xzvf AnnotSV_2.1.tar.gz
export ANNOTSV=/path_of_AnnotSV_installation/bin
```
Ensure the following options are set in the AnnotSV configfile:
```
-bedtools: bedtools
-overlap: 50
-reciprocal: yes
-svtBEDcol: 4
```
10. To generate a gene panel from an HPO text file exported from PhenomeCentral or G4RD, add the HPO filepath to config["run"]["hpo"]. You will also need to generate Ensembl and RefSeq gene files, as well as an HGNC gene mapping file:
```
wget -qO- http://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz | gunzip -c > Homo_sapiens.GRCh37.87.gtf
wget -qO- https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz | gunzip -c > GRCh37_latest_genomic.gff
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_assembly_report.txt
python scripts/clean_gtf.py --ensembl_gtf /path/to/Homo_sapiens.GRCh37.87.gtf --refseq_gff3 /path/to/GRCh37_latest_genomic.gff --refseq_assembly /path/to/GRCh37_latest_assembly_report.txt
```
Add the cleaned Ensembl and RefSeq gene files to the config["gene"]["ensembl"] and config["gene"]["refseq"] fields, and the HGNC gene mapping file to config["gene"]["hgnc"].
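For reference, the resulting fields in config_hpf.yaml would look something like this (a sketch only: the cleaned filenames below are placeholders, so check what clean_gtf.py actually writes, along with your HGNC download):
```yaml
gene:
  ensembl: "/path/to/Homo_sapiens.GRCh37.87_clean.gtf"    # cleaned Ensembl gene file (placeholder name)
  refseq: "/path/to/GRCh37_latest_genomic_clean.gff"      # cleaned RefSeq gene file (placeholder name)
  hgnc: "/path/to/hgnc_gene_mapping.txt"                  # HGNC gene mapping file (placeholder name)
```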
11. You will need R in your $PATH to generate the STR (EH and EHDN) reports, along with the dbscan and doSNOW packages:
```
$ R
> install.packages("dbscan")
> install.packages("doSNOW")
```
12. To generate a mobile element insertion report (WGS only), a MELT installation is required and some paths must be added to config_hpf.yaml. Unpack the MELT tarball:
```
tar zxf MELTvX.X.tar.gz
```
This should create a MELTvX.X directory in your current directory; add its path to config["tools"]["melt"]. Then generate the transposon reference list and add its path to config["ref"]["melt_element_ref"]:
```
ls <full_path_to>/MELTv2.2.2/me_refs/1KGP_Hg19/*_MELT.zip > transposon_file_list.txt
```
Finally, add the gene annotation file to config["annotation"]["melt"]["genes"]. The file can be found at:
```
<full_path_to>/MELTv2.2.2/add_bed_files/1KGP_Hg19/hg19.genes.bed
```
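Taken together, the MELT-related entries in config_hpf.yaml would look roughly like the sketch below (paths are placeholders; only the keys named above are shown):
```yaml
tools:
  melt: "/full/path/to/MELTv2.2.2"   # MELT installation directory
ref:
  melt_element_ref: "/full/path/to/transposon_file_list.txt"   # list generated above
annotation:
  melt:
    genes: "/full/path/to/MELTv2.2.2/add_bed_files/1KGP_Hg19/hg19.genes.bed"
```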
The pipeline performs the following steps.

SNV (WGS):
1. Map fastqs to the human decoy genome GRCh37d5
2. Picard MarkDuplicates, but don't remove reads
3. GATK4 base recalibration
4. Remove reads mapped to decoy chromosomes
5. Call SNVs and generate gVCFs
6. Merge gVCFs and perform joint genotyping
7. Filter against GATK best-practices filters
8. Decompose multiallelics, sort and uniq the filtered VCF using vt
9. Annotate using vcfanno and VEP
10. Generate a gemini db using vcf2db.py
11. Generate a cre report using cre.sh
SNV (WES):
1. Call variants using GATK, Freebayes, Platypus, and SAMtools
2. Apply caller-specific filters and retain PASS variants
3. Decompose multiallelics, sort and uniq the filtered VCF using vt
4. Retain variants called by GATK or by 2 other callers; annotate caller info in the VCF with INFO/CALLER and INFO/NUMCALLS
5. Annotate using vcfanno and VEP
6. Generate a gemini db using vcf2db.py
7. Generate a cre report using cre.sh
Steps 1-7 are then repeated using GATK Mutect2 variant calls to generate a mosaic variant report.
SV (WGS):
1. Call SVs using Manta, Smoove and Wham
2. Merge calls using MetaSV
3. Annotate the VCF using snpEff and SVScore
4. Split the multi-sample VCF into individual sample VCFs
5. Generate an annotated report (produces a filtered report, i.e. SVs called by at least two callers, and an unfiltered report)
BND (WGS):
1. Filter Manta SV calls to exclude all variants except BNDs
2. Annotate the VCF using snpEff
3. Generate an annotated BND report
STR (WGS):
A. ExpansionHunter: known repeat locations
B. ExpansionHunterDenovo: de novo repeats
Running the pipeline

1. Make a folder in a directory with sufficient space. Copy over the template files crg2/samples.tsv, crg2/units.tsv, crg2/config_hpf.yaml, crg2/dnaseq_slurm_hpf.sh and crg2/slurm_profile/slurm-config.yaml. You may need to re-copy config_hpf.yaml and slurm-config.yaml if they were updated in the repo since previous crg2 runs. Note that slurm-config.yaml is for submitting each rule as a cluster job, so ignore it if you are not running on a cluster.
```
mkdir NA12878
cp crg2/samples.tsv crg2/units.tsv crg2/config_hpf.yaml crg2/dnaseq_slurm_hpf.sh crg2/slurm_profile/slurm-config.yaml NA12878
cd NA12878
```
2. Set up the pipeline run by editing config_hpf.yaml:
- project: the family ID (here, NA12878).
- pipeline: wes, wgs, annot or mity, for exome sequences, whole genome sequences, simply annotating a VCF, or generating mitochondrial reports, respectively.
- a gene panel (panel) or HPO file (hpo) will generate 2 SNV reports with all variants falling within these regions, one of which includes variants in flanking regions as specified by flank.
- a ped file with parents and proband(s) will allow generation of a genome-wide de novo report if pipeline is wgs.
- minio: the path of the file-system mount that backs MinIO in the CHEO HPC4Health tenancy. Exome reports (and coverage reports, if duplication percentage is >20%) will be copied to this path if specified.
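Assuming the standard six-column PED format (family ID, individual ID, father, mother, sex, phenotype), a minimal sketch of a ped file for an NA12878 trio could look like this; the parent sample IDs are illustrative:
```
NA12878	NA12878	NA12891	NA12892	2	2
NA12878	NA12891	0	0	1	1
NA12878	NA12892	0	0	2	1
```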
samples.tsv:
```
sample
NA12878
```
units.tsv (fq1, fq2 and cram are left empty here since the input is a bam):
```
sample	platform	fq1	fq2	bam	cram
NA12878	ILLUMINA			/hpf/largeprojects/ccmbio/GIAB_benchmark_datasets/hg19/WGS/NA12878/RMNISTHS_30xdownsample.bam
```
config_hpf.yaml:
```yaml
run:
  project: "NA12878"
  samples: samples.tsv
  units: units.tsv
  ped: "" # leave this string empty if there is no ped
  panel: "" # three-column BED file based on hpo file; leave this string empty if there is no panel
  hpo: "" # five-column TSV with HPO terms; leave this string empty if there are no hpo terms
  flank: 100000
  gatk: "gatk"
  pipeline: "wgs" # either wes (exomes), wgs (genomes), annot (to annotate and produce reports for an input vcf) or mity (to generate mitochondrial reports)
  minio: ""
  PT_credentials: ""
...
```
Activate the snakemake environment and test your setup with a dry run:
```
(base) [dennis.kao@qlogin5 crg2]$ conda activate snakemake
(snakemake) [dennis.kao@qlogin5 crg2]$ snakemake -v
5.10.0
(snakemake) [dennis.kao@qlogin5 crg2]$ sh dnaseq_slurm_hpf.sh
Building DAG of jobs...
Job counts:
count jobs
1 EH
1 EH_report
1 EHdn
1 EHdn_report
1 all
1 allsnvreport
1 bam_to_cram
1 bcftools_stats
1 bgzip
25 call_variants
25 combine_calls
1 fastq_screen
1 fastqc
25 genotype_variants
2 hard_filter_calls
1 input_prep
1 manta
1 map_reads
1 mark_duplicates
1 md5
1 merge_calls
1 merge_variants
1 metasv
1 mito_vcfanno
1 mity_call
1 mity_normalise
1 mity_report
1 mosdepth
1 multiqc
1 pass
1 peddy
1 qualimap
1 recalibrate_base_qualities
1 remove_decoy
3 samtools_index
1 samtools_index_cram
1 samtools_stats
2 select_calls
1 smoove
2 snpeff
1 subset
1 svreport
2 svscore
1 tabix
1 vcf2db
1 vcfanno
1 vep
1 verifybamid2
1 vt
1 wham
1 write_version
129
[rule inputs and outputs removed for brevity]
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
...
```
3. Submit the pipeline job to the cluster.

Job scheduler: PBS (SickKids hpf)
```
qsub dnaseq.pbs
```
or, to submit each rule as a separate cluster job:
```
qsub dnaseq_cluster.pbs
```
Refer to pbs_profile/cluster.md for detailed documentation on cluster integration.

Job scheduler: Slurm
```
sbatch dnaseq_slurm_hpf.sh
```
or, on CHEO's tenancy:
```
sbatch dnaseq_slurm_cheo_ri.sh
```
dnaseq_slurm_api.sh is called by Stager when exome analyses are requested. It automatically sets up (exome_setup_stager.py) and kicks off the crg2 WES pipeline via the Slurm API using linked files that have been uploaded to MinIO.
Column descriptions and more info on how variants are filtered can be found in the CCM SharePoint.

The WES pipeline generates 5 reports:
- wes.regular: report on coding SNVs in exonic regions
- wes.synonymous: report on synonymous SNVs in exonic regions
- clinical.wes.regular: same as wes.regular but with more stringent filters
- clinical.wes.synonymous: same as wes.synonymous but with more stringent filters
- wes.mosaic: putative mosaic variants

The WGS pipeline generates up to 11 reports, including SNV reports, SV reports (SV, unfiltered SV, and BND), STR reports, a mitochondrial report, and a MELT mobile element report.
The parser_genomes.py script can be used to automate the above process for a batch of genomes:
```
usage: parser_genomes.py [-h] -f FILE -s {fastq,mapped,recal,decoy_rm} -d path

Reads sample info from TSV file (-f) and creates directory (-d) necessary to start crg2 from step (-s) requested.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Five-column tab-separated sample info file; template sample file: crg2/sample_info.tsv
  -s {fastq,mapped,recal,decoy_rm}, --step {fastq,mapped,recal,decoy_rm}
                        start running from this folder creation (step)
  -d path, --dir path   Absolute path where crg2 directory structure will be created under familyID as base directory
```
The script performs the following operations for each familyID present in the sample info file.
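For example, to set up a batch of families starting from fastqs (the sample info file and output path below are illustrative):
```
python parser_genomes.py -f sample_info.tsv -s fastq -d /path/to/results
```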
The exome_reanalysis.py script can be used to automate the above process for exome re-analysis:
```
usage: exome_reanalysis.py [-h] -f FILE -d path

Reads sample info from Stager analysis csv file (-f) and creates directory (-d) necessary to run crg2.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Analyses csv output from Stager
  -d path, --dir path   Absolute path where crg2 directory structure will be created under familyID/analysisID as base directory
```
The script parses an analysis request csv from Stager for exome re-analyses and sets up the necessary directories and files under the path passed to -d.
The genome_reanalysis.py script can be used to automate the above process for genome re-analysis:
```
usage: genome_reanalysis.py [-h] -f FILE -d path

Reads sample info from Stager analysis csv file (-f) and creates directory (-d) necessary to run crg2.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Analyses csv output from Stager
  -d path, --dir path   Absolute path where crg2 directory structure will be created under familyID as base directory
```
The script parses an analysis request csv from Stager for genome re-analyses and sets up the necessary directories and files under the path passed to -d.
The following output files are not included in the main Snakefile and can be requested on the snakemake command line.

HPO-annotated reports: reports from coding, panel and panel-flank can be annotated with HPO terms whenever an HPO file is available to us. This is done mainly for the monthly GenomeRounds. The HPO-annotated TSV files are created in the directory report/hpo_annotated using the following command:
```
snakemake --use-conda -s $SF --conda-prefix $CP --profile ${PBS} -p report/hpo_annotated
```
This output is not included in the Snakefile because the output of rule allsnvreport is a directory, so snakemake will not check for the creation of the final csv reports. Users are therefore required to make sure the csv reports are created in the three folders above, and then request the output of rule annotate_hpo separately on the command line (or append it to dnaseq_cluster.pbs).
CNV and SV comparison outputs are not yet part of the pipeline. Please follow steps 7 & 8 in crg to generate the three comparison TSVs (this is also required for GenomeRounds).
SNV calls from the WES and WGS pipelines can be benchmarked using the GIAB dataset HG001_NA12878 (family_sample) and truth calls from NISTv3.3.2:
1. The inputs in units.tsv are downsampled for testing purposes. Edit the tsv to use the inputs from HPF (ccmmarvin_shared/validation/benchmarking/benchmark-datasets) or CHEO-RI (/srv/shared/data/benchmarking-datasets/HG001).
2. Copy crg2/benchmark_hpf.tsv or crg2/benchmark_cheo_ri.tsv to the current directory. Note: benchmark_x.tsv uses HG001_NA12878 as the family_sample name, so you should edit the "project" name in config_hpf.yaml accordingly.
3. Edit config_hpf.yaml to set the "wes" or "wgs" pipeline.
4. Edit dnaseq_slurm_hpf.sh to include the target validation/HG001.
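The added target follows the same pattern as the optional report target above, for example (a sketch; $SF and $CP point to your Snakefile and Conda prefix as before):
```
snakemake --use-conda -s $SF --conda-prefix $CP -p validation/HG001
```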
A guideline to developing crg2 can be found in this document.