Snakemake workflow to analyse hematological malignancies in whole genome data
This snakemake workflow uses modules from hydragenetics to process .fastq
files and obtain different kind
of variants (SNV, indels, CNV, SV). Alongside diagnosis-filtered .vcf
files, the workflow produces a
multiqc report .html
file and some CNV plots and .tsv
files with relevant information from mutect and Manta. One of the modules contains the commercial
parabricks toolkit which can be replaced by sentieon or
opensource GATK tools if required. The following modules are currently part of this pipeline:
In order to use this module, the following dependencies are required:
Input data should be added to
samples.tsv
and
units.tsv
.
The following information need to be added to these files:
Column Id | Description |
---|---|
samples.tsv |
|
sample | unique sample/patient id, one per row |
tumor_content | ratio of tumor cells to total cells |
units.tsv |
|
sample | same sample/patient id as in samples.tsv |
type | data type identifier (one letter), can be one of Tumor, Normal, RNA |
platform | type of sequencing platform, e.g. NovaSeq |
machine | specific machine id, e.g. NovaSeq instruments have @Axxxxx |
flowcell | identifer of flowcell used |
lane | flowcell lane number |
barcode | sequence library barcode/index, connect forward and reverse indices by + , e.g. ATGC+ATGC |
fastq1/2 | absolute path to forward and reverse reads |
adapter | adapter sequences to be trimmed, separated by comma |
Reference files should be specified in
config.yaml
.fasta
reference file of the human genome is required as well as an .fai
file and an bwa index of this
file..vcf
file containing known indel sites. For GRCh38, this file is available as part of the Broad GATK
resource bundle at
google cloud..interval_list
file containing all whole genome calling regions. The GRCh38 version is also available at
google cloud.trimmer_software
should be specified by indicating a rule which should be used for trimming. This
pipeline uses fastp_pe
..bed
files defining regions of interest for different diagnoses. This pipeline is assuming ALL
and AML
and different gene lists for SNVs and SVs..bed
file containing the region that the analysis should be limited to.config.yaml
.To run the workflow,
resources.yaml
is needed which defines different resources as default and for different rules. For parabricks, the gres
stanza is needed and should specify the number of GPUs available. You also need a config.yaml
where all run-variables are defined.
snakemake --profile my-profile --configfile config/config.yaml
To run the integration test you only need to add lines in the tests/integration/config.yaml
that differs from the original config.yaml
. As of now it is only a dryrun test, no small dataset is available.
cd .tests/integration/
snakemake --snakefile ../../workflow/Snakefile --configfiles ../../config/config.yaml config.yaml -n
.fastq files are archived as compressed file pair as .spring: Archive/{project}/{sample}_{flowcell}_{lane}_{barcode}_{type}.spring
The MultiQC html report can be found here: Results/MultiQC_TN.html
All results (as described in table below) are located in: Results/{project}/{sample}/
File | Description |
---|---|
Cram/{sample}_{type}.crumble.cram |
crumbled .cram file |
Cram/{sample}_{type}.crumble.cram.crai |
index for crumbled .cram file |
SNV_indels/{sample}_T.vep.vcf.gz |
.vcf output for SNV and small indels annotated with VEP for tumor_only |
SNV_indels/{sample}_T.vep.vcf.gz.tbi |
index for .vcf output from VEP for tumor_only |
SNV_indels/{sample}_TN.vep.vcf.gz |
.vcf output from VEP for tumor/normal |
SNV_indels/{sample}_TN.vep.vcf.gz.tbi |
index for .vcf output from VEP for tumor/normal |
SNV_indels/{sample}_TN.vep.all.vcf.gz |
.vcf output from VEP for tumor/normal, hard-filtered for ALL genes |
SNV_indels/{sample}_TN.vep.all.vcf.gz.tbi |
index for .vcf output from VEP for tumor/normal, hard-filtered for ALL genes |
SNV_indels/{sample}_TN.vep.aml.vcf.gz |
.vcf output from VEP for tumor/normal, hard-filtered for AML genes |
SNV_indels/{sample}_TN.vep.aml.vcf.gz.tbi |
index for .vcf output from VEP for tumor/normal, hard-filtered for AML genes |
SNV_indels/{sample}_mutectcaller_TN.all.tsv |
.tsv file for excel containing SNVs and small indels from mutect2 for ALL |
SNV_indels/{sample}_mutectcaller_TN.aml.tsv |
.tsv file for excel containing SNVs and small indels from mutect2 for AML |
SNV_indels/{sample}.pindel.vcf.gz |
.vcf output from pindel |
SNV_indels/{sample}.pindel.vcf.gz.tbi |
index for .vcf output from pindel |
CNV/{sample}_T.vcf.gz |
.vcf output from cnvkit |
CNV/{sample}_T.vcf.gz.tbi |
index for .vcf output from cnvkit |
CNV/{sample}_{type}.CNV.xlsx |
Excel file containing overview of CNVkit results |
CNV/{sample}_T.png |
scatter plot from cnvkit for entire genome |
CNV/{sample}_T_chr{chr}.png |
scatter plot per chromosome from cnvkit |
SV/{sample}_manta_TN.ssa.vcf.gz |
.vcf output from Manta |
SV/{sample}_manta_TN.ssa.vcf.gz.tbi |
index for .vcf output from Manta |
SV/{sample}_manta_TN.ssa.all.vcf.gz |
.vcf output from Manta filtered for ALL genes |
SV/{sample}_manta_TN.ssa.all.vcf.gz.tbi |
index for .vcf output from Manta filtered for ALL genes |
SV/{sample}_manta_TN.ssa.aml.vcf.gz |
.vcf output from Manta filtered for AML genes |
SV/{sample}_manta_TN.ssa.aml.vcf.gz.tbi |
index for .vcf output from Manta filtered for AML genes |
SV/{sample}_manta_TN.del.tsv |
.tsv file for excel containing deletions found by Manta (filtered) |
SV/{sample}_manta_TN.ins.tsv |
.tsv file for excel containing insertions found by Manta (filtered) |
SV/{sample}_manta_TN.dup.tsv |
.tsv file for excel containing duplications found by Manta (filtered) |
SV/{sample}_manta_TN.bnd.tsv |
.tsv file for excel containing breakends found by Manta (filtered) |
SV/{sample}_manta_TN.bnd.all.tsv |
.tsv file for excel containing breakends found by Manta (filtered), filtered for ALL genes |
SV/{sample}_manta_TN.bnd.aml.tsv |
.tsv file for excel containing breakends found by Manta (filtered), filtered for AML genes |
The general statistics table are ordered based on the "s-index" in fastq-filename. This is done by renaming the samples in two steps using the script sample_order_multiqc.py
. To toggle between "Sample Order" and "Sample Name" use the buttons just above General Stats header.
Column Name | Origin | Comment |
---|---|---|
M Reads | Picard HSMetrics | Total number of reads in inputfile (alignment/samtools_merge_bam/{sample}_{type}.bam ) |
% Mapped | Samtools stats | Only reads on target (config[reference][design_bed] ) |
% Proper pairs | Samtools stats | Only reads on target (config[reference][design_bed] ) |
Average Quality | Samtools stats | Ratio between sum of base quality over total length. Only reads on target (config[reference][design_bed] ) |
Median | Mosdepth | Median Coverage over reference |
>= 10X | Mosdepth | Fraction of reference with coverage over 10x |
>= 30X | Mosdepth | Fraction of reference with coverage over 30x |
>=50X | Mosdepth | Fraction of reference with coverage over 50x |
Error sex check | Peddy | Result of sex check based on sex in units.tsv |
Predicted sex sex check | Peddy | |
Bases on Target | Picard HSMetrics | Bases inside the capture design (config[reference][design_intervals] ) |
Fold80 | Picard HSMetrics | The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets (config[reference][design_intervals] ) |
% Dups | Picard DuplicationMetrics | |
Mean Insert Size | Picard InsertSizeMetrics | |
Target Bases with zero coverage [%] | Picard HSMetrics | Percent target (config[reference][design_intervals] ) bases with 0 coverage |
% Adapter | fastp |
default container: docker://hydragenetics/common:0.1.9
Program | Version | Container |
---|---|---|
Arriba | 2.3.0 | docker://hydragenetics/arriba:2.3.0 |
CNVkit | 0.9.9 | docker://hydragenetics/cnvkit:0.9.9 docker://python:3.9.9-slim-buster |
Crumble | 0.8.3 | docker://hydragenetics/crumble:0.8.3 |
fastp | 0.20.1 | docker://hydragenetics/fastp:0.20.1 |
FastQC | 0.11.9 | docker://hydragenetics/fastqc:0.11.9 |
FusionCatcher | 1.33 | docker://blcdsdockerregistry/fusioncatcher:1.33 |
GATK | 4.2.2.0 | docker://hydragenetics/gatk4:4.2.2.0 |
Manta | 1.6.0 | docker://hydragenetics/manta:1.6.0 |
Mosdepth | 0.3.2 | docker://hydragenetics/mosdepth:0.3.2 |
MultiQC | 1.11 | docker://hydragenetics/multiqc:1.11 |
Parabricks | 4.0.0-1 | docker://nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 |
Peddy | 0.4.8 | docker://hydragenetics/peddy:0.4.8 |
Picard | 2.25.0 | docker://hydragenetics/picard:2.25.0 |
Pindel | 0.2.5b9 | docker://hydragenetics/pindel:0.2.5b9 |
RSeQC | 4.0.0 | docker://hydragenetics/rseqc:4.0.0 |
simple_sv_annotation.py | 2019.02.18 | docker://hydragenetics/simple_sv_annotation:2019.02.18 |
snpEff | 5.0 | docker://hydragenetics/snpeff:5.0 |
SortMeRNA | 4.3.4 | docker://hydragenetics/sortmerna:4.3.4 |
SPRING | 1.0.1 | docker://hydragenetics/spring:1.0.1 |
STAR | 2.7.10a | docker://hydragenetics/star:2.7.10a |
STAR-Fusion | 1.10.1 | docker://trinityctat/starfusion:1.10.1 |
svdb | 2.6.0 | docker://hydragenetics/svdb:2.6.0 |
VEP | 105 | docker://hydragenetics/vep:105 |