cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure
MIT License

Introduction

The variantinterpretation pipeline is a bioinformatic analysis workflow that adds biological and clinical knowledge to genomic variants. It takes genomic variants in the variant call format (VCF) as input, adds annotations, and wraps them into an HTML report and spreadsheet-compatible TSV files. Variants are annotated with information that supports molecular biologists and pathologists in interpreting their functional relevance in a biological and clinical context. Furthermore, the pipeline enables variant filtering and derives meaningful metrics such as the tumor mutational burden (TMB).

The pipeline is currently tailored to analyzing somatic single-nucleotide variants (SNVs) and small insertions and deletions (InDels). In principle, the workflow was designed to work with all VCF files independent of the originating variant caller. We tested the pipeline with the variant callers mutect2 and freebayes and are happy to receive feedback about compatibility or problems with other variant callers.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. Furthermore, it provides many options for configuring the pipeline to tailor it to your specific application. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community.
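
A typical launch command looks roughly like the following sketch. The parameter names --input, --fasta and --outdir follow common nf-core conventions and are assumptions here; see docs/usage.md for the authoritative list of required parameters.

```bash
# minimal launch sketch; parameter names follow nf-core conventions and
# may differ from this pipeline's actual required parameters
nextflow run cio-abcd/variantinterpretation \
    -profile docker \
    --input samplesheet.csv \
    --fasta reference.fa \
    --outdir results
```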

Documentation

  1. A detailed description of the modules, their functions, capabilities, limitations and configuration tips can be found in the following chapters of this README.
  2. A step-by-step guide for setting up and running this workflow, also suited for beginners with Nextflow and nf-core pipelines, can be found in the docs/usage.md documentation.
  3. A description of the output files of each module can be found in the docs/output.md documentation.
  4. A description of all parameters with their defaults, possible values and help texts can be found in the docs/params.md documentation.

Pipeline overview


  1. VCF preprocessing
    • Indexing with bcftools index.
    • VCF checks
      • Checks samplesheet integrity with a modified nf-core script
      • Checks VCF file requirements and integrity with GATK4 ValidateVariants, bcftools and a custom Python script.
      • Runs an optional BED file format check using a Python script
    • Variant normalization: splitting of multi-allelic sites into bi-allelic variants and optional left-alignment of InDels using bcftools norm.
    • Pre-annotation VCF filtering based on FILTER column entries using bcftools view.
  2. Variant annotation using the Ensembl Variant Effect Predictor (VEP).
  3. Filtering
    • Transcript filtering using the filter_vep script.
    • Custom filters: Create additional filtered VCF files based on preset filter criteria using vembrane tag and vembrane filter.
  4. Reporting
    • TSV conversion of VCF fields, including the FORMAT and INFO columns and the VEP annotation fields encoded as CSQ strings in the INFO field, using vembrane table.
    • HTML report generation using datavzrd.
    • MultiQC HTML report summarizing general statistics and warning messages.
    • Pipeline information summarizing runtime and other execution statistics in the standard nf-core pipeline overview.
  5. Tumor mutational burden (TMB) calculation: calculated for each sample based on provided cutoffs and thresholds, using the vembrane TSV output and a Python script.

Module description

VCF preprocessing

VCF indexing

Creates a tabix index (.tbi) for the VCF file, which is required by several workflow processes, using bcftools index. It requires an uncompressed or bgzip-compressed VCF file.
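
For reference, the equivalent standalone command is roughly:

```bash
# create a tabix index (.tbi) for a bgzip-compressed VCF
bcftools index --tbi sample.vcf.gz
```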

VCF checks

This VCF quality control module checks the integrity of the VCF file and several requirements on the input variants. It can additionally create warnings, logged in the MultiQC report, if the VCF file shows characteristics that can hamper interpretation.

It uses GATK4 ValidateVariants, bcftools and custom Python scripts for the checks. The following table gives an overview of the criteria that are checked and the possible warnings:

| criteria | log-level | description | tool |
| --- | --- | --- | --- |
| VCF file format | ERROR | The general structure of the VCF file needs to adhere to the VCF file format specification. | GATK4 ValidateVariants |
| uncompressed or bgzip-compressed | ERROR | VCF files need to be either uncompressed or bgzip-compressed. Gzip-compressed files cause an error during indexing. | bcftools index |
| single-sample VCF | ERROR | Multi-sample VCF files are currently not supported. | bcftools stats |
| "chr" prefix in CHROM column | ERROR | The chromosome column CHROM in the VCF file needs to contain the "chr" prefix. Please be aware that the provided reference genome must also contain the "chr" prefix. This ensures compatibility with annotation sources. | Python script |
| matching reference genome | ERROR | The provided VCF file needs to match the provided reference genome. In particular, this test can differentiate between the GRCh37 and GRCh38 human reference genomes. If left-alignment of InDels is activated, bcftools norm additionally checks the reference genome. | GATK4 ValidateVariants |
| only passed filters | WARNING | Warns if the FILTER column contains entries other than "PASS" or ".". NOTE: these can be removed with the --filter_vcf parameter in the VCF preprocessing module. Also, new flags might be added after annotation if additional filters are specified in the Filtering module. | Python script |
| no-ALT entries | WARNING | Warns if the VCF file contains non-variant positions. Genomic VCF files (gVCFs) are supported by the pipeline but can dramatically increase the runtime of VEP. | bcftools stats |
| no multiallelic sites | WARNING | Warns if the VCF file contains multiallelic variants. NOTE: these are automatically split with bcftools norm in the VCF preprocessing module. | bcftools stats |
| contains variants other than SNVs and InDels | WARNING | Warns if the VCF file contains variants other than SNVs and InDels. | bcftools stats |
| previous VEP annotation present | WARNING | Warns if a previous VEP annotation is present. The test checks for VEP in the header and whether the INFO column already contains a CSQ key. | Python script |

Variant normalization

This module uses bcftools norm to split multi-allelic into bi-allelic sites. This step is required by vembrane, as it cannot handle multi-allelic records (also see the vembrane documentation). Optionally, left-alignment of InDels can be activated using the --left-align-indels parameter. Note: when left-alignment is enabled, bcftools norm performs an additional reference genome check (on top of the check in the VCF checks module).
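
For reference, the standalone commands correspond roughly to:

```bash
# split multi-allelic records into bi-allelic ones
bcftools norm --multiallelics -any sample.vcf.gz -Oz -o sample.norm.vcf.gz

# with left-alignment of InDels; --fasta-ref also triggers the reference check
bcftools norm --multiallelics -any --fasta-ref reference.fa sample.vcf.gz -Oz -o sample.norm.vcf.gz
```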

Pre-annotation VCF filter

This is an optional step activated and controlled through the --filter_vcf parameter, which takes the names of the FILTER column flags to keep (e.g., "PASS") and filters the VCF file accordingly using bcftools view. This step is placed prior to annotation to improve runtime if, e.g., lots of low-quality variants are removed.
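
Standalone, this corresponds roughly to:

```bash
# keep only records whose FILTER entry is PASS
bcftools view --apply-filters PASS sample.vcf.gz -Oz -o sample.pass.vcf.gz
```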

:::warning WARNING: Variants filtered in this step will NOT be included in any output file! :::

Ensembl VEP annotation

The Variant Effect Predictor (VEP) annotates variants based on provided public databases. It provides biological information such as protein consequences and effect predictions, as well as co-located variants from existing databases, e.g., population allele frequencies. For a full overview, see the VEP annotation sources and VEP command flags.

Currently, this workflow only supports annotation with sources from the VEP cache. We plan on adding other databases as annotation sources in the future. To add HGVS nomenclature for variants, you need to specify a FASTA file, which is therefore a requirement for this pipeline. You can easily enable/disable VEP options using several parameters, also see the parameter documentation in docs/params.md.

The databases and their respective versions are documented within the VCF header under the VEP flag. Currently, we use VEP version 110, hence using the VEP cache version 110 is highly recommended. The cache bundles the underlying annotation databases; see the VEP annotation sources documentation for the full list.
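
Under the hood, the annotation step is conceptually similar to the following offline VEP call (a simplified sketch; the pipeline assembles the actual flags from your parameters):

```bash
# simplified sketch of an offline VEP run against the local cache
vep --input_file sample.vcf.gz --output_file sample.vep.vcf.gz \
    --vcf --compress_output bgzip \
    --offline --cache --dir_cache /path/to/vep_cache --cache_version 110 \
    --fasta reference.fa --hgvs
```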

Filtering

Transcript filtering

The first step after annotation is an (optional) filtering for transcripts using the filter_vep script provided by the VEP software suite. VEP adds all possible transcripts and their annotations (e.g., consequence) to the variant records in the VCF file. For interpretation, it is useful to filter for the most relevant or for very specific transcripts.

:::note The final TSV and HTML reports include each transcript of a variant as a separate row! For example: a variant with 6 annotated transcripts results in 6 separate rows in the final TSV file, one per transcript. :::

By default, this module uses the external argument --soft_filter to prevent silent dropping of variants; instead, a flag ("filter_vep_fail" or "filter_vep_pass") is added to the FILTER column. Another default external argument is --only_matched, which drops annotations that do not match the filtering criteria. As a result, variants only keep transcript annotations that match the filter criteria; variants without any matching transcript are retained, but without any annotation.
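
Conceptually, the transcript filtering call resembles the following sketch; the CANONICAL filter shown here is only an illustration of a boolean VEP column:

```bash
# soft-filter transcripts, keeping only those flagged as canonical;
# non-matching annotations are dropped due to --only_matched
filter_vep --input_file sample.vep.vcf --output_file sample.filtered.vcf \
    --filter "CANONICAL" --soft_filter --only_matched
```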

This module has two options to filter transcripts:

  1. Filter by specific VEP annotation columns using the parameter --transcriptfilter. It only supports specific columns that provide boolean information about which transcripts to include.
  2. Filter by a specific list of transcripts provided via the --transcriptlist parameter. This can be combined with --transcriptfilter. The provided file needs to contain one transcript per row and must match the transcript definitions of the provided VEP cache (RefSeq or Ensembl), also see the downloads section in docs/usage.md. Please note that the annotation will be completely removed for variants that do not match any of the provided transcripts!

:::warning WARNING: Transcripts filtered in this step will not be shown in any output file! :::

Custom filters

This module can create additional reports in which preset filters are applied. These are very useful for interpretation, as you can define complex filters once and do not have to enter them manually in your spreadsheet program or HTML report every single time. It can be used, e.g., for hopper-based interpretation strategies. [Vembrane](https://github.com/vembrane/vembrane) is used for creating filtered subsets of the VCF files, from which additional TSV files and HTML reports are created. The workflow always creates a report for the VCF file after transcript filtering. It can be configured as follows:

  1. Define your preset filters in a TSV file and supply it with the --custom_filters parameter. You can find an example TSV file in the assets folder: assets/custom_filters.tsv. The TSV file needs two columns: the first column contains the name of the filter (letters, numbers and underscores allowed), the second column a valid Python expression defining the filter (see the example after this list). The Python expression has to follow the vembrane guidelines, also see here: https://github.com/vembrane/vembrane#filter-expression.
  2. Define the filters to be used in this run with --used_filters. This allows you to define a multitude of filters in a central TSV file, but only use a subset of them for specific runs.
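
A hypothetical custom_filters.tsv could look like this; the filter names and expressions are purely illustrative:

```bash
# write a hypothetical preset filter file: <name> TAB <vembrane expression>
cat > custom_filters.tsv <<'EOF'
high_coverage	INFO["DP"] >= 100
snvs_only	len(REF) == 1 and len(ALT) == 1
EOF
```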

The module performs two consecutive steps for filtering the variants:

  1. Tag variants in their FILTER column using vembrane tag with the respective filter name. This tag can be found in every report file.
  2. Create additional VCF files by filtering for each of the filters specified via --used_filters using vembrane filter (see the sketch below).
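
Standalone, the two steps correspond roughly to the following vembrane calls, reusing the hypothetical high_coverage filter from above:

```bash
# step 1: tag matching variants in the FILTER column
vembrane tag --tag high_coverage='INFO["DP"] >= 100' sample.vcf > sample.tagged.vcf

# step 2: create a filtered subset containing only matching variants
vembrane filter 'INFO["DP"] >= 100' sample.vcf > sample.high_coverage.vcf
```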

Reporting

Summary of final output files in reports/ folder:

  1. TSV files for each sample and preset filter in tsv/
  2. HTML reports for each sample and preset filter in html/
  3. MultiQC report in multiqc/
  4. Nextflow pipeline report in pipeline_report/

TSV conversion

The VCF file is converted into the tab-separated values (TSV) file format using vembrane table, which can be easily imported into spreadsheet programs for manual processing. Several parameters control which VCF fields are extracted into the TSV file. Note that the resulting TSV file is the basis for the HTML reports, hence the HTML report will not contain fields missing from the TSV file.

For an overview of the CSQ fields included in the VEP-annotated VCF file, have a look at the VEP output documentation.
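
As a rough standalone sketch of the conversion (field selection simplified; the pipeline derives the actual field list from its parameters):

```bash
# pull a few core VCF fields into a TSV; VEP CSQ fields can be selected
# similarly via vembrane's annotation-key mechanism (see the vembrane docs)
vembrane table 'CHROM, POS, REF, ALT, INFO["DP"]' sample.vcf > sample.tsv
```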

HTML report

HTML reports are generated using datavzrd based on a YAML configuration file. The HTML report enables several features including interactive filtering, links within the data or to the internet, plotting, etc. The configuration file is rendered using the YTE template engine, which enables the use of Python code in YAML files for dynamic rendering.

The HTML report can be highly customized by defining the structure and display of each column with the --annotation_colinfo parameter; you can find detailed information in the parameter's help text. By default, the preconfigured columns in assets/annotation_colinfo.tsv are used. The report can distribute columns to different HTML subpages using the group identifier, making the report clearer and easier to read. You can further define the displayed column name, whether the column is shown by default, and special visualizations, e.g., tick plots, heatmaps and hyperlinks, by defining the data_type of each column. Each HTML report also contains a button to export an Excel file (.xlsx format), which differs from the TSV file, e.g., by grouping columns into different sheets based on their group definition.
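
As a loose illustration of the idea, such a file maps each TSV column to its display properties. The header and values below are hypothetical; consult assets/annotation_colinfo.tsv for the real format:

```bash
# hypothetical sketch of an annotation_colinfo-style TSV; the real header is
# defined in assets/annotation_colinfo.tsv and may differ from the names used here
cat > annotation_colinfo.tsv <<'EOF'
column	group	display_name	show_by_default	data_type
SYMBOL	gene	Gene	true	hyperlink
gnomADe_AF	population	gnomAD AF	true	heatmap
DP	quality	Depth	false	ticks
EOF
```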

If you are familiar with datavzrd config files, you can also specify your own datavzrd configuration template using the --datavzrd_config parameter and customize the HTML report even more. By default, the configuration file in assets/datavzrd_config_template.yaml is used and rendered.

MultiQC

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualized in the report, and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools, e.g., bcftools. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Depending on the input and enabled tools, the report can contain several sections.

Pipeline information

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Tumor Mutational Burden (TMB) calculation module

The TMB calculation module calculates the ratio of eligible mutations per megabase pair (Mbp) using a custom Python script. The TMB module works in coordination with the BED check module, as a BED file is required to calculate the panel-size-adapted TMB, and requires a set of filtering thresholds which define the eligibility of a mutation. The eligible mutations remaining after each applied filtering step (variant type, coverage, allele frequency, population allele frequency) and the panel size (in bp) inferred from the provided BED file are reported.

The calculation is performed on the vembrane TSV output file, allowing for prefiltering of unwanted mutations using the variant-filter step prior to TMB calculation. TMB calculation is only performed if a well-formatted BED file was provided to the workflow. Based on the provided BED file, a comparison to the provided breaking threshold (--panelsize_threshold) is performed: if the BED file covers fewer base pairs than the provided threshold, the calculation is stopped and a warning raised.

TMB calculation is not a unified and standardized process, thus different filtering thresholds can be provided, including:

  • a variant type filter for SNVs, or SNVs and MNVs,
  • lower and upper allele frequency boundaries,
  • a minimal threshold for coverage,
  • a maximal threshold for presence in the gnomAD global population frequency or another defined population database,
  • a flag to exclude InDels from the calculation procedure.
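
The final value is the number of eligible mutations divided by the panel size in Mbp, for example:

```bash
# e.g., 250 eligible mutations on a 1.5 Mbp (1,500,000 bp) panel
echo "scale=2; 250 / (1500000 / 1000000)" | bc   # => 166.66 mutations/Mbp
```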

The main parameter flags and their respective defaults are documented in docs/params.md.

Contributions and Support

This pipeline development repository is a collaborative effort of the Center for Integrated Oncology (CIO) of the universities of Aachen, Bonn, Cologne and Düsseldorf (ABCD) to standardize and optimize data analysis in a clinical context.

If you would like to contribute to this pipeline, please see the contributing guidelines.

Contributing authors

Citations

If you use variantinterpretation for your analysis, please cite it using the following DOI: 10.5281/zenodo.10036356. An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.