Illumina / paragraph

Graph realignment tools for structural variants

Paragraph: a suite of graph-based genotyping tools

Introduction

Accurate genotyping of known variants is critical for the analysis of whole-genome sequencing data. Paragraph aims to facilitate this by providing an accurate genotyper for structural variants using short-read data.

Please reference Paragraph using:

Genotyping data in this paper can be found at paper-data/download-instructions.txt

For details of population genotyping, please also refer to:

Installation

Please check doc/Installation.md for system requirements and installation instructions.

Run Paragraph from VCF

Test example

After installation, run the multigrmpy.py script from the build/bin directory on an example dataset as follows:

python3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \
                          -m share/test-data/round-trip-genotyping/samples.txt \
                          -r share/test-data/round-trip-genotyping/dummy.fa \
                          -o test

This runs a simple genotyping example for two test samples.
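The same invocation can be assembled from Python, which is convenient when the test run is part of a larger script. This is a minimal sketch, not part of Paragraph: the helper names `build_multigrmpy_command` and `run_test_example` are hypothetical, and only the `-i`, `-m`, `-r` and `-o` options shown in the command above are used.

```python
import subprocess
from pathlib import Path


def build_multigrmpy_command(vcf, manifest, reference, output_dir,
                             script="bin/multigrmpy.py"):
    """Return the argument list for the multigrmpy.py call shown above.

    Hypothetical helper: it only mirrors the -i/-m/-r/-o options from the
    README example and does not know about any other option.
    """
    return [
        "python3", str(script),
        "-i", str(vcf),         # input VCF with candidate variants
        "-m", str(manifest),    # sample manifest
        "-r", str(reference),   # reference FASTA
        "-o", str(output_dir),  # output folder
    ]


def run_test_example(build_dir="."):
    """Run the bundled round-trip example from the given build directory."""
    data = Path("share/test-data/round-trip-genotyping")
    cmd = build_multigrmpy_command(
        data / "candidates.vcf",
        data / "samples.txt",
        data / "dummy.fa",
        "test",
    )
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(cmd, cwd=build_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems and the line-continuation backslashes needed on the command line.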

The output folder test then contains the final genotypes as gzipped VCF and JSON, together with the workflow log and the converted input:

$ tree test
test
├── grmpy.log            #  main workflow log file
├── genotypes.vcf.gz     #  Output VCF with individual genotypes
├── genotypes.json.gz    #  More detailed output than genotypes.vcf.gz
├── variants.vcf.gz      #  The input VCF with unique ID from Paragraph
└── variants.json.gz     #  The converted graphs from input VCF (no genotypes)
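A quick way to confirm that a run produced every file in the listing above is to check for them by name. The sketch below is an illustration only: `missing_outputs` is a hypothetical helper, and the file names are taken directly from the listing.

```python
from pathlib import Path

# File names as they appear in the output listing above.
EXPECTED_OUTPUTS = (
    "grmpy.log",
    "genotypes.vcf.gz",
    "genotypes.json.gz",
    "variants.vcf.gz",
    "variants.json.gz",
)


def missing_outputs(output_dir):
    """Return the expected output files that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]
```

An empty list from `missing_outputs("test")` means all five files are present; it says nothing about whether their contents are correct.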

If successful, the last 3 lines of genotypes.vcf.gz will be the same as in the expected file.
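The comparison of the last three lines can be scripted. This is a minimal sketch under stated assumptions: `last_lines` and `tails_match` are hypothetical helpers, and the location of the expected file is left as a parameter because it is not given here.

```python
import gzip
from collections import deque


def last_lines(path, n=3):
    """Return the last n lines of a text file, reading .gz files transparently."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # A bounded deque keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def tails_match(observed, expected, n=3):
    """True if both files end with the same n lines."""
    return last_lines(observed, n) == last_lines(expected, n)
```

For example, `tails_match("test/genotypes.vcf.gz", expected_path)` compares the run output against whichever expected file you supply as `expected_path`.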

Input requirements

VCF format

paraGRAPH will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirement of VCF 4.0+.
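To see which representation each record in a VCF uses, the ALT column can be inspected directly. The sketch below relies only on the VCF convention that symbolic alleles are written in angle brackets; the function names are hypothetical, and `<DEL>` in the usage note is a generic VCF example rather than a statement about which symbolic alleles are supported.

```python
def is_symbolic_allele(alt):
    """True for VCF symbolic alleles, which are written as <ID>."""
    return len(alt) > 2 and alt.startswith("<") and alt.endswith(">")


def alt_styles(vcf_line):
    """Return (ALT, style) pairs for one tab-separated VCF data line.

    The style is "symbolic" for <ID> alleles and "explicit" otherwise.
    Breakend notation is not treated specially in this sketch.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # column 5 holds comma-separated ALT alleles
    return [
        (alt, "symbolic" if is_symbolic_allele(alt) else "explicit")
        for alt in alts
    ]
```

A record with REF `ACGT` and ALT `A` is reported as explicit, while a record with ALT `<DEL>` is reported as symbolic.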

Currently we support 4 symbolic alleles:

Sample Manifest

Must be tab-delimited.
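Manifests saved from a spreadsheet or text editor sometimes end up space-delimited, so it can help to check the delimiter before a run. This is a minimal sketch, not Paragraph's own validation: `check_tab_delimited` is a hypothetical helper that makes no assumption about column names and only checks that every non-empty line splits into the same number of tab-separated fields.

```python
def check_tab_delimited(path):
    """Return a list of problems; an empty list means the file looks tab-delimited."""
    problems = []
    expected = None  # field count of the first non-empty line
    with open(path, "r", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # ignore blank lines
            fields = line.split("\t")
            if expected is None:
                expected = len(fields)
                if expected < 2:
                    problems.append(
                        f"line {number}: no tab found; is the file space-delimited?"
                    )
            elif len(fields) != expected:
                problems.append(
                    f"line {number}: {len(fields)} fields, expected {expected}"
                )
    return problems
```

The check is deliberately structural: it does not verify that the required columns are present, only that tabs are used consistently.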

Required columns:

Optional columns:

Run time

Population-scale genotyping

To efficiently genotype SVs across a population, we recommend running Paragraph in single-sample mode as follows:

Run Paragraph on complex variants

For more complicated events (e.g. genotyping a deletion together with its nearby SNP), you can provide a customized JSON to paraGRAPH:

Please follow the pattern in example JSON and make sure all required keys are provided. Here is a visualization of this sample graph.

To obtain graph alignments for this graph (including all reads), run:

bin/paragraph -b <input BAM> \
              -r <reference fasta> \
              -g <input graph JSON> \
              -o <output JSON path> \
              -E 1

To obtain the alignment summary, genotypes of each breakpoint, and the whole graph, run:

bin/grmpy -m <input manifest> \
          -r <reference fasta> \
          -i <input graph JSON> \
          -o <output JSON path> \
          -E 1

If you have multiple events listed in the input JSON, multigrmpy.py can help you to run multiple grmpy jobs together.

Further Information

Please check the GitHub wiki for common usage questions and errors.

Documentation

External links

License

The LICENSE file contains information about libraries and other tools we use, and license information for these.