Zheng-NGS-Lab / SplitFusion

SplitFusion is a clinical-grade fusion detection algorithm with high research potentials.
https://github.com/Zheng-NGS-Lab/SplitFusion
4 stars 1 forks source link
gene-fusion

Introduction

Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets such as BCR-ABL in chronic myeloid leukemia and EML4-ALK in lung cancer. Accurate detection of gene fusion plays a pivotal role in precision medicine by matching the right drugs to the right patients.

Challenges in the diagnosis of gene fusions include that there could be many and sometimes unknown fusion partners, low gene expression (e.g. ALK), non fusion-specific protein expression (e.g. ROS1), potential involvments of cryptic splice sites, low sequence diversity at genomic breakpoints and associated mapping difficulty, and poor sample (poor quality, limited amount and low tumor cellularity) in clinical specimens. The anchored multiplex PCR (AMP) is a clinically proven technology that addresses all these issues and has accelerated gene fusion discoveries and supported robust clinical diagnosis (Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014).

Equally important to a robust wet lab technology is a high-performing computational method for calling gene fusions. SplitFusion is fast by leveraging the chimeric split-read alignments of BWA-MEM (Li H. 2013). SplitFusion is agnostic to known coding transcripts. SplitFusion is sensitive, specific, computationally efficient, and features highly desirable abilities in clinical reporting, including the capabilities to infer fusion transcript frame-ness and exon-boundary alignments; to calculate number of unique DNA/RNA fragment ligation sites; and the SplitFusion-Target mode allows for continuous evidence-based improvement in clinical reporting.

SplitFusion can be used for RNA-seq data and the Anchored Multiplex PCR (AMP) data.

How does SplitFusion work?

The analsyis consists of the following computational steps:

  1. Reference alignment and deduplication.

  2. CIGAR transformation.

  3. Candidate breakpoint calling.

  4. Initial breakpoint filtering.

  5. Breakpoint gene annotation, frame-ness, exon boundary, further filtering and target reporting.

  6. Result reporting and visualization.

Dependencies

When running SplitFusion, you can specify paths to the tools and genome files you already have. If not, here are the human genome data and tools for installation.

Installation

cd /home/user1/tools/
git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git

Run

1. Help

    python /home/user1/tools/SplitFusion/exec/SplitFusion.py -h

usage: SplitFusion.py [-h] --refGenome REFGENOME --annovar SNPEFF_PATH --samtools
                      SAMTOOLS --bedtools BEDTOOLS --bwa BWA --R R --perl PERL
                      --output OUTPUT --sample_id SAMPLE_ID
                      [--bam_file BMS_FILE] [--fastq_file1 FASTQ_FILE1] [--fastq_file2 FASTQ_FILE2]
                      [--panel_dir PANEL_DIR] [--panel PANEL] [--steps STEPS]
                      [--AnnotationMethod ANNOTATIONMETHOD] [--thread THREAD]
                      [--minMQ MINMQ] [--minMQ1 MINMQ1]
                      [--minMapLength MINMAPLENGTH]
                      [--minMapLength2 MINMAPLENGTH2]
                      [--maxQueryGap MAXQUERYGAP] [--maxOverlap MAXOVERLAP]
                      [--minExclusive MINEXCLUSIVE]
                      [--FusionMinStartSite FUSIONMINSTARTSITE]
                      [--minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION]
                      [--minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION]

Split-Fusion is a fast data analysis pipeline detects gene fusion based on
chimeric split-read alignments.

optional arguments:
  -h, --help            show this help message and exit
  --refGenome REFGENOME
                        The reference genome file, with a full path
                        [required].
  --annovar SNPEFF  The snpEff path [required].
  --samtools SAMTOOLS   The samtools executable file with full path [Optional].
  --bedtools BEDTOOLS   The bedtools executable file with full path [Optional].
  --bwa BWA             The bwa executable file with full path [Optional].
  --R R                 The R executable file with full path [Optional].
  --perl PERL           The perl executable file with full path [Optional].
  --output OUTPUT       The directory for output SplitFusion results [required].
  --sample_id SAMPLE_ID
                        The name of sample to be analyzed [required].
  --bam_file BAM_FILE   If the bam_file is specified, the Kickstart mode will
                        be used. [Either fastq_file or bam_file should be
                        specified]
  --fastq_file1 FASTQ_FILE1
                        The fastq file (Read 1 of paired-end) to be analyzed
                        [Either fastq_file or bam_file should be specified].
  --fastq_file2 FASTQ_FILE2
                        Read 2 of paired-end fastq file.
  --panel_dir PANEL_DIR
                        For TARGET mode: the path where known significant
                        fusions or splicing isoforms (whitelist) or unwanted
                        fusions involving homologous genes or recurrent falsed
                        positives (blacklist) are stored. Default='NA'
  --panel PANEL         The prefix name of TARGET gene panel file. E.g.,
                        'LungFusion' for LungFusion.GSP2.bed. Default='NA'
  --steps STEPS         Specify steps to run. Default='1_fastq-bam,2_bam-
                        breakpoint,3_breakpoint-filter,4_breakpoint-anno
                        ,5_breakpoint-anno-post'

2. Examples


# Examples of running different modes of SplitFusion

## First, copy example data files from pipeline to local testing directory, e.g.:

    mkdir -p /home/user1/SplitFusion-test/data
    cp /home/user1/tools/SplitFusion/inst/data/example_data/Lib001.* /home/user1/SplitFusion-test/data/

## Before using BWA, the reference genome needs to be indexed, e.g.:

    /home/user1/tools/bwa/bwa index -a bwtsw /home/user1/database/Homo_sapiens_assembly19.fasta

##=========================================================
## Start from FASTQ files, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
        --refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
        --annovar /home/user1/tools/snpEff \
        --samtools /home/user1/tools/samtools/bin/samtools \
        --bedtools /home/user1/tools/bedtools2/bin/bedtools \
        --bwa /home/user1/tools/bwa/bwa \
        --R /usr/bin/R \
        --perl /usr/bin/perl \
        --output /home/user1/SplitFusion-test/output \
        --sample_id "Lib001" \
        --fastq_file1 /home/user1/SplitFusion-test/data/Lib001.R1.fq \
        --fastq_file2 /home/user1/SplitFusion-test/data/Lib001.R2.fq \
        --thread 4

##=========================================================
## Kickstart mode, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
        --refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
        --annovar /home/user1/tools/snpEff \
        --samtools /home/user1/tools/samtools/bin/samtools \
        --bedtools /home/user1/tools/bedtools2/bin/bedtools \
        --bwa /home/user1/tools/bwa/bwa \
        --R /usr/bin/R \
        --perl /usr/bin/perl \
        --output /home/user1/SplitFusion-test/kickstart-mode-output \
        --sample_id "Lib001" \
    --bam_file /home/user1/SplitFusion-test/data/Lib001.bam \
        --thread 1

##=========================================================
## TARGET gene panel mode, with panel info
##=========================================================
mkdir -p /home/user1/SplitFusion-test/panel
cp /home/user1/tools/SplitFusion/inst/data/panel/* /home/user1/SplitFusion-test/panel/

python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
        --refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
        --annovar /home/user1/tools/snpEff \
        --samtools /home/user1/tools/samtools/bin/samtools \
        --bedtools /home/user1/tools/bedtools2/bin/bedtools \
        --bwa /home/user1/tools/bwa/bwa \
        --R /usr/bin/R \
        --perl /usr/bin/perl \
        --output /home/user1/SplitFusion-test/target-mode-output \
        --sample_id "Lib001" \
        --fastq_file1 /home/user1/SplitFusion-test/data/Lib001.R1.fq \
        --fastq_file2 /home/user1/SplitFusion-test/data/Lib001.R2.fq \
    --panel_dir /home/user1/SplitFusion-test/panel \
    --thread 4 \
        --panel LungFusion

##=======================================================================
## Selecting only some steps to run.
##   If you have run Steps 1 and 2, and just want to repeat analyses from
## the Step 3 (e.g. the whitelist/blacklist has been updated), then:
##=======================================================================
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
        --refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
        --annovar /home/user1/tools/snpEff \
        --samtools /home/user1/tools/samtools/bin/samtools \
        --bedtools /home/user1/tools/bedtools2/bin/bedtools \
        --bwa /home/user1/tools/bwa/bwa \
        --R /usr/bin/R \
        --perl /usr/bin/perl \
        --output /home/user1/SplitFusion-test/output \
        --sample_id "Lib001" \
        --fastq_file1 /home/user1/SplitFusion-test/data/Lib001.R1.fq \
        --fastq_file2 /home/user1/SplitFusion-test/data/Lib001.R2.fq \
        --panel_dir /home/user1/SplitFusion-test/panel \
        --panel LungFusion \
    --steps "3_breakpoint-filter,4_breakpoint-anno,5_breakpoint-anno-post"

Output

An example brief output table:

SampleID GeneExon5_GeneExon3 frame num_partner_ends num_unique_reads exon.junction breakpoint transcript_5 transcript_3 function_5 function_3 gene_5 cdna_5 gene_3 cdna_3
Lib001 EML4_intronic---ALK_exon20 in-frame 9 11 Both 2_29446396__2_42492091 NM_019063 NM_004304 intronic exonic EML4 667 ALK 3171
Lib001 EML4_exon6---ALK_exon20 in-frame 2 1 Both 2_29446396__2_42491871 NM_019063 NM_004304 exonic exonic EML4 667 ALK 3171

An example output fastq file for the EML4_intronic---ALK_exon20 fusion of sample Lib001 is:


>NS500673:45:HHK2HAFXX:1:11206:14487:5244:
ATTGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTATTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTGTCTGCAATTTTGGTAACTTTTG
>NS500673:45:HHK2HAFXX:1:11307:26804:17812:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTG
>NS500673:45:HHK2HAFXX:2:11207:15517:8468:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:2:21212:23872:20046:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTG
>NS500673:45:HHK2HAFXX:3:21410:6281:16202:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTG
>NS500673:45:HHK2HAFXX:3:21605:12646:6668:
GTCATCATCAACCAAGCAAAAATGTCAACTCGCGAAAAAAACAGCCAAGTGTTCCGCCGGAAGCACCAGGAGCTGCAAGCCAT
>NS500673:45:HHK2HAFXX:4:21510:10840:13732:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTAGTTGATGATGACATC
>NS500673:45:HHK2HAFXX:4:21605:21501:18033:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGGTGATGACATCTTTATGCTTG

Visualization (on PC or Mac)


Within R, run:

> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("igvR")
> source("SplitFusion/R/bam2igv.R")
> bam2igv(bamfile = "/home/user1/SplitFusion-test/output/Lib001/Lib001.EML4_intronic---ALK_exon20.bam")

An visualization of the EML4 intronic fusion site: Segmentfault

An visualization of the ALK exon20 fusion site: Segmentfault