Kingsford-Group / squid

SQUID detects both fusion-gene and non-fusion-gene structural variations from RNA-seq data
BSD 3-Clause "New" or "Revised" License

OVERVIEW

SQUID is designed to detect both fusion-gene and non-fusion-gene transcriptomic structural variations from RNA-seq alignment.

The SQUID paper is published in Genome Biology. To reproduce the results of applying SQUID to simulated data and previously studied cell lines, follow the instructions in squidtest.

INSTALLING PRE-COMPILED BINARIES

You do NOT need to compile SQUID before using it; a pre-compiled binary release is available here!

BUILDING FROM SOURCE

You only need to build from source if either the pre-built binaries (see above) don't work on your system or you want to make a change to the SQUID code.

Compiling SQUID requires Boost, GLPK, and BamTools. Step-by-step installation instructions can be found here for Linux, and here for Mac.

On Mac, you additionally need to run the following commands so that the dependent libraries can be dynamically linked:

export DYLD_LIBRARY_PATH=<bamtools_folder>/lib
export DYLD_LIBRARY_PATH=<glpk_folder>/lib:$DYLD_LIBRARY_PATH

USAGE

SQUID takes a sorted BAM file of RNA-seq alignments and outputs the detected transcriptomic structural variations (TSVs). When the concordant and chimeric alignments are separated into two BAM files, as is the case with STAR, the concordant BAM file must be sorted. The command to run SQUID and its parameters are as follows.

squid [options] -b <Input_sorted_BAM> -o <Output_Prefix>

| Parameters | Default value | Data type | Description |
| --- | --- | --- | --- |
| -c | | string | Chimeric BAM file (for example, the one generated by STAR) |
| -f | | string | |
| -pt | 0 | bool | Phred type: 0 for Phred33, 1 for Phred64 |
| -pl | 10 | int | Maximum length of continuous low Phred scores, used to filter alignments |
| -pm | 4 | int | Threshold for a Phred score to count as low |
| -mq | 1 | int | Minimum mapping quality |
| -dp | 50000 | int | Maximum paired-end aligning distance to be counted as a concordant alignment |
| -di | 20 | int | Maximum distance of segment indexes to be counted as read-through |
| -w | 5 | int | Minimum edge weight |
| -r | 8 | double | Discordant edge ratio multiplier (normal/tumor cell ratio) |
| -a | 5 | int | Maximum allowed degree |
| -G | 0 | bool | Whether to output the graph file (0 for not outputting, 1 for outputting) |
| -CO | 0 | bool | Whether to output the ordering of connected components (0 for not outputting, 1 for outputting) |
| -TO | 0 | bool | Whether to output the ordering of all segments (0 for not outputting, 1 for outputting) |
| -RG | 0 | bool | Whether to output the rearranged genome sequence (0 for not outputting, 1 for outputting) |
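
When SQUID is launched from a pipeline script, it can help to assemble the argument list programmatically so that only flags from the table above are passed. The following is a minimal sketch, not part of SQUID: the helper name `build_squid_command` and the `KNOWN_FLAGS` set are hypothetical, and the sketch assumes that bool parameters are passed with an explicit 0 or 1 value, as the table descriptions suggest.

```python
from typing import Dict, List, Optional, Union

# Flags copied from the parameter table; anything else is rejected by this sketch.
KNOWN_FLAGS = {"-c", "-f", "-pt", "-pl", "-pm", "-mq", "-dp", "-di",
               "-w", "-r", "-a", "-G", "-CO", "-TO", "-RG"}


def build_squid_command(
    bam: str,
    prefix: str,
    options: Optional[Dict[str, Union[str, int, float, bool]]] = None,
    executable: str = "squid",
) -> List[str]:
    """Return an argv list of the form: squid [options] -b <bam> -o <prefix>."""
    argv = [executable]
    for flag, value in (options or {}).items():
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"flag {flag!r} is not in the parameter table")
        if isinstance(value, bool):
            # Assumption: bool parameters take an explicit 0/1 argument.
            value = int(value)
        argv += [flag, str(value)]
    argv += ["-b", bam, "-o", prefix]
    return argv


if __name__ == "__main__":
    cmd = build_squid_command(
        "alignment.bam", "squidout",
        {"-c": "chimeric.bam", "-mq": 10, "-G": True},
    )
    # Prints the command; pass the list to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```

Returning a list rather than a single string keeps file names with spaces intact when the command is handed to `subprocess.run`.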

OUTPUT SPECIFICATION

EXAMPLE WORKFLOW

If you have the alignment BAM file and the chimeric BAM file generated by STAR (https://github.com/alexdobin/STAR), run SQUID with:

squid -b alignment.bam -c chimeric.bam -o squidout

If you instead have a combined BAM file containing both concordant and discordant alignments, generated by BWA (http://bio-bwa.sourceforge.net/) or SpeedSeq (https://github.com/hall-lab/speedseq), run SQUID with:

squid --bwa -b combined_alignment.bam -o squidout
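
Because the BAM file passed to -b must be sorted, a quick pre-flight check on its header can catch an unsorted input before SQUID is run. The sketch below reads the sort order declared in the `@HD` line of the BAM header using only the Python standard library. It is not part of SQUID, the function names are hypothetical, and it assumes that "sorted" means coordinate-sorted (`SO:coordinate`). It only reports what the header declares; it does not verify the order of the records themselves.

```python
import gzip
import struct
from typing import Optional


def bam_sort_order(path: str) -> Optional[str]:
    """Return the SO value from the @HD header line of a BAM file, or None if absent."""
    # BAM uses BGZF compression, a series of gzip members that gzip.open can read.
    with gzip.open(path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError(f"{path} does not start with the BAM magic bytes")
        # l_text: little-endian 32-bit length of the plain-text SAM header.
        (l_text,) = struct.unpack("<i", handle.read(4))
        text = handle.read(l_text).decode("utf-8", errors="replace")
    for line in text.rstrip("\x00").split("\n"):
        if line.startswith("@HD"):
            for field in line.rstrip("\r").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def require_coordinate_sorted(path: str) -> None:
    """Raise ValueError unless the header declares coordinate sorting (an assumption)."""
    order = bam_sort_order(path)
    if order != "coordinate":
        raise ValueError(f"{path}: header declares sort order {order!r}, expected 'coordinate'")
```

For example, calling `require_coordinate_sorted("alignment.bam")` before launching SQUID raises an error with the declared sort order if the header does not say `coordinate`.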

To run an example, download the sample data (sampledata.tgz) from https://cmu.box.com/s/e9u6alp73rfdhfve2a51p6v391vweodq into the example folder and decompress it with:

tar -xzvf sampledata.tgz

Then run the SQUID command in example/SQUIDcommand.sh. Alternatively, to test the combined STAR and SQUID workflow, make sure STAR is in your PATH and run the bash script example/STARnSQUIDcommand.sh.

cd example
./SQUIDcommand.sh
./STARnSQUIDcommand.sh

Annotate SQUID output

To label the predicted TSVs as fusion-gene or non-fusion-gene type, and to retrieve the corresponding gene names of fusion-gene TSVs, you can use the following Python script.

Python dependencies:

Usage:

python <squid_folder>/utils/AnnotateSQUIDOutput.py [options] <GTFfile> <SquidPrediction> <OutputFile>

Note that the GTF file must use the same chromosome names as the SQUID output, and each transcript record must contain three attributes: transcript ID, gene ID, and gene symbol (or gene name).

| Options | Default value | Data type | Description |
| --- | --- | --- | --- |
| --geneid | gene_id | string | GTF gene ID attribute string: the attribute name in the GTF record that corresponds to the gene ID |
| --genesymbol | gene_name | string | GTF gene symbol attribute string: the attribute name in the GTF record that corresponds to the gene symbol |
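
The GTF requirements above can be checked before running the annotation script. The following is a rough sketch, not part of SQUID: `check_gtf_lines` is a hypothetical helper that scans transcript records, collects their chromosome names, and reports records missing any of the three attributes. The gene ID and gene symbol attribute names default to the values in the options table. The attribute name `transcript_id` is an assumption based on the usual GTF convention.

```python
import re
from typing import Iterable, List, Set, Tuple

# Matches GTF attributes written as: key "value";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def check_gtf_lines(
    lines: Iterable[str],
    gene_id_attr: str = "gene_id",
    gene_symbol_attr: str = "gene_name",
    transcript_id_attr: str = "transcript_id",  # assumed attribute name
) -> Tuple[Set[str], List[str]]:
    """Return (chromosome names, problems) for the transcript records in a GTF."""
    required = (transcript_id_attr, gene_id_attr, gene_symbol_attr)
    chromosomes: Set[str] = set()
    problems: List[str] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 tab-separated fields")
            continue
        if fields[2] != "transcript":
            continue  # only transcript records are checked
        chromosomes.add(fields[0])
        present = {key for key, _ in ATTRIBUTE.findall(fields[8])}
        missing = [name for name in required if name not in present]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
    return chromosomes, problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as gtf:
        chroms, issues = check_gtf_lines(gtf)
    print("chromosomes in GTF:", ", ".join(sorted(chroms)))
    for issue in issues:
        print(issue)
```

The returned chromosome set can then be compared by eye, or with a set difference, against the chromosome names that appear in the SQUID prediction file.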