RabadanLab / Pegasus

Annotation and Prediction of Oncogenic Gene Fusions in RNAseq
MIT License

Pegasus: annotation and prediction of oncogenic gene fusions in RNAseq

This software is maintained by Francesco Abate and Sakellarios Zairis. Please cite our paper if you use this tool in your analysis. Pegasus has been successfully deployed in the following selected projects:

Requirements:


Setup:


Clone this repository and do not alter the directory structure. Locally train the classifier as follows:

$ cd learn
$ python train_model.py

This creates two serialized data structures, stored as pickle files, in the current directory (learn).

The wrapper for running the pipeline is pegasus.pl; executing the file displays its command line arguments. Each run of Pegasus requires a configuration file based on the template provided, as well as a data specification file. An example of the data specification file can be found in sample_pipeline_input/data_spec.txt.
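To confirm that the training step produced its two pickle files, one option is to compare the contents of the learn directory before and after running the script. The following is a minimal sketch, not part of Pegasus: the helper name run_and_list_new_files is hypothetical, and because the names of the pickle files are not stated here, the function simply reports whatever new files appear.

```python
import subprocess
from pathlib import Path


def run_and_list_new_files(command, workdir):
    """Run `command` inside `workdir` and return the names of entries it created.

    Hypothetical helper for illustration; it is not shipped with Pegasus.
    """
    workdir = Path(workdir)
    before = {p.name for p in workdir.iterdir()}
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(command, cwd=workdir, check=True)
    after = {p.name for p in workdir.iterdir()}
    return sorted(after - before)
```

Calling run_and_list_new_files(["python", "train_model.py"], "learn") from the repository root should, on a first training run, list two new files. If the files already exist from an earlier run, nothing new is reported.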

Usage:


Copy the sample configuration file to the directory of your run and modify the required fields:

  1. set the path to the cloned Pegasus repository
  2. set the paths for the human genome and annotation reference files
  3. set the sample_type to be analyzed by Pegasus (this string is matched to a descriptor field in the data_spec.txt input file)

Construct a properly formatted data specification file for the samples to be analyzed in the run.

If using the "general" input format for fusion candidates, follow the examples included in sample_pipeline_input/candidates. When constructing a general input file, take care to correctly identify the start and end points of the 5p and 3p partners with respect to the breakpoint. We use the term "split reads" to mean the number of reads that actually contain the breakpoint.

A sample invocation of Pegasus from the command line looks like this:

$ pegasus.pl -c config.txt -d data_spec.txt -l log_folder -o output_folder
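When Pegasus is launched from a script rather than typed by hand, it can help to check the input files before starting. The sketch below only assembles the command shown above; the function name pegasus_command is hypothetical, and the flags -c, -d, -l and -o are the ones from the sample invocation. It assumes pegasus.pl is reachable under the name passed as script.

```python
from pathlib import Path


def pegasus_command(config, data_spec, log_folder, output_folder, script="pegasus.pl"):
    """Build the argument list for one Pegasus run, checking the inputs first.

    Hypothetical helper for illustration; it is not shipped with Pegasus.
    """
    # Fail early if the configuration or data specification file is missing.
    for required in (config, data_spec):
        if not Path(required).is_file():
            raise FileNotFoundError(f"missing input file: {required}")
    return [script, "-c", str(config), "-d", str(data_spec),
            "-l", str(log_folder), "-o", str(output_folder)]
```

The returned list can be passed to subprocess.run(..., check=True), for example subprocess.run(pegasus_command("config.txt", "data_spec.txt", "log_folder", "output_folder"), check=True).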

Pegasus is designed to be interrupted at any stage. When the pipeline is re-run or restarted, all previously completed steps are skipped. To force a complete re-run from the first step, remove the contents of the log and output folders and then invoke pegasus.pl.
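Emptying the two folders before a complete re-run can be scripted. This is a minimal sketch under the assumption that the folders themselves should be kept and only their contents removed; the helper names clear_folder and reset_run are hypothetical and not part of Pegasus.

```python
import shutil
from pathlib import Path


def clear_folder(folder):
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for entry in Path(folder).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed


def reset_run(log_folder, output_folder):
    """Empty both run folders so that no completed step is skipped."""
    return {str(p): clear_folder(p) for p in (log_folder, output_folder)}
```

After reset_run("log_folder", "output_folder"), invoking pegasus.pl with the same arguments starts again from the first step. The deletion is irreversible, so it is worth double-checking both paths first.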

Output:


A successfully completed run produces a final output file called pegasus.output.txt. The report contains the "Pegasus driver score" as its first column, along with many attributes and annotations of the fusion candidates. Some of the key fields are listed below: