aganezov / RCK

RCK: Reconstruction of clone- and haplotype-specific Cancer Karyotypes
MIT License

RCK is a method for Reconstruction of clone- and haplotype-specific Cancer Karyotypes from tumor mixtures, distributed both as a standalone software package and as a Python library under the MIT license.

RCK was initially designed and developed by Sergey Aganezov in the group of Prof. Ben Raphael at Princeton University (group site). Development of RCK is continued by Sergey Aganezov in the group of Prof. Michael Schatz at Johns Hopkins University (group site).

The algorithm and its application to published cancer datasets are described in full in:

Sergey Aganezov and Benjamin J. Raphael, 2019

Contents:

  1. Algorithm overview
  2. Installation
  3. Input preprocessing
    1. Novel Adjacencies
    2. Segment copy numbers
  4. High-level RCK data processing recipe
  5. Running RCK
  6. Results
  7. Citation
  8. Issues

Algorithm Overview

[Figure: RCK overview]

RCK infers clone- and haplotype-specific cancer genome karyotypes from tumor mixtures.

RCK assumes that:

RCK uses a Diploid Interval Adjacency Graph (DIAG) to represent all possible segments and transitions between them (across all clones and the reference). RCK then formulates the inference of clone- and haplotype-specific karyotypes (i.e., finding clone-specific edge multiplicity functions in the constructed DIAG) as a mixed-integer linear program (MILP) and solves it. Several constraints are taken into consideration during the inference (some of which are listed below):

We note that, in contrast to some other cancer karyotype inference methods, the RCK model has several advantages, all of which work within a unifying computational framework and some or all of which differentiate RCK from other methods:

Installation

RCK should work on the latest macOS and on the main Linux distributions. RCK is implemented in Python and designed to work with Python 3.7+. We highly recommend creating a separate Python virtual environment for RCK.
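As an illustration of the virtual environment recommendation, here is a minimal sketch that uses only the Python standard library. The helper name create_rck_env and the directory name rck-env are hypothetical, chosen for this example; they are not part of RCK.

```python
import sys
import venv
from pathlib import Path


def create_rck_env(env_dir, with_pip=True):
    """Create an isolated virtual environment and return its path.

    Hypothetical helper for illustration; not part of RCK.
    """
    # RCK is designed for Python 3.7+, so fail early on older interpreters.
    if sys.version_info < (3, 7):
        raise RuntimeError("RCK is designed to work with Python 3.7+")
    env_path = Path(env_dir)
    # Standard-library equivalent of running `python -m venv <env_dir>`.
    venv.create(str(env_path), with_pip=with_pip)
    return env_path


# Example call (the directory name "rck-env" is an assumption):
# create_rck_env("rck-env")
```

Once the environment is created and activated, RCK and the solver bindings can be installed into it as described in the installation documentation.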

RCK itself can be installed in three different ways:

RCK requires an ILP solver installed on the system, as well as Python bindings for it. Currently, only the Gurobi ILP solver is supported.

For more details about installation please refer to the installation documentation.

Input (preprocessing)

The minimum input for RCK consists of two parts:

  1. Unlabeled novel adjacencies (aka structural variations in the tumor sample)
  2. Clone- and allele-specific segment copy numbers

Additional input can contain:

RCK expects the input data to be in a (C/T)SV (Comma/Tab Separated Values) format. We provide a set of utility tools to convert the output of many state-of-the-art methods into the RCK format.
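To make the (C/T)SV convention concrete, the following sketch reads a delimited file with a header line using Python's csv module. The function name read_delimited and the column names col_a and col_b are placeholders invented for this example; the actual RCK columns are described in the adjacencies and segments documentation.

```python
import csv
import io


def read_delimited(stream, delimiter=None):
    """Read comma- or tab-separated rows with a header line into dicts."""
    text = stream.read()
    if delimiter is None:
        # Guess from the header line: use tabs if any are present.
        header = text.split("\n", 1)[0]
        delimiter = "\t" if "\t" in header else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Placeholder columns only; they do not reflect the real RCK format.
tsv_example = "col_a\tcol_b\n1\tx\n2\ty\n"
rows = read_delimited(io.StringIO(tsv_example))
print(rows[0]["col_a"], rows[1]["col_b"])
```

The same function accepts an open file handle, for example one obtained via open("input.rck.adj.tsv", newline="").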

Novel Adjacencies

Obtaining unlabeled (i.e., without allele information) novel adjacencies (aka structural variants) is not part of the RCK workflow, as many tools exist for this task. We provide the rck-adj-x2rck utility to convert the output of SV detection tools to the RCK format. We currently support converting the output of the following 3rd-party SV detection tools:

For more information about adjacencies, formats, converting, reciprocality, etc., please refer to the adjacencies documentation.

Segment copy numbers

Obtaining clone- and allele-specific segment copy numbers is not part of the RCK workflow, as many tools exist for this task. We provide the rck-scnt-x2rck utility to convert the output of tools that infer clone- and allele-specific segment copy numbers to the RCK format. We currently support converting the output of 3rd-party tools including HATCHet, TitanCNA, Battenberg, and ReMixT.

RCK data processing recipe

In most cases, the cancer sample of interest is initially represented by a set cancer.sr.fastq of sequenced reads. Additionally, a set normal.sr.fastq of sequenced reads from a matching normal sample needs to be available.
The most common analysis setup has standard Illumina paired-end reads for both the tumor and the matching normal. Increasingly, 3rd-generation sequencing technologies are being used in cancer analysis. Let us assume that there may optionally be a set cancer.lr.fastq of reads for the cancer sample in question, obtained via a 3rd-generation sequencing technology.

  1. Align sequenced reads cancer.sr.fastq and normal.sr.fastq for the cancer and matching normal samples (with your aligner of choice) to obtain cancer.sr.bam and normal.sr.bam
    1. Optionally align sequenced long reads cancer.lr.fastq to obtain cancer.lr.bam
  2. Run a tool of your choosing on cancer.sr.fastq to obtain a novel adjacencies VCF file cancer.sr.vcf
    1. Optionally infer novel adjacencies on the long-read dataset, obtaining cancer.lr.vcf
    2. Merge short- and long-read novel adjacencies into a unified set cancer.vcf (we suggest using the SURVIVOR tool [code | paper] for this task)
  3. Convert novel adjacencies from the VCF file cancer.vcf to the RCK input format via rck-adj-x2rck x cancer.vcf -o input.rck.adj.tsv, where x stands for the novel adjacency inference tool. Please see the adjacencies docs for the list of supported tools and more detailed instructions on conversion.
  4. Run any of the supported tools (HATCHet, TitanCNA, Battenberg, ReMixT) to infer large-scale clone- and allele-specific fragment copy numbers CN.data (a generic name for the tool-specific result)
  5. Convert the tool-specific copy number data CN.data into the RCK format via rck-scnt-x2rck x CN.data -o input.rck.scnt.tsv, where x stands for the copy number inference tool. Please see the segments docs for links to specific methods, as well as details on how to run the conversion.
  6. Run RCK
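
The two conversion steps of the recipe (steps 3 and 5) can be scripted. The sketch below only assembles the command lines exactly as they are written in the recipe; the Python function names are hypothetical, and x remains the recipe's placeholder for the tool name, so consult the adjacencies and segments docs for the values it can actually take.

```python
import shlex


def adj_conversion_cmd(tool, vcf, out="input.rck.adj.tsv"):
    """Step 3: convert a tool-specific VCF to the RCK adjacency format."""
    return ["rck-adj-x2rck", tool, vcf, "-o", out]


def scnt_conversion_cmd(tool, cn_data, out="input.rck.scnt.tsv"):
    """Step 5: convert tool-specific copy numbers to the RCK format."""
    return ["rck-scnt-x2rck", tool, cn_data, "-o", out]


# "x" is kept as the recipe's placeholder; replace it with a supported tool.
for cmd in (adj_conversion_cmd("x", "cancer.vcf"),
            scnt_conversion_cmd("x", "CN.data")):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the RCK utilities are installed and on the PATH.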

Running RCK

We provide the rck tool to run the main RCK algorithm for clone- and haplotype-specific cancer karyotype reconstruction.

With the minimum input, RCK can be run as follows:

rck --scnt input.rck.scnt.tsv --adjacencies input.rck.adj.tsv

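where input.rck.scnt.tsv holds the clone- and allele-specific segment copy numbers and input.rck.adj.tsv holds the unlabeled novel adjacencies, both converted to the RCK format.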

Additionally, one can specify a working directory via --workdir, where the input, preprocessing, and output will be stored. For more on the rck command usage, please refer to the usage documentation.
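
For scripted runs, the invocation above can be wrapped in Python. This is a sketch under assumptions: the function names rck_cmd and run_rck and the directory name rck-workdir are hypothetical, and the adjacency flag is assumed to be spelled --adjacencies; please verify the flags against the usage documentation.

```python
import shutil
import subprocess


def rck_cmd(scnt, adjacencies, workdir=None):
    """Assemble the rck command line from the two minimum inputs."""
    # Flag spelling is an assumption; check the usage documentation.
    cmd = ["rck", "--scnt", scnt, "--adjacencies", adjacencies]
    if workdir is not None:
        cmd += ["--workdir", workdir]
    return cmd


def run_rck(scnt, adjacencies, workdir=None):
    """Run rck, failing early if the executable is not on the PATH."""
    if shutil.which("rck") is None:
        raise FileNotFoundError("the rck executable was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(rck_cmd(scnt, adjacencies, workdir), check=True)


# "rck-workdir" is a hypothetical directory name used for illustration.
print(rck_cmd("input.rck.scnt.tsv", "input.rck.adj.tsv", workdir="rck-workdir"))
```

Keeping command assembly separate from execution makes it easy to log or inspect the exact command before a long-running inference starts.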

Results

This section describes the results produced by the main rck method for cancer karyotype reconstruction. For the results of segment/adjacency conversion and processing, please refer to the respective segment/adjacency documentation.

RCK's cancer karyotype reconstruction is stored in the output subdirectory in the working directory (the --workdir). The following two files depict the inferred clone- and haplotype-specific karyotypes:

For information about the format of the inferred clone- and haplotype-specific copy numbers on segments/adjacencies, please refer to the segment/adjacency documentation.
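
To inspect what a run produced, one can list the contents of the output subdirectory of the working directory. The sketch below assumes only that layout; the function name list_rck_outputs and the directory name rck-workdir are hypothetical, and no result file names are assumed.

```python
from pathlib import Path


def list_rck_outputs(workdir):
    """Return the files in the `output` subdirectory of workdir, sorted."""
    output_dir = Path(workdir) / "output"
    if not output_dir.is_dir():
        # Nothing has been produced yet, or the workdir is wrong.
        return []
    return sorted(p for p in output_dir.iterdir() if p.is_file())


# "rck-workdir" is a hypothetical directory name used for illustration.
for path in list_rck_outputs("rck-workdir"):
    print(path.name)
```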

Results from the original manuscript can be found in the dedicated GitHub repository.

Citation

When using RCK's cancer karyotype reconstruction algorithm or any of RCK's utilities, please cite the following paper:

Sergey Aganezov and Benjamin J. Raphael, 2019

Issues

If you experience any issues with RCK installation, usage, or results, or want to see RCK enhanced in any way, shape or form, please create an issue on the RCK issue tracker. Please make sure to specify the versions of RCK, Python, and Gurobi in question and, if possible, provide (minimized) data on which the issue(s) occur(s).
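
A short script can collect the version information requested above. This is a sketch, not an RCK utility: it assumes that the installed distributions are named rck and gurobipy, and it relies on importlib.metadata, which is available from Python 3.8 onward.

```python
import platform


def package_version(dist_name):
    """Return the installed version of a distribution, if discoverable."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # importlib.metadata was added to the standard library in 3.8.
        return "unknown (importlib.metadata requires Python 3.8+)"
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


def version_report(dist_names=("rck", "gurobipy")):
    """Build a text block suitable for pasting into an issue report."""
    # The default distribution names are assumptions; adjust as needed.
    lines = ["Python: " + platform.python_version()]
    for name in dist_names:
        lines.append(name + ": " + package_version(name))
    return "\n".join(lines)


print(version_report())
```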

If you want to discuss any avenues of collaboration, custom RCK applications, etc., please contact Sergey Aganezov at aganezov(at)jhu.edu or sergeyaganezovjr(at)gmail.com.