SV-Bay

SV-Bay is a tool for structural variant detection in cancer genomes using a Bayesian approach with correction for GC-content and read mappability. The algorithm description can be found in the article.

Installation

SV-Bay is implemented in Python 2 and was tested on both Linux and Mac OS X. Although it works with both Python 2.6 and 2.7, we strongly recommend using 2.7, as it shows a significant performance improvement due to differences in the garbage collector (GC) implementation.

A number of Python libraries are required to run SV-Bay. Installation from scratch on Ubuntu 14.04 is shown below:

sudo apt-get update
sudo apt-get install build-essential python-dev zlib1g-dev unzip
sudo apt-get install python-numpy python-scipy python-matplotlib
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install pyaml pysam joblib
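
Before moving on, it can be worth confirming that the libraries import cleanly. The snippet below is a minimal sketch and not part of SV-Bay: the helper name `missing_modules` is hypothetical, and the module names assume that the packages above install under `numpy`, `scipy`, `matplotlib`, `yaml`, `pysam` and `joblib`.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the packages installed above.
REQUIRED = ["numpy", "scipy", "matplotlib", "yaml", "pysam", "joblib"]

if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are importable.")
```

If any module is reported as missing, re-running the corresponding install command is usually the first thing to try.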

After that you can clone the SV-Bay repository:

sudo apt-get install git
git clone https://github.com/InstitutCurie/SV-Bay.git

Then edit the sample config.yaml file and proceed to input data preparation, as explained below.

Configuration

SV-Bay uses a config file in YAML format. This file is shared by all processing steps. The following config options are not related to input data (options related to input data are described in the next section):

__working_dir : "/.../sv-bay-data/"__ Common directory for all processing. All other files and folders will be created inside this one.

__clustering_parallel_processes : 1__ Number of parallel processes for clustering. SV-Bay is fast even with one process (less than 2 hours for mate-pair data with coverage 12). If the number of processes is greater than 1, the clustering log will be unordered and very hard to read, so change it only if speed is crucial for you.

__chromosomes : ["chr14","chr15", "chr17"]__ Chromosomes to process.

__exp_num_sv: 100__ Expected number of structural variants.

__alpha : 0.01__ Distribution cutoff used when deciding whether a read is normal or abnormal.

__read_length : 50__ Read length in input data.

__ploidy : 4.0__ Ploidy of input data.

__numb_allel : 8__ Minimum number of alleles required to mark an SV as a co-amplification.

__links_probabilities_file : "links_probabilities.txt"__ File to output resulting clusters and probabilities.

__valid_links_dir : "valid_links/"__ Directory to output resulting valid clusters.

There are also several internal options: debug, clustering_log_file, probabilites_log_file, normal_fragments_dir, length_histogram_file, clusters_files_dir, lambda_file, serialized_stats_file. They are described in the sample config.yaml file. It is generally unnecessary to change their default values.
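
As a sanity check before a long run, the options listed above can be validated once the YAML file has been loaded into a dictionary. The sketch below is not part of SV-Bay: the function `config_problems` is hypothetical, and the expected types and the range check on `alpha` are assumptions based on the example values shown above.

```python
# Option names come from the list above; the expected types are assumptions
# inferred from the example values.
EXPECTED_TYPES = {
    "working_dir": str,
    "clustering_parallel_processes": int,
    "chromosomes": list,
    "exp_num_sv": int,
    "alpha": float,
    "read_length": int,
    "ploidy": (int, float),
    "numb_allel": int,
    "links_probabilities_file": str,
    "valid_links_dir": str,
}


def config_problems(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key, expected in sorted(EXPECTED_TYPES.items()):
        if key not in config:
            problems.append("missing option: " + key)
        elif not isinstance(config[key], expected):
            problems.append("unexpected type for " + key)
    alpha = config.get("alpha")
    if isinstance(alpha, float) and not 0.0 < alpha < 1.0:
        problems.append("alpha should lie strictly between 0 and 1")
    return problems


# Example values copied from the option descriptions above.
example = {
    "working_dir": "/.../sv-bay-data/",
    "clustering_parallel_processes": 1,
    "chromosomes": ["chr14", "chr15", "chr17"],
    "exp_num_sv": 100,
    "alpha": 0.01,
    "read_length": 50,
    "ploidy": 4.0,
    "numb_allel": 8,
    "links_probabilities_file": "links_probabilities.txt",
    "valid_links_dir": "valid_links/",
}

for problem in config_problems(example):
    print(problem)
```

In practice the dictionary would come from parsing config.yaml with a YAML library; the hard-coded `example` only keeps the sketch self-contained.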

Input data

SV-Bay requires a number of input files to work. This can look a bit confusing, but most of these files are common to all human genome analyses and can simply be downloaded. Config options related to input are described below:

__sam_files_dir : "bam/"__ Input directory with per-chromosome bam or sam files. Bam files should be sorted and indexed, and the .bam.bai files should be in the same folder. The name of the file for each chromosome must contain "chrSomething", e.g. "chr7_sorted.bam" or "chrX.sam". If you have one bam for the whole genome, split it with the utility script as follows:

python src/utils/separately_save_sam_samtools.py -i yourBigBAMfile.bam -o outputDir/

__fa_files_dir : "fa/"__ Input directory with per-chromosome .fa files. Fa file names should consist exactly of chromosome name and extension, e.g. chr14.fa. You can download fa files for hg19 and hg38: http://xfer.curie.fr/get/aBxK5d1BWr6/hg19_chromosomes_fa.zip and http://xfer.curie.fr/get/2mRqHdYxzw4/hg38_chromosomes_fa.zip.

__gem_files_dir : "gem/"__ Input directory with per-chromosome .gem mappability files. Gem file names should consist exactly of chromosome name and extension, e.g. chr14.gem. You can download pre-calculated gem files for hg19 and hg38: http://xfer.curie.fr/get/kRScTtWDgdA/gem_hg19.tar.gz and http://xfer.curie.fr/get/pVFoxp28pBt/gem_hg38.tar.gz. If you have one gem file for the whole genome, split it with the utility script as follows:

python src/utils/sep_save_gem.py -i yourBigGEMfile.gem -o outputDir/

__centromic_file : "centrom_hg38.txt"__ Input file with information about centromere positions in the human genome. Files for hg19 and hg38 are available in the data subfolder of the SV-Bay repository (data/centrom_hg19.txt and data/centrom_hg38.txt).

__cnv_file: "simulated_reads_cnv.txt"__ File generated by Control-FREEC. For the test data it is available in the data subfolder of the SV-Bay repository (data/simulated_reads_cnv.txt).
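
The naming rules above (a "chrSomething" token in bam/sam names, and exact chromosome names for .fa and .gem files) can be checked with a small script before running the tool. This is a minimal sketch: the function `chromosome_from_filename` and the regular expression are hypothetical and only approximate the rule, assuming chromosome names of the form "chr" followed by a number, X, Y or M. SV-Bay's own matching may differ.

```python
import re

# Assumed pattern: "chr" followed by a number or one of X, Y, M.
CHROMOSOME_PATTERN = re.compile(r"chr(?:\d+|[XYM])")


def chromosome_from_filename(filename):
    """Return the first chromosome token found in a file name, or None."""
    match = CHROMOSOME_PATTERN.search(filename)
    return match.group(0) if match else None


# File names taken from the examples in the option descriptions.
for name in ["chr7_sorted.bam", "chrX.sam", "chr14.fa", "chr14.gem"]:
    print(name + " -> " + str(chromosome_from_filename(name)))
```

A file for which the function returns None would be a candidate for renaming before it is placed in the input directories.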

Preparation of the example data to run SV-Bay is shown below:

mkdir sv-bay-data/ && cd sv-bay-data
mkdir bam && cd bam
wget https://www.dropbox.com/s/zcojeehmhkygli4/bam_tumor.tar.gz && tar xzf bam_tumor.tar.gz && mv bam_tumor/* . && cd ..
mkdir fa_files && cd fa_files
wget http://xfer.curie.fr/get/2mRqHdYxzw4/hg38_chromosomes_fa.zip && unzip hg38_chromosomes_fa.zip && cd ..
mkdir gem_files && cd gem_files
wget http://xfer.curie.fr/get/pVFoxp28pBt/gem_hg38.tar.gz && tar xzf gem_hg38.tar.gz && mv gem_hg38/* . && cd ..
cp ~/SV-Bay/data/centrom_hg38.txt .
cp ~/SV-Bay/data/simulated_reads_cnv.txt .

Now change __working_dir__ in the sample config and you are ready to run SV-Bay.
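
To catch a missing download early, the prepared directory can be checked against the list of chromosomes to process. The sketch below is not part of SV-Bay: the function `missing_inputs` is hypothetical, and the default folder names `fa_files` and `gem_files` simply follow the example commands above (they should match `fa_files_dir` and `gem_files_dir` in your config).

```python
import os


def missing_inputs(working_dir, chromosomes, fa_dir="fa_files",
                   gem_dir="gem_files", exists=os.path.exists):
    """Return the expected .fa and .gem paths that are not present.

    `exists` is injectable so the check can be exercised without real files.
    """
    missing = []
    for chromosome in chromosomes:
        expected = [
            os.path.join(working_dir, fa_dir, chromosome + ".fa"),
            os.path.join(working_dir, gem_dir, chromosome + ".gem"),
        ]
        missing.extend(path for path in expected if not exists(path))
    return missing


if __name__ == "__main__":
    # Directory name and chromosomes follow the example preparation above.
    for path in missing_inputs("sv-bay-data", ["chr14", "chr15", "chr17"]):
        print("missing: " + path)
```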

Workflow

The SV-Bay workflow consists of 3 steps. The config file is shared by all steps.

Normal/abnormal fragments separation and clustering

In this step SV-Bay calculates statistics of the fragment length distribution, separates normal and abnormal fragments, and clusters the abnormal fragments.

python -B src/main_clustering.py -c config/config.yaml
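
To give an intuition for how a distribution cutoff such as `alpha` can separate normal from abnormal fragments, here is a minimal sketch. It is an illustration only and not SV-Bay's implementation: it assumes a two-sided empirical cutoff that leaves `alpha / 2` of the observed fragment lengths in each tail, and the function names `length_cutoffs` and `is_abnormal` are hypothetical.

```python
def length_cutoffs(lengths, alpha):
    """Return (low, high) empirical cutoffs leaving about alpha/2 per tail."""
    ordered = sorted(lengths)
    if not ordered:
        raise ValueError("at least one fragment length is required")
    n = len(ordered)
    # Number of observations to leave outside the cutoffs on each side.
    tail = int(n * alpha / 2.0)
    return ordered[tail], ordered[n - 1 - tail]


def is_abnormal(length, cutoffs):
    """Flag a fragment whose length falls outside the (low, high) cutoffs."""
    low, high = cutoffs
    return length < low or length > high


# Toy fragment lengths; real values would come from the aligned read pairs.
toy_lengths = [480, 495, 500, 502, 505, 510, 515, 520, 3000, 40]
toy_cutoffs = length_cutoffs(toy_lengths, 0.2)
flagged = [value for value in toy_lengths if is_abnormal(value, toy_cutoffs)]
print(flagged)
```

A smaller `alpha` widens the accepted range, so fewer fragments are treated as abnormal.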

Applying probabilistic model to validate clusters

In this step SV-Bay calculates a probability for each cluster to determine whether it is noise or a real SV.

python -B src/main_probabilities.py -c config/config.yaml

Complex SVs assembly

In this step SV-Bay assembles clusters into complex and simple SVs and outputs the final results.

python -B src/main_assemly_links.py -c config/config.yaml > results

The script main_assemly_links.py can also exclude germline mutations if the respective data is available. To do so, run main_clustering.py on the germline dataset using a separate working_dir, and then run main_assemly_links.py with the -n flag and the name of the folder containing the germline clusters:

python -B src/main_clustering.py -c config/config_germ.yaml
python -B src/main_assemly_links.py -c config/config.yaml -n '/home/sv-bay/sv-bay-data-germ/cluster_files/' > results
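
The three steps can also be chained from a small driver script. The following is a sketch rather than an official wrapper: the functions `build_commands` and `run_pipeline` are hypothetical, the script paths and flags are copied from the commands above, and it assumes it is run from the root of the SV-Bay repository. When germline clusters are used, clustering of the germline dataset still has to be run separately beforehand.

```python
import subprocess


def build_commands(config_path, germline_clusters_dir=None):
    """Return the three step commands as argument lists, in run order."""
    clustering = ["python", "-B", "src/main_clustering.py", "-c", config_path]
    probabilities = ["python", "-B", "src/main_probabilities.py", "-c", config_path]
    assembly = ["python", "-B", "src/main_assemly_links.py", "-c", config_path]
    if germline_clusters_dir is not None:
        # Optional germline filtering, as in the -n example above.
        assembly += ["-n", germline_clusters_dir]
    return [clustering, probabilities, assembly]


def run_pipeline(config_path, results_path, germline_clusters_dir=None):
    """Run all steps, stopping at the first failure; save step 3 output."""
    commands = build_commands(config_path, germline_clusters_dir)
    for command in commands[:-1]:
        subprocess.check_call(command)
    # The last step prints its results to stdout, so redirect it to a file.
    with open(results_path, "w") as results:
        subprocess.check_call(commands[-1], stdout=results)
```

For example, `run_pipeline("config/config.yaml", "results")` would correspond to the three commands of the basic workflow.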

Test data

Please download example tumor and control bam files for chromosomes 14, 15 and 17 to test SV-Bay here. This is a tar.gz file of approximately 1.7 GB that contains all the data you need to run the tool on 3 chromosomes:

separated fasta files (hg38 version)
.gem files
bam files for tumor samples
bam files for control samples
Results of FREEC tool run (for tumor sample)
list of centromeres (hg38 version)

The config file for the test data is located in:
```
config/config_test/config_tumor.yaml
```
