This repository is comprises a suite of tools to aid with copy number analysis and tumor purity estimation. These tools are designed for use with data in the VCF format, Mutation Annotation format (MAF), or Bambino format (High20). The types of data accepted are 1) Whole Exome Sequncing Data, 2) Paired Tumor/Normal Whole Genome Sequencing Data, 3) RNA-Seq Analysis - to determine allelic expression.
https://github.com/XCLab/VCF2CNA/blob/master/Figure1_flowchart.pdf
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
What things you need to install the software and how to install them
Linux Operating System
Perl version 5.10.1 or greater
R version 3.4.0
R Packages: Tree, gplots, RColorBrewer, ggplot2, reshape2, grid
Python version 2.7.12
Download the repository and place it in desired directory The repository will be in a folder called VCF2CNA
From the VCF2CNA Folder run the extraction command
bash extract.sh
VCF2CNA relies on germline and somatic SNVs. Do not filter germline SNVs from the input data. VCF2CNA expects paired tumor/normal read count data for heterozygous sites from all chromosomes.
To run the application use the execute.py python script
Argument | Description |
---|---|
FileName | File to Analyze |
FilePath | Path to FileName |
OutputDirectory | Path to store results |
python execute.py [FILENAME][FILEPATH][OUTPUT_DIRECTORY]
Navigate to test data directory
Extract individual xz files and combine them into one file
sh extract_example.sh
Run example data with default parameters
python execute.py test.high_20.out ~/test_data ~/example_output
Run example data with user specified Diploid Chromosome
python execute.py test.high_20.out ~/test_data ~/example_output -DiploidChr=13
Run example data with user specified Region inside Chromosome
python execute.py test.high_20.out ~/test_data ~/example_output -DiploidChr=13 -StartChr=19020701 -EndChr=26516137
Argument | Description | Default |
---|---|---|
VcfOrder | Tumor/Normal TN or NT | TN |
Diploid Chr | Chromosome used for normalization | -1 |
StartChr | Start location on Diploid Chr | -1 |
EndChr | End location on Diploid Chr | -1 |
Median | Median Normal Coverage | -1 |
MinSf | Minimum Scale Factor | 0.50 |
MaxSf | Maximum Scale Factor | 1.50 |
XminSf | Minimum X Scale Factor | 0.25 |
XmaxSf | Maximum X Scale Factor | 1.50 |
R_Loc | Path to R Program | "" |
R_ScriptLoc | Path to RScript Program | "" |
Filename | Description |
---|---|
Samplename_CONSERTING_Mapability_100.txt | Copy Number Segments |
Samplename_LOH_RegTree.txt | Allelic Imbalance by segment |
Samplename.jpg | Visualization of Copy Number |
purity_Samplename.txt | Overall Tumor Purity Estimation |
purity_Samplename.csv | Tumor Purity by Segment and Type |
purity_Samplename.png | Visualization of Purity Segments |
Column | Description |
---|---|
chrom | Chromosome (X=23 and Y=24) |
loc.start | Starting coordinate of the segments |
loc.end | Ending coordinates of the segments |
num.mark | Number of windows in the segment (Sequencing gaps and low mapability scores excluded) |
seg.mean | The estimated GC corrected difference signal (2 copy gain will have a seg.mean of 1) |
GMEAN | The mean coverage in the germline sample (A value of 1 represents diploid) |
DMEAN | The mean coverage in the tumor sample |
LogRatio | The log2ratio of the coverage signal between tumor and germline samples |