ChrisMaherLab / SV-Hotspot

8 stars 6 forks source link

SV-HotSpot: detection of hotspots targeted by recurrent structural variants associated with gene expression

tool logo

SV-HotSpot is a tool for detection and visualization of genome hotspots targeted by recurrent structural variants (SVs) associated with gene expression. To so so, SV-HotSpot seamlessly integrates SV calls, gene expression, genome annotations (inluding genes and other functional elements such as regulatory elements) to identify and annotate recurrent SVs, and assess their potential consequences on the expression of nearby genes.

SV-HotSpot is developed at Christopher Maher Lab at Washington University in St. Louis.

SV-HotSpot Manual

Installation

Clone the SV-Hotspot repository (with PATH_TO_SV_HOTSPOT is the local directory where you want to install SV-HotSpot):

cd PATH_TO_SV_HOTSPOT
git clone https://github.com/ChrisMaherLab/SV-Hotspot.git

Install required R packages by running the following command inside an R session:

install.packages(peakPick)
install.packages(ggplot2)
install.packages(plyr)
install.packages(reshape2)
install.packages(grid)
install.packages(gridExtra)
install.packages(gtable)
install.packages(data.table)
install.packages(RCircos)
install.packages(stringr)
install.packages(ggsignif)

SV-HotSpot uses bedTools (https://github.com/arq5x/bedtools2) for various overlapping and counting procedures, which should also be installed and made available as command lines.

Preparing Input

The tool requires as an input the following:

  1. Genome assembly name (e.g. hg38, hg19, mm9, mm10, etc.) which is used to extract chromosomes sizes file (a tab-delimited file with two columns, chrom and size). Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). SV-HotSpot built-in genomes are: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.

  2. Structural variants file in BEDPE format (Example of how to prepare this file is available in test_data folder). In case only VCF files are available, please try to convert them to BEDPE format using existing tools such vcftobedpe form svtools. Instructions on how to install it and use it can be found here.

Optional files:

  1. Expression data in a matrix format where the first column represents the feature (e.g. gene) and columns represents samples. See the example provided at test_data folder. Please note that features in this file have to be unique.

  2. Copy number segments in BED format. See the example at test_data on how to prepare this file.

  3. An annotation file in BED format. If the user didn't provide this file, a built-in annotation file based on the genome name will be used. Please note that the feature in the expression file has to match the one in this file.

  4. Region of interest file(s) (e.g. enhancers, transcription binding sites, etc) in BED format. An example of this file is given in test_data folder. Multiple files can be provided but have to be separated by comma (make sure no space between the file names).

  5. ChIP-Seq coverage in BED format. See the example at test_data on how to prepare this file.

All other parameters are optional and a default value was assigned to each (run sv-hotspot.pl --help for more details).


IMPORTANT NOTES:

Running SV-HotSpot

Assume PATH_TO_SV_HOTSPOT is the local directory where SV-HotSpot was installed, the following command runs SV-HotSpot on the mCRPC test data provided by SV-HotSpot. The test data are specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.

perl PATH_TO_SV_HOTSPOT/src/sv-hotspot.pl -o OUTPUT \
    -g hg38 -C chrX \
    --sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \
    -e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
    -c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
    -r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
    --chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
    --chip-cov-lbl H3K27ac \
    -w 100000 -s 30000 -d 50000 \
    --t-amp 2.99 --t-del 1.35 \
    --stat-test wilcox.test --pval 0.05 \
    --plot-top-peaks 10 --left-ext 100000 --right-ext 100000

More SV-HotSpot command line paramters:

USAGE:
      sv-hotspot.pl [OPTIONS] -g/--genome <genomeName> --sv <structuralVariants>

      NOTE:
    (1) Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). 
            - Built-in Genomes: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
            - Please refer to the documentation in case your genome is not listed above. 
    (2) Structutal variants file should be in "BEDPE" format.
    (3) Gene expression data and copy number segments are required to visualize hotspot regions.
    (4) Region of interest file(s) (e.g. promoters, enhancers, chip-seq, etc.) should be in "BED" format

OPTIONS:
    -w/--sliding-win-size       sliding window size <int>       [ sliding window size. default: 100kb ]
    -s/--sliding-win-step       sliding window step     <int>       [ sliding window step. default: 1kb ]
    -a/--annot          annotation file     <filename>  [ an annotation file in "BED" format ]
    -e/--exp            expression file     <filename>  [ expression file in a "matrix" format ]
    -c/--cn             Copy number file    <filename>  [ copy number segments in BED format ]
    -W/--peakPick-window-size   peak calling window     <int>       [ peakPick window size. default: 100bp ]
    -m/--peakPick-min-sd        peak calling min. SD    <int>       [ peakPick standard deviation cutoff. default: 5 ]
    -t/--pct-samples        percentage of samples   <int>       [ percentage of samples cutoff to call peaks. default: 10 ]
    -o/--output         output directory    <string>    [ results output directory. default: ./ ]
    -p/--pval           pvalue cuttoff      <float>     [ pvalue threshold used for significance. default: 0.05 ]
    -G/--genes-of-int       list of genes       <filename>  [ list of genes of interest to be used for visualization ]
    -r/--region-of-int      region(s) of interest   <filename>  [ region of interest file(s) in "BED" format separated by comma ]
    -C/--chrom          chromosome name     <string>    [ chromosome name used to detect hotspots. default: ALL ]
    -S/--sv-type            structural variant type <string>    [ SV type used to detect hotspots. default: ALL ]
    -d/--merge-dist-size        distance size       <int>       [ distance cutoff used to merge adjacent peaks. default: 10kb ]
    --merge-pct-samples     percentage of samples   <int>       [ percentage of samples cutoff to merge similar peaks. default: 5 ]
    --stop-merge-num-peaks      number of peaks     <int>       [ number of peaks cutoff to stop merging adjacent peaks. default: 0 ]
    -k/--num-nearby-genes       Number nearby genes <int>       [ number of up/downstream genes to the peak. default: 1 ]
    --t-amp             amplification threshold <float/int> [ amplification cutoff. default: 2.8 ]
    --t-del             deletion threshold  <float/int> [ deletion cutoff. default: 0.5 ]
    --stat-test         statistical test    <string>    [ statistical test used for comparison (wilcox.test or t.test). default: wilcox.test ]
    --chip-cov          chip-seq coverage   <filename>  [ chip-seq coverage file in "BED" foramt ]
    --chip-cov-lbl          chip-seq coverage label <string>    [ label used for chip-seq coverage ]
    --plot-top-peaks        plot top # peaks    <int>       [ number of top peaks to plot. default: top 10 ]
    --left-ext          size of left extension  <int>       [ size of the left extension of the peak. default: 0bp ]
    --right-ext         size of right extension <int>       [ size of the right extension of the peak. default: 0bp ]
    --genes-to-plot         genes names     <string>    [ names of genes to show on the plot. default: none.
                                          If no genes were provided, all genes in the peak will be plotted ]
    --plot-layout           plot orientation    <string>    [ orientation of the peak plot (wide or narrow). default: narrow ]

Results will be written to a subdirectory named sv-hotspot-output inside OUTPUT directory specified in the command.

Output

There are two main output files:

  1. annotated_peaks_summary.tsv: this file has all information about identified peaks.

  2. genes.associated.with.SVs.tsv: this file contains statisitcal information for all genes affected by SVs.

  3. In addition, SV-HotSpot provides various visualizations composed of overlaying tracks representing copy number aggregation, SV aggregation, and gene/regulatory/region of interest annotation tracks.

Additional files:

  1. Peaks files for each chromosome and their corresponding figures. These files are located in the peaks folder.
  2. UCSC custom track files. These files are located in peaks/ucsc_custom_track_files. These files can be viewed on the UCSC Genome Browser.

Plot Peaks (hotspot sites)

In some cases when the number of detected peaks is high, it is impractical to plot all peaks as this process takes long time. Thus, we set the tool to plot only the top # of peaks (default is 10). In case you need to increase/decrease this number, you need to provide this parameter --plot-top-peaks=# with the number of peaks required. Set this parameter to 0 in case you do not want to plot any peaks.

To plot peak(s), we provided a script for this process. You just need to provide the peak name(s), SV file, the results directory, the expression, and copy number data with the remaining parameters shown above. Peak names must be separated by comma. To show the usage page of this command type the following command:

plot-peak.pl

USAGE:
      plot-peak.pl [OPTIONS] -p <peakName1,peakName2,...> --sv <structuralVariants> --res-dir <resultsDirectory> -e/--expr <expression> -c/--cn <copynumber>

      NOTE:
    (1) Results directory should be the same as the output directory used with sv-hotspot.pl

OPTIONS:
    -a/--annot  Annotation file     <filename>  [ an annotation file in BED format ]
    -o/--output output directory    <string>    [ default: ./ ]
    --t-amp     amplification threshold <float/int> [ threshold for copy number amplifications. default: 2.8 ]
    --t-del     deletion threshold  <float/int> [ threshold for copy number deletions. default: 0.5 ]
    --chip-cov  chip-seq coverage   <filename>  [ If ChIP-Seq coverage file is provided, peaks will be overlapped with this file ]
    --chip-cov-lbl  chip-seq coverage label <string>    [ the chip-seq coverage label used in the plot (e.g. histone name) ]
    --left-ext  size of left extension  <int>       [ number of extended bases from the left side of the peak. default: 0 ]
    --right-ext size of right extension <int>       [ number of extended bases from the right side of the peak. default: 0 ]

As an example, the following command plots peak "pX.55.1" from the above test.

PATH_TO_SV_HOTSPOT/src/plot-peak.pl -p pX.55.1 \
             --sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \ 
             --res-dir /RESULTS/PATH \
             -e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
             -c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
             --chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
             -r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
             -o /SOME/PATH --t-amp 2.99 --t-del 1.35 --chip-cov-lbl H3K27ac \
             --left-ext 100000 --rigth-ext 100000

Example of SV-HotSpot visualization

(1) Recurrent SVs targeting a non-coding region located upstream of AR gene:

visualization example

(2) Circos plot of recurrent SV hotspots identified by SV-HotSpot across the genome:

visualization example

SV-HotSpot Docker Instructions

To use SV-HotSpot, a docker image has been created and tested on Linux and Mac. To run SV-HotSpot, you need to have Docker installed on your machine.

Docker Installation