SV-HotSpot: detection of hotspots targeted by recurrent structural variants associated with gene expression

SV-HotSpot is a tool for detection and visualization of genome hotspots targeted by recurrent structural variants (SVs) associated with gene expression. To so so, SV-HotSpot seamlessly integrates SV calls, gene expression, genome annotations (inluding genes and other functional elements such as regulatory elements) to identify and annotate recurrent SVs, and assess their potential consequences on the expression of nearby genes.

SV-HotSpot is developed at Christopher Maher Lab at Washington University in St. Louis.

SV-HotSpot Manual

Installation

Clone the SV-Hotspot repository (with PATH_TO_SV_HOTSPOT is the local directory where you want to install SV-HotSpot):

cd PATH_TO_SV_HOTSPOT
git clone https://github.com/ChrisMaherLab/SV-Hotspot.git

Install required R packages by running the following command inside an R session:

install.packages(peakPick)
install.packages(ggplot2)
install.packages(plyr)
install.packages(reshape2)
install.packages(grid)
install.packages(gridExtra)
install.packages(gtable)
install.packages(data.table)
install.packages(RCircos)
install.packages(stringr)
install.packages(ggsignif)

SV-HotSpot uses bedTools (https://github.com/arq5x/bedtools2) for various overlapping and counting procedures, which should also be installed and made available as command lines.

Preparing Input

The tool requires as an input the following:

Genome assembly name (e.g. hg38, hg19, mm9, mm10, etc.) which is used to extract chromosomes sizes file (a tab-delimited file with two columns, chrom and size). Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). SV-HotSpot built-in genomes are: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
Structural variants file in BEDPE format (Example of how to prepare this file is available in test_data folder). In case only VCF files are available, please try to convert them to BEDPE format using existing tools such vcftobedpe form svtools. Instructions on how to install it and use it can be found here.

Optional files:

Expression data in a matrix format where the first column represents the feature (e.g. gene) and columns represents samples. See the example provided at test_data folder. Please note that features in this file have to be unique.
Copy number segments in BED format. See the example at test_data on how to prepare this file.
An annotation file in BED format. If the user didn't provide this file, a built-in annotation file based on the genome name will be used. Please note that the feature in the expression file has to match the one in this file.
Region of interest file(s) (e.g. enhancers, transcription binding sites, etc) in BED format. An example of this file is given in test_data folder. Multiple files can be provided but have to be separated by comma (make sure no space between the file names).
ChIP-Seq coverage in BED format. See the example at test_data on how to prepare this file.

All other parameters are optional and a default value was assigned to each (run sv-hotspot.pl --help for more details).

IMPORTANT NOTES:

To prepare all your files, please look at the examples provided in test_data folder and do accordingly.
Structural variant types must be in the format of THREE letters and should only include the following types: BND, DEL, DUP, INS, INV.
All files headers should be the same as the ones in the examples files.
The chromosome names in all files should be consistent and in the format of chr#.
The feature name in the annotation file and the expression file should be the same.
The "name" column in the SVs file should be in the format of sample/type (e.g. Sample1/INV)

Running SV-HotSpot

Assume PATH_TO_SV_HOTSPOT is the local directory where SV-HotSpot was installed, the following command runs SV-HotSpot on the mCRPC test data provided by SV-HotSpot. The test data are specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.

perl PATH_TO_SV_HOTSPOT/src/sv-hotspot.pl -o OUTPUT \
    -g hg38 -C chrX \
    --sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \
    -e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
    -c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
    -r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
    --chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
    --chip-cov-lbl H3K27ac \
    -w 100000 -s 30000 -d 50000 \
    --t-amp 2.99 --t-del 1.35 \
    --stat-test wilcox.test --pval 0.05 \
    --plot-top-peaks 10 --left-ext 100000 --right-ext 100000

More SV-HotSpot command line paramters:

USAGE:
      sv-hotspot.pl [OPTIONS] -g/--genome <genomeName> --sv <structuralVariants>

      NOTE:
    (1) Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). 
            - Built-in Genomes: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
            - Please refer to the documentation in case your genome is not listed above. 
    (2) Structutal variants file should be in "BEDPE" format.
    (3) Gene expression data and copy number segments are required to visualize hotspot regions.
    (4) Region of interest file(s) (e.g. promoters, enhancers, chip-seq, etc.) should be in "BED" format

OPTIONS:
    -w/--sliding-win-size       sliding window size <int>       [ sliding window size. default: 100kb ]
    -s/--sliding-win-step       sliding window step     <int>       [ sliding window step. default: 1kb ]
    -a/--annot          annotation file     <filename>  [ an annotation file in "BED" format ]
    -e/--exp            expression file     <filename>  [ expression file in a "matrix" format ]
    -c/--cn             Copy number file    <filename>  [ copy number segments in BED format ]
    -W/--peakPick-window-size   peak calling window     <int>       [ peakPick window size. default: 100bp ]
    -m/--peakPick-min-sd        peak calling min. SD    <int>       [ peakPick standard deviation cutoff. default: 5 ]
    -t/--pct-samples        percentage of samples   <int>       [ percentage of samples cutoff to call peaks. default: 10 ]
    -o/--output         output directory    <string>    [ results output directory. default: ./ ]
    -p/--pval           pvalue cuttoff      <float>     [ pvalue threshold used for significance. default: 0.05 ]
    -G/--genes-of-int       list of genes       <filename>  [ list of genes of interest to be used for visualization ]
    -r/--region-of-int      region(s) of interest   <filename>  [ region of interest file(s) in "BED" format separated by comma ]
    -C/--chrom          chromosome name     <string>    [ chromosome name used to detect hotspots. default: ALL ]
    -S/--sv-type            structural variant type <string>    [ SV type used to detect hotspots. default: ALL ]
    -d/--merge-dist-size        distance size       <int>       [ distance cutoff used to merge adjacent peaks. default: 10kb ]
    --merge-pct-samples     percentage of samples   <int>       [ percentage of samples cutoff to merge similar peaks. default: 5 ]
    --stop-merge-num-peaks      number of peaks     <int>       [ number of peaks cutoff to stop merging adjacent peaks. default: 0 ]
    -k/--num-nearby-genes       Number nearby genes <int>       [ number of up/downstream genes to the peak. default: 1 ]
    --t-amp             amplification threshold <float/int> [ amplification cutoff. default: 2.8 ]
    --t-del             deletion threshold  <float/int> [ deletion cutoff. default: 0.5 ]
    --stat-test         statistical test    <string>    [ statistical test used for comparison (wilcox.test or t.test). default: wilcox.test ]
    --chip-cov          chip-seq coverage   <filename>  [ chip-seq coverage file in "BED" foramt ]
    --chip-cov-lbl          chip-seq coverage label <string>    [ label used for chip-seq coverage ]
    --plot-top-peaks        plot top # peaks    <int>       [ number of top peaks to plot. default: top 10 ]
    --left-ext          size of left extension  <int>       [ size of the left extension of the peak. default: 0bp ]
    --right-ext         size of right extension <int>       [ size of the right extension of the peak. default: 0bp ]
    --genes-to-plot         genes names     <string>    [ names of genes to show on the plot. default: none.
                                          If no genes were provided, all genes in the peak will be plotted ]
    --plot-layout           plot orientation    <string>    [ orientation of the peak plot (wide or narrow). default: narrow ]

Results will be written to a subdirectory named sv-hotspot-output inside OUTPUT directory specified in the command.

Output

There are two main output files:

annotated_peaks_summary.tsv: this file has all information about identified peaks.
genes.associated.with.SVs.tsv: this file contains statisitcal information for all genes affected by SVs.
In addition, SV-HotSpot provides various visualizations composed of overlaying tracks representing copy number aggregation, SV aggregation, and gene/regulatory/region of interest annotation tracks.

Additional files:

Peaks files for each chromosome and their corresponding figures. These files are located in the peaks folder.
UCSC custom track files. These files are located in peaks/ucsc_custom_track_files. These files can be viewed on the UCSC Genome Browser.

Plot Peaks (hotspot sites)

In some cases when the number of detected peaks is high, it is impractical to plot all peaks as this process takes long time. Thus, we set the tool to plot only the top # of peaks (default is 10). In case you need to increase/decrease this number, you need to provide this parameter --plot-top-peaks=# with the number of peaks required. Set this parameter to 0 in case you do not want to plot any peaks.

To plot peak(s), we provided a script for this process. You just need to provide the peak name(s), SV file, the results directory, the expression, and copy number data with the remaining parameters shown above. Peak names must be separated by comma. To show the usage page of this command type the following command:

plot-peak.pl

USAGE:
      plot-peak.pl [OPTIONS] -p <peakName1,peakName2,...> --sv <structuralVariants> --res-dir <resultsDirectory> -e/--expr <expression> -c/--cn <copynumber>

      NOTE:
    (1) Results directory should be the same as the output directory used with sv-hotspot.pl

OPTIONS:
    -a/--annot  Annotation file     <filename>  [ an annotation file in BED format ]
    -o/--output output directory    <string>    [ default: ./ ]
    --t-amp     amplification threshold <float/int> [ threshold for copy number amplifications. default: 2.8 ]
    --t-del     deletion threshold  <float/int> [ threshold for copy number deletions. default: 0.5 ]
    --chip-cov  chip-seq coverage   <filename>  [ If ChIP-Seq coverage file is provided, peaks will be overlapped with this file ]
    --chip-cov-lbl  chip-seq coverage label <string>    [ the chip-seq coverage label used in the plot (e.g. histone name) ]
    --left-ext  size of left extension  <int>       [ number of extended bases from the left side of the peak. default: 0 ]
    --right-ext size of right extension <int>       [ number of extended bases from the right side of the peak. default: 0 ]

As an example, the following command plots peak "pX.55.1" from the above test.

PATH_TO_SV_HOTSPOT/src/plot-peak.pl -p pX.55.1 \
             --sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \ 
             --res-dir /RESULTS/PATH \
             -e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
             -c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
             --chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
             -r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
             -o /SOME/PATH --t-amp 2.99 --t-del 1.35 --chip-cov-lbl H3K27ac \
             --left-ext 100000 --rigth-ext 100000

Example of SV-HotSpot visualization

(1) Recurrent SVs targeting a non-coding region located upstream of AR gene:

(2) Circos plot of recurrent SV hotspots identified by SV-HotSpot across the genome:

SV-HotSpot Docker Instructions

To use SV-HotSpot, a docker image has been created and tested on Linux and Mac. To run SV-HotSpot, you need to have Docker installed on your machine.

Docker Installation

Ubuntu: follow the instructions to get Docker CE for Ubuntu.
Mac: follow the instructions to install the Stable verion of Docker CE on Mac.
Post-processing and filtering

Once hotspots of SVs are identified, users should perform post-processing and filtering. SV-HotSpot provides a script to perform post-filtering of the results based on multiple criteria including filtering of genes with low expression and/or weak expression association (p-value, logFC, mean group expression), removal of wide peaks not associated with known cancer genes (eg. using COSMIC census genes), removal of peaks with low number of samples harboring hotspot SVs. To performed filtering, run the filter.r command as follows:
```
 Usage:  Rscript filter.r
             <SV-HotSpot_result_dir>
             <max.p-value-to-infer-expression-association>
             <min.logfc>
             <min.group.expr>
             <max.number.of.associated.gene.per.peak, peak w/ more genes associated are ignored>
             <max.peak.length, peak wider than this w/o assoc. w/ known genes are ignored>
             <min.peak.percent.samples, 0-100, peaks with percent of samples smaller than this are ignored>
             <tsv_file_of_cosmic_census_genes>
             <csv_other_known_genes_to_keep>

         Example: 
            Rscript filter.r /path/to/sv-hotspot-output 0.05 0.22 10 9 500000 15 \
                             data/cosmic_census_genes.tsv AR,ERG,PTEN,TP53
```
Output will be written to SV-HotSpot result directory which includes the following two files:
```
1) annotated_peaks_summary.filtered.tsv
2) genes.associated.with.SVs.filtered.tsv
```
Visualization of hotpots (peaks)

The top hotspots are automatically visualized when running SV-Hotspot detection (parameter --plot-top-peaks=# when running SV-HotSpot detection). Additionally, users can plot hotspots identified by SV-HotSpot independently of SV-Hotspot detection. To plot hotspot sites (or peaks), run the following command:

To show the usage page of this script, run the following command:
```
docker run chrismaherlab/sv-hotspot plot-peak
```
To plot peaks, use the following command which plots pX.172 and pX.173 peaks (peak name(s) taken from "annotated_peaks_summary.tsv" output file). You will also need to provide the SV calls file used to run SV-Hotspot, path to SV-HotSpot results directory, expression and copy number data, and the remaining parameters shown in SV-HotSpot detection command above. Peak names must be separated by comma and no space between them.
```
docker run -v "$PWD":/data chrismaherlab/sv-hotspot plot-peak \
           -p pX.55.1 --genes-to-plot AR \
           --res-dir /data/SV-HOTSPOT-TEST/sv-hotspot-output \
           -o /data/SV-HOTSPOT-TEST/sv-hotspot-output \
           --sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
           -c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
           -r /data/test_data/enhancers.bed --t-amp 2.99 --t-del 1.35 \
           --chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000
```
- res-dir must refer to the folder containing the results generated from running sv-hospot command with -o option. Please note that you need to include "sv-hotspot-output" at the end of res-dir path since SV-HotSpot always creates this folder which is used to write all output results.
Visualization of whole genome and individual chromsomes

SV-HotSpot also provides whole genome level visualization of hotspot sites. The script plot_whole_genome.r can be used to generate the histogram aggregation of the counts of samples harboring structural variants at whole genome level (circos plot) or individual chromosome level. This script can be run as follows.
```
   Usage: Rscript plot_whole_genome.r
             <SV-Hotspot_result_dir>
             <annotation_dir>
             <genome_assembly_version, eg. hg38>
             <plot_circos, eg. TRUE, FALSE>
             <chromosomes_to_plot, eg. "ALL", "chr1,chrX"> 
             <genes_to_show, eg. "ERG,PTEN,ETV1">
             <color_genes_by_association_direction_with_SV, eg. "TRUE", "FALSE">
             <output_figure_format, eg. "png", "pdf">
             <output_figure_dir>

   Example:
           Rscript plot_whole_genome.r /path/to/sv-hotspot-output /path/to/annotations \
                                        hg38 TRUE chr1,chrX,chr21 ERG,AR,PTEN TRUE png out
```
References
1. Quigley, D. A., Dang, H. X., Zhao, S. G., Lloyd, P., Aggarwal, R., Alumkal, J. J., ... & Playdle, D. (2018). Genomic hallmarks and structural variation in metastatic prostate cancer. Cell, 174(3), 758-769.
How to cite SV-HotSpot?

Eteleeb, A.M., Quigley, D.A., Zhao, S.G. et al. SV-HotSpot: detection and visualization of hotspots targeted by structural variants associated with gene expression. Sci Rep 10, 15890 (2020). https://doi.org/10.1038/s41598-020-71168-7

Contact

Abdallah Eteleeb: eteleeb@wustl.edu

ChrisMaherLab / SV-Hotspot

readme

SV-HotSpot: detection of hotspots targeted by recurrent structural variants associated with gene expression

SV-HotSpot Manual

Installation

Preparing Input

Running SV-HotSpot

Output

Plot Peaks (hotspot sites)

Example of SV-HotSpot visualization

(1) Recurrent SVs targeting a non-coding region located upstream of AR gene:

(2) Circos plot of recurrent SV hotspots identified by SV-HotSpot across the genome:

SV-HotSpot Docker Instructions

Docker Installation

Post-processing and filtering

Visualization of hotpots (peaks)

Visualization of whole genome and individual chromsomes

References

How to cite SV-HotSpot?

Contact