SVDSS is a novel method for the discovery of structural variants in accurate long reads (e.g., PacBio HiFi) using sample-specific strings (SFS).
SFS are the shortest substrings that are unique to one genome (the target) with respect to another genome (the reference). SVDSS uses SFS for coarse-grained identification (anchoring) of potential SV sites, then performs local partial-order-assembly (POA) of the SFS clusters at those sites to produce accurate SV predictions. See our manuscript on SFS for more details on the concept.
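To make the definition concrete, here is a toy brute-force sketch of what "shortest substrings unique to the target" means on two short sequences. It is only an illustration under simplifying assumptions: the function name sfs_toy is hypothetical, reverse complements are ignored, and SVDSS itself works on reads with an FMD index rather than by exhaustive enumeration.

```shell
# Toy illustration only, not how SVDSS computes SFS.
# Prints "<0-based start> <string>" for every substring of the target that
# is absent from the reference while both its one-base-shorter prefix and
# suffix occur in the reference (so no shorter specific string is inside it).
sfs_toy() {
  # $1 = reference sequence, $2 = target sequence
  awk -v ref="$1" -v tgt="$2" 'BEGIN {
    n = length(tgt)
    for (i = 1; i <= n; i++) {
      for (len = 1; i + len - 1 <= n; len++) {
        s = substr(tgt, i, len)
        if (index(ref, s) == 0) {
          # First absent extension from position i: its prefix occurs in ref.
          suffix = substr(s, 2)
          if (suffix == "" || index(ref, suffix) > 0) print i - 1, s
          break
        }
      }
    }
  }'
}
```

For example, `sfs_toy ACGTACGT ACGGACGT` prints `2 GG` and `3 GA`, the two shortest strings spanning the single-base difference between the sequences.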
You can "install" SVDSS in two different ways:
To compile and use SVDSS, you need:
To install these dependencies:
# On a deb-based system (tested on Ubuntu 20.04 and Debian 11):
sudo apt install build-essential autoconf cmake git zlib1g-dev libbz2-dev liblzma-dev samtools bcftools
# On an rpm-based system (tested on Fedora 35):
sudo dnf install gcc gcc-c++ make automake autoconf cmake git libstdc++-static zlib-devel bzip2-devel xz-devel samtools bcftools
The following libraries are needed to build and run SVDSS but they are automatically downloaded and compiled while compiling SVDSS:
To download and install SVDSS (should take ~10 minutes):
git clone https://github.com/Parsoa/SVDSS.git
cd SVDSS
mkdir build ; cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
This will create the SVDSS binary in the root of the repo.
For user convenience, we also provide a static binary for x86_64 Linux systems (see Releases). Use it at your own risk: if it does not work, please let us know or build SVDSS yourself :)
SVDSS is available on bioconda:
conda create -n svdss -c conda-forge -c bioconda svdss
This will create the environment svdss, which includes SVDSS and its runtime dependencies (i.e., samtools and bcftools).
For a complete workflow, please refer to (or directly use) the Snakefile or the run-svdss.sh helper script.
Index reference:
SVDSS index --reference /path/to/genome/file --index /path/to/output/index/file
Smooth sample:
SVDSS smooth --reference /path/to/reference/genome/fasta --bam /path/to/input/bam/file > smoothed.bam
Extract SFS from BAM (--bam) or FASTQ/FASTA (--fastx) files:
SVDSS search --index /path/to/index --bam smoothed.bam > specifics.txt
Call SVs:
SVDSS call --reference /path/to/reference/genome/fasta --bam smoothed.bam --sfs specifics.txt > calls.vcf
General options:
--threads sets number of threads (default: 4)
--version print version information
--help print help message
SVDSS requires as input the BAM file of the sample to be genotyped and a reference genome in FASTA format. Please use an appropriate reference genome: if you are not interested in ALT contigs, filter them out or use a reference that does not include them. To genotype a sample, perform the following steps:
1. Build the FMD index of the reference genome (SVDSS index)
2. Smooth the input BAM file (SVDSS smooth)
3. Extract SFS from the smoothed BAM file (SVDSS search)
4. Assemble SFS into superstrings (SVDSS assemble)
5. Call SVs (SVDSS call)
In the guide below we assume the reference genome file is GRCh38.fa and the input BAM file is sample.bam, both present in the working directory. All SVDSS steps must be run in the same directory, so we always pass --workdir $PWD to every command.
Note that you can reuse the index from step 1 for any number of samples genotyped against the same reference genome.
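As a rough sketch of that reuse, the two helpers below build the index only when the index file is missing and then run the search step over several smoothed BAM files. The function names and the output naming scheme (replacing .bam with .sfs.txt) are assumptions made for illustration; only the SVDSS subcommands and flags come from this guide.

```shell
# Hypothetical helpers: index once, then search many samples.

svdss_index_once() {
  # $1 = reference FASTA, $2 = index file to create or reuse
  if [ ! -s "$2" ]; then
    SVDSS index --reference "$1" --index "$2"
  fi
}

svdss_search_many() {
  # $1 = index file; remaining arguments = smoothed BAM files
  idx=$1
  shift
  for bam in "$@"; do
    # Output name is an assumption: sampleA.bam -> sampleA.sfs.txt
    SVDSS search --index "$idx" --bam "$bam" > "${bam%.bam}.sfs.txt"
  done
}
```

A typical invocation would be `svdss_index_once GRCh38.fa GRCh38.fmd` followed by `svdss_search_many GRCh38.fmd sampleA.smoothed.bam sampleB.smoothed.bam`, where the BAM names are placeholders.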
We will now explain each step in more detail:
Build the FMD index of the reference genome:
SVDSS index --reference GRCh38.fa --index GRCh38.fmd
The --index option specifies the output file name.
Smoothing removes nearly all SNPs, small indels, and sequencing errors from reads. This reduces the number of SFS extracted and significantly increases their relevance to SV discovery. To smooth the sample, run:
SVDSS smooth --reference GRCh38.fa --bam sample.bam --threads 16 > smoothed.bam
This writes the smoothed BAM to stdout. The output is sorted in the same order as the input file, but it must be indexed again with samtools index.
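Since the re-indexing step is easy to forget, one option is to wrap smoothing and indexing together. This is a minimal sketch: the function name smooth_and_index is hypothetical, and the thread default of 4 simply mirrors the default listed under the general options.

```shell
# Hypothetical wrapper: smooth a BAM, then index the result.
smooth_and_index() {
  # $1 = reference FASTA, $2 = input BAM, $3 = output BAM, $4 = threads (optional)
  SVDSS smooth --reference "$1" --bam "$2" --threads "${4:-4}" > "$3" &&
    samtools index "$3"   # runs only if smoothing succeeded
}
```

For the files used in this guide, that would be `smooth_and_index GRCh38.fa sample.bam smoothed.bam 16`.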
To extract SFS run:
SVDSS search --index GRCh38.fmd --bam smoothed.bam > specifics.txt
This writes the list of specific strings to stdout. The output includes the coordinates of each SFS relative to the read it was extracted from.
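The usage summary notes that the search step accepts either a BAM file (--bam) or a FASTQ/FASTA file (--fastx). The sketch below picks the flag from the file extension; the function name and the list of recognized extensions are assumptions, and it covers only the search step (the call step described next still expects the BAM used for the search).

```shell
# Hypothetical helper: choose --bam or --fastx from the file extension.
search_sfs() {
  # $1 = FMD index, $2 = reads (BAM, FASTQ, or FASTA)
  case "$2" in
    *.bam) flag=--bam ;;
    *.fq|*.fastq|*.fa|*.fasta) flag=--fastx ;;
    *) echo "unrecognized input: $2" >&2; return 1 ;;
  esac
  SVDSS search --index "$1" "$flag" "$2"
}
```

As with the plain command, redirect the output to a file, e.g. `search_sfs GRCh38.fmd smoothed.bam > specifics.txt`.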
We are now ready to call SVs. Note that the input .bam must be the same one used in the search step and must be indexed with samtools. Run:
SVDSS call --reference GRCh38.fa --bam smoothed.bam --sfs specifics.txt --threads 16 > calls.vcf
You can filter the reported SVs by passing the --min-sv-length and --min-cluster-weight options, which control the minimum length and the minimum number of supporting superstrings for the reported SVs. Higher values of --min-cluster-weight increase precision at the cost of recall. For a diploid 30x coverage sample, --min-cluster-weight 2 produced the best results in our experiments; for a haploid 30x sample, --min-cluster-weight 4 produced the best results.
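The recommended weights can be captured in a small wrapper, sketched below. The function name call_svs and the "diploid"/"haploid" labels are hypothetical, and the weights are only the values reported above for roughly 30x coverage; other coverages will likely need different settings.

```shell
# Hypothetical wrapper: pick --min-cluster-weight from the sample type,
# using the 30x values reported in this guide.
call_svs() {
  # $1 = reference FASTA, $2 = smoothed BAM, $3 = SFS file, $4 = diploid|haploid
  case "$4" in
    diploid) weight=2 ;;
    haploid) weight=4 ;;
    *) echo "unknown sample type: $4" >&2; return 1 ;;
  esac
  SVDSS call --reference "$1" --bam "$2" --sfs "$3" \
    --min-cluster-weight "$weight"
}
```

The calls go to stdout, so a run would look like `call_svs GRCh38.fa smoothed.bam specifics.txt diploid > calls.vcf`.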
This command writes the calls to stdout. Additionally, you can output the alignments of the POA contigs against the reference genome (these POA consensus sequences are used to call SVs) using the --poa option.
For user convenience, we distribute a Snakefile to run the entire pipeline, from reference + aligned reads to SVs:
# update config.yaml to suit your needs
# run:
snakemake [-n] -j 4
Note: to run this example, samtools and bcftools must be in your path. Running SVDSS on the example data, once downloaded, should take less than 5 minutes.
# Download example data from zenodo
wget https://zenodo.org/record/6563662/files/svdss-data.tar.gz
mkdir -p input
tar xvfz svdss-data.tar.gz -C input
# Download SVDSS binary
wget https://github.com/Parsoa/SVDSS/releases/download/v2.0.0-alpha.1/SVDSS_linux_x86-64
chmod +x SVDSS_linux_x86-64
# Download snakemake workflow and run it
wget https://raw.githubusercontent.com/Parsoa/SVDSS/master/config.yaml
wget https://raw.githubusercontent.com/Parsoa/SVDSS/master/Snakefile
snakemake -p -j 2
# Alternatively, you can use the bash helper script
wget https://raw.githubusercontent.com/Parsoa/SVDSS/master/tests/run-svdss.sh
bash run-svdss.sh ./SVDSS_linux_x86-64 input/22.fa input/22.bam svdss-output
SVDSS is developed by Luca Denti, Parsoa Khorsand, and Thomas Krannich.
For inquiries on this software please open an issue.
SVDSS is published in Nature Methods.
Instructions on how to reproduce the experiments described in the manuscript can be found here (also provided as submodule of this repository).