GeorgetteTanner / w-Wessim2

A tool for in silico whole-exome sequencing.
GNU General Public License v3.0
2 stars 2 forks source link

w-Wessim2

Description

w-Wessim2 is an in silico whole exome sequencing (WES) method for generating simulated reads from a given genome. It defines regions for sequencing by using a BLAT alignment of probes to the genome being sequenced, thereby simulating the exon capture hybridisation during library preparation, and capturing the effects of copynumber mutations.

w-Wessim2 is adapted from w-Wessim, described in Tanner et al., 2019 (https://doi.org/10.1093/bioinformatics/bty1063). Improvements to w-Wessim2 include using ReSeq (Schmeing S and Robinson M, 2020) for sequencing error modelling, as well as requiring less computational resources.

Citation

Please cite the following when using w-Wessim2:

w-Wessim:

Tanner G, Westhead DR, Droop A, Stead LF. Simulation of heterogeneous tumour genomes with HeteroGenesis and in silico whole exome sequencing. Bioinformatics. 2019 Jan 4;35(16):2850–2.

Wessim:

Kim S, Jeong K, Bafna V. Wessim: a whole-exome sequencing simulator based on in silico exome capture. Bioinformatics. 2013;29:1076–7.

ReSeq:

Schmeing S, Robinson D. ReSeq simulates realistic Illumina high-throughput sequencing data. bioRxiv 2020.07.17.209072.

Versions

v1.0 Initial release

Requirements

Programs required:

BLAT

BLAT is required to generate an alignment of the probe sequences to the genome the user wishes to sequence. It takes a list of hybridisation probe sequences in FASTA format and a genome sequence, and outputs a .psl file listing locations of all alignments (that meet the given stringency parameters) for each probe.

We reccomend using the multi-threaded pblat version of BLAT, created by Wang Meng and available from http://icebert.github.io/pblat/. Alternatively, the single threaded original BLAT program can be downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/ and selecting your computer type. You may need to make the file an executable with:

chmod a+x ./blat

The -minScore and -minIdentity parameter for BLAT can be used to adjust how stringent the alignment is. We find that 95 for both gives the most realistic WES coverage results with w-Wessim2 and is also likely to accurately model the effect of variants on probe hybridisation in WES.

faToTwoBit

faToTwoBit is required to generate a 2bit format of the genome for use with (p)BLAT and can also be downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/.

You may need to make the file an executable with:

chmod a+x ./fatotwobit

ReSeq

The seqToIllumina module of ReSeq is used to incorporate errors into the reads. ReSeq can be installed using conda:

conda install -c bioconda -c conda-forge reseq

or by downloading it from github and following installation instructions: https://github.com/schmeing/ReSeq.

Inputs

paste - - - - < file.fastq | cut -f 1,2 | \
sed 's/^@/>/' | tr "\t" "\n" > file.fa

Probes can be downsampled to the required number {NUM} with:

cat file.fa| paste - - > shuf -n {NUM} | \
sed 's/^.//' > shuf_file.fa
sort -k1 -n shuf_file.fa > sort_file.fa 
sed -e 's/^/>/' sort_file.fa | tr '\t' '\n' > \
downsampled_{NUM}_file.fa

A test dataset consisting of 1x10^5 of the ERR2752113 reads can be downloaded using:

 wget https://github.com/GeorgetteTanner/data/raw/master/real_wes_reads_probes_subsampled.fa.gz

Resouces

The BLAT alignment takes ~200hours and ~5GB memory on a single thread when used with 1x10^8 probe sequences. This can be shortened by 1) multithreading with pBLAT, 2) splitting the probes up and running each set in parallel in separate runs, followed by combining the resulting .psl files, or 3) downsampling probe numbers as discussed above.

w-Wessim2 takes ~2hours and ~32GB memory when run with a BLAT alignment of 1x10^8 probes and the hg38 human reference genome, to generate 1x10^8 fragments. The memory requirement is dependent on both genome size and probe number, and runtime is mainly dependent on the number of simulated fragments.

ReSeq takes ~40min and ~6GB memory to process the genome with replanceN and illuminaPE, and ~300min and ~500MB memory on a single thread to convert the simulated fragments from w-Wessim2 to reads. This can be multithreaded to reduce runtime.

Parameters

Parameter Description Default Value
-R Genome sequence in FASTA format - must be indexed with faidex. Required
-B BLAT allignment output file of probes to the genome, in .psl format. Required
-S Systematic errors file for genome. Required
-O Output file name. Required
-N Number of fragments to simulate. Required
-f Mean fragement size (when using paired-end sequencing). 200
-d Standard deviation of fragment size. 50
-m Minimum fragment lenngth. read length + 20
-y Minimum required fraction of probe match to be hybridized. 50

Use

#Download w-Wessim2
git clone 
https://github.com/GeorgetteTanner/w-Wessim2.git

#Randomly replace all Ns in the genome being sequenced
reseq replaceN -r {GENOME}.fasta -R {GENOME}_noNs.fasta

#Generate systematic errors for the genome 
reseq illuminaPE -r {GENOME}_noNs.fasta -s {ReSeq_profile}.reseq \
--stopAfterEstimation --writeSysError {GENOME}_noNs_syserrors.fq

#Convert genome to 2bit format
faToTwoBit {GENOME}_noNs.fasta {GENOME}_noNs.fasta.2bit

#Generate BLAT alignment 
pblat {GENOME}_noNs.fasta.2bit {PROBES}.txt {PROBES}_{GENOME}_noNs.psl \
-threads=24 -minScore=95 -minIdentity=95

#Generate fragments with w-Wessim2
python2 w-Wessim2.py -R {GENOME}_noNs.fasta -S \
{GENOME}_noNs_syserrors.fq -B {PROBES}_{GENOME}_noNs.psl -N \
{READNUM} -O {GENOME}_w-wessim2_{READNUM}.fa \
-T {ReSeq_profile}.reseq -m 20 -f 170 -d 35

#Create reads with ReSeq
reseq seqToIllumina -j 24 -s {ReSeq_profile}.reseq -i \
{GENOME}_w-wessim2_{READNUM}.fa -o \
{GENOME}_w-wessim2_{READNUM}.fq

#Split reads into 1st and 2nd reads
paste - - - - < {GENOME}_w-wessim2_{READNUM}.fq | sed -n \
'0~2p' |  tr "\t" "\n" > {GENOME}_w-wessim2_{READNUM}_1.fq
paste - - - - < {GENOME}_w-wessim2_{READNUM}.fq | sed -n \
'1~2p' |  tr "\t" "\n" > {GENOME}_w-wessim2_{READNUM}_2.fq