Whole-Genome Dosage: a suite of tools to evaluate dosage in whole-genome sequencing libraries
Contact: Ryan Collins (rlcollins@g.harvard.edu)
All code copyright (c) 2018 Ryan Collins and is distributed under terms of the MIT license.
As of bedtools 2.24.0
, some arguments and parameters changed for bedtools coverage
, an essential element of the WGD workflow. The WGD pipeline has been tested with bedtools 2.20.1
and bedtools 2.25.1
, thus includes compatibility for both "new" and "old" parameter specifications, but one step (binCov.py
) will require an argument to specify which version of bedtools
you are using. See more details in the usage note for binCov.py
The WGD pipeline requires the following:
Binned coverage is calculated by binCov.py
on a per-chromosome basis. For human whole-genome sequencing analyses, nucleotide coverage at bin sizes of 100bp is recommended. Only primary contigs recommended; e.g. 1...Y. Parallelization of this process is encouraged.
There are two approaches to parallelization, depending on your available computational resources. Examples are given below using LSF as a scheduler, but could be easily configured to your scheduler/environment.
Fully parallelized approach (almost always preferable)
#!/usr/env/bin bash
while read sample; do
while read contig; do
bsub "binCov.py ${sample}.bam ${contig} ${sample}.${contig}.rawCov.bed \
-n ${sample}.${contig}.normCov.bed \
-b 100 \
-m nucleotide \
-x /path/to/Nmask.bed"
done < list_of_contigs.txt
done < list_of_samples.txt
Alternatively, if available cores are limited or sample sizes are large, a wrapper script, WG_binCov.sh
, will calculate normalized coverage for a set of contigs in serial from a single bam.
Semi-parallelized approach (preferred if available cores are dramatically fewer than [#contigs x #samples])
#!/usr/env/bin bash
while read sample; do
bsub "WG_binCov.sh ${sample}.bam ${sample} `pwd` \
-L list_of_contigs.txt \
-b 100 \
-m nucleotide \
-x /path/to/Nmask.bed"
done < list_of_samples.txt
Once you have 100bp binCov files per chromosome computed for all samples, the next step is to apply 6F correction. This is accomplished with the script multiCorrection.R
in WGD/bin/
.
To run 6F correction, the multiCorrection.R
script requires a tab-delimited input file with the following three columns for each chromosome:
The latter component, the 6F metadata matrix, is currently available upon request for hg19 and GRCh37. Additional references are forthcoming.
An example head & tail of the expected input file for multiCorrection.R
should be:
/path/to/raw_binCov.chr1.bed.gz /path/to/output_corrected_binCov.chr1.bed.gz /path/to/6F_correction_matrix.chr1.bed.gz
/path/to/raw_binCov.chr2.bed.gz /path/to/output_corrected_binCov.chr2.bed.gz /path/to/6F_correction_matrix.chr2.bed.gz
/path/to/raw_binCov.chr3.bed.gz /path/to/output_corrected_binCov.chr3.bed.gz /path/to/6F_correction_matrix.chr3.bed.gz
...
/path/to/raw_binCov.chr22.bed.gz /path/to/output_corrected_binCov.chr22.bed.gz /path/to/6F_correction_matrix.chr22.bed.gz
/path/to/raw_binCov.chrX.bed.gz /path/to/output_corrected_binCov.chrX.bed.gz /path/to/6F_correction_matrix.chrX.bed.gz
/path/to/raw_binCov.chrY.bed.gz /path/to/output_corrected_binCov.chrY.bed.gz /path/to/6F_correction_matrix.chrY.bed.gz
Once you've compiled this input file per sample, you can call the 6F correction script as follows:
$ WGD/bin/multiCorrection.R -z /path/to/6F_correction_input_file.txt
This job should require less than a couple hours to run on most standard machines.
Once 6F correction is complete for all chromosomes per sample, you next need to run the WGD dosage bias scoring model. This has three intermediate steps, detailed below.
Step 2.1: Generate a WGD scoring input file per sample. This is done by combining all 6F-adjusted binCov files across all chromosomes, then filtering them down to a specific set of bins we know to be informative for dosage bias scoring. This is accomplished with a four-liner, as follows:
zcat /path/to/output_corrected_binCov.chr*.bed.gz | \
bedtools intersect -f 1.0 -r -wa -a - \
-b WGD/refs/WGD_scoring_mask.6F_adjusted.100bp.h37.bed > \
/path/to/WGD_scoring_file_output.bed
Step 2.2: Combine WGD scoring input files across all samples into a WGD scoring matrix. This is done the same way you would create any binCov matrix (e.g. before running cn.MOPS or RdTest), using the makeMatrix.sh
script in WGD/bin/
. Example usage:
#First, generate makeMatrix input file
while read ID; do
echo -e "${ID}\t/path/to/${ID}.6F_adjusted.WGD_scoring_bins.bed.gz"
done < samples.list > \
/path/to/WGD_scoring.makeMatrix_input.txt
#Next, make the matrix
/WGD/bin/makeMatrix.sh -z \
-o /path/to/output_WGD_scoring_masked.matrix.bed \
-r /WGD/refs/WGD_scoring_mask.6F_adjusted.100bp.h37.bed \
/path/to/WGD_scoring.makeMatrix_input.txt
Step 2.3: Run the WGD scoring model. Finally, you run the WGD scoring model on the matrix you created above with the following command:
/WGD/bin/scoreDosageBiases.R -z \
-O /path/to/WGD_model_output/ \
/path/to/output_WGD_scoring_masked.matrix.bed.gz \
/WGD/refs/WGD_scoring_mask.6F_adjusted.100bp.h37.bed
The output from the model will be a list of dosage scores, and a pdf of two plots visualizing the dosage score distributions for all samples.
Similar to the WGD scoring model above, the ploidy estimation model has four sequential steps, detailed below.
Step 3.1: Recompress binCov data per sample per chromosome to 1Mb bins. This is accomplished using the compressCov.sh
script shipped with the WGD repo (under WGD/bin/
). Example usage below:
#Do this per chromosome per sample
/WGD/bin/compressCov.sh -N -z -s \
-o /path/to/6F_adjusted.1Mb_bins.binCov.bed.gz \
/path/to/6F_adjusted.100bp_bins.binCov.bed.gz \
10000
Step 3.2: Combine all recompressed binCov files into a single file per sample. This is done using the output from step 3.1, and can be accomplized using something simple like zcat
or cat
.
Step 3.3: Create matrix of 1Mb recompressed binCov values across all samples. This is done using the output from step 3.2, and can be done using the same instructions as in step 2.2.
Step 3.4: Run the ploidy estimation model. This is performed on the 1Mb coverage matrix from step 3.3. with the estimatePloidy.R
script. Example usage below:
/WGD/bin/estimatePloidy.R -z \
-O /path/to/ploidy_estimate_output_directory/ \
/path/to/6F_adjusted.1Mb_bins.binCov_matrix.bed.gz
The output from this ploidy estimation script is a large number of files and plots, all of which summarize and/or visualize per-sample ploidy per chromosome.
Median values can be evaluated with medianCoverage.R
. This tool operates in two modes (1) medians per bin across all samples, and (2) medians per sample across all bins. For each, median coverages are returned when considering all data and also when excluding bins (or samples) with zero coverage. See documentation of medianCoverage.R for details of this process.
CNView is a companion tool for WGD that can visualize, score, and annotate CNVs directly from raw makeMatrix.sh
output. More details on CNView can be found here. If you use CNView, please cite Collins et al., 2016.
Iterates through a single chromosome of a bam file and calculates either nucleotide or physical coverage in regularly segmented bins.
usage: binCov.py [-h] [-S] [-C] [-I INDEX_PATH] [-z] [-n NORM_OUT]
[-b BINSIZE] [-m {nucleotide,physical}] [-x BLACKLIST]
[-v OVERLAP] [--oldBT]
bam chr cov_out
Calculates non-duplicate primary-aligned binned coverage of a chromosome from
an input BAM file
positional arguments:
bam Input bam (or sam/cram, but requires appropriate flag)
chr Contig to evaluate
cov_out Output bed file of raw coverage
optional arguments:
-h, --help show this help message and exit
-S, --SAM Input file is in sam format
-C, --CRAM Input file is in cram format
-I INDEX_PATH, --index_path INDEX_PATH
Bam/cram index file
-z, --gzip Gzip output files bed files
-n NORM_OUT, --norm_out NORM_OUT
Output normalized coverage
-b BINSIZE, --binsize BINSIZE
Bin size, in bp (default: 1000)
-m {nucleotide,physical}, --mode {nucleotide,physical}
Evaluate nucleotide or physical coverage (default:
nucleotide)
-x BLACKLIST, --blacklist BLACKLIST
BED file of regions to ignore
-v OVERLAP, --overlap OVERLAP
Maximum tolerated blacklist overlap before excluding
bin
--oldBT Flag to indicate if you are using a BEDTools version
pre-2.24.0
Usage Notes:
-v
percent overlap by size with blacklisted regions (-x
or --blacklist
). bedtools coverage
syntax assuming bedtools
version 2.24.0 or later (i.e. -a
is bins and -b
is reads; this was reversed starting in bedtools v2.24.0
). Specify option --oldBT
to revert to bedtools syntax pre-2.24.0. Wrapper for serialized execution of binCov.py across multiple chromosomes for an individual sample
usage: WG_binCov.sh [-h] [-b BINSIZE] [-m MODE]
[-L CONTIGS] [-x BLACKLIST] [-v OVERLAP]
BAM ID OUTDIR
Wrapper for serialized execution of binCov.py across multiple chromosomes
Positional arguments:
BAM Input bam
ID Sample ID
OUTDIR Output directory
Optional arguments:
-h HELP Show this help message and exit
-b BINSIZE Bin size in bp (default: 1000)
-m MODE Evaluate physical or nucleotide coverage (default: nucleotide)
-L CONTIGS List of contigs to evaluate (default: all contigs in bam header)
-x BLACKLIST BED file of regions to ignore
-v OVERLAP Maximum tolerated blacklist overlap before excluding bin
Usage Notes:
Helper tool to automate creation of sorted coverage matrices from binCov.py output bed files. Wraps bedtools unionbedg
. Note that, in most cases, multiple chromosome outputs from binCov.py from the same sample should be concatenated before being passed to this tool.
usage: makeMatrix.sh [-h] [-z] [-o OUTFILE] SAMPLES
Helper tool to automate creation of sorted coverage matrices from
binCov.py output bed files
Positional arguments:
SAMPLES List of samples and coverage files (tab-delimmed)
Optional arguments:
-h HELP Show this help message and exit
-z GZIP Gzip output file
-o OUTFILE Output file (default: stdout)
Usage Notes:
Helper tool to calculate median coverages per-bin across all samples or per-sample across all bins. Medians are reported both with and without considering zero-coverage bins or samples. Uses a sorted coverage matrix as can be generated by makeMatrix.sh (see above).
usage: medianCoverage.R [-h] [-b/--binwise] covMatrix.bed OUTFILE
Helper tool to calcualte √arious medians from binCov matrices
Positional arguments:
covMatrix.bed path to binCov coverage matrix
OUTFILE full path to desired output file
Optional arguments:
-h HELP Show this help message and exit
-b/--binwise Compute medians of all samples per bin [default: median of all bins per sample]
Usage Notes:
Helper tool to compress bincov output (individual sample files or mutli-sample matrices) into larger bin sizes. By default, median values per new bin are reported.
usage: compressCov.sh [-h] [-z] [-n] [-o OUTFILE] INPUT RATIO
Helper tool to automate compression of raw binCov.py output bed files or
bed-style coverage matrices into larger bin sizes
Positional arguments:
INPUT path to binCov.py bed file or bed-stype matrix
RATIO compression ratio
Optional arguments:
-h HELP Show this help message and exit
-z GZIP Gzip output file
-s SUM Report sum (default: report median)
-o OUTFILE Output file (default: stdout)
Usage Notes: