The legacy version of the algorithm used is described in the paper:
Nucl. Acids Res. (2014) 42 (5): e31. doi: 10.1093/nar/gkt992.
The current version is not published separately in a paper.
This version of WISECONDOR differs significantly from previously published versions. While the general idea of a within-sample comparison still applies, the implementation was redone from the ground up. If you prefer to stick with WISECONDOR as it was, please use the legacy branch. The last known stable release of the legacy version is v2.0.1.
A quick overview of changes made in general compared to the legacy version:
to solve a task rather than finding the right script. This approach is similar to using bamtools etc.
-mineffect, default 1.5%) to do so.
What was not changed:
What is lost:
If anything is not clear, try the wiki first. If your problem or question is still unsolved, feel free to post a question or submit a bug report on the issue tracker. If you do not get any response, the contact information on my page should help you to get in touch with me.
Obtain the required modules:
pip install scikit-learn numpy scipy matplotlib pysam futures
You may need to install pip first, and depending on preferences you may add sudo at the beginning of this command.
To get to work without too much reading just use ./
and follow the directions provided.
Binsize parameters can be changed by altering the variables near the start of the script.
WISECONDOR was developed and tested using Python 2.7. Using any other version may cause errors or faulty results. This version uses pysam to read .bam files created by BWA.
The list of packages required and tested versions as reported by pip freeze
To start testing for aberrations, WISECONDOR needs to learn how genomic regions behave when compared to each other. To do so, reference data should be prepared in the same way as the test sample data later on, using the same genomic reference, etc.
Settings we used for external tools:
bwa aln -n 0 -k 0
bwa samse -n -1
picardtools sortsam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
Make sure your data is sorted as specified here. Results can be odd and meaningless otherwise.
To convert a .bam file to a file for analysis, use the convert tool in
python convert ./input/sample.bam ./output/sample.npz -binsize 50000
The binsize specified here is 50kb. You can take any value here (assuming it fits in memory) and upscale data to multiples of this value later on. The final scale is determined by the scale set in the reference creation step.
When not specified, this binsize will be set to 1mb, the default for the legacy version of WISECONDOR.
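As a rough illustration of what binning at a given binsize means, the sketch below counts read start positions per fixed-width bin with numpy. It is not the code WISECONDOR runs: the function name count_reads_per_bin and the toy positions are assumptions made for this example only.

```python
import numpy as np


def count_reads_per_bin(positions, chrom_length, binsize=1000000):
    """Count read start positions in consecutive bins of `binsize` bases."""
    # Number of bins needed to cover the chromosome, rounding up
    n_bins = int(np.ceil(chrom_length / float(binsize)))
    # Integer division maps each position to the index of its bin
    bin_index = np.asarray(positions, dtype=np.int64) // binsize
    return np.bincount(bin_index, minlength=n_bins)


# Hypothetical read starts on a 200 kb chromosome, binned at 50 kb
counts = count_reads_per_bin([10, 49999, 50000, 120000, 199999], 200000, 50000)
print(counts)
```

A smaller binsize gives more bins with fewer reads in each, which is why memory use grows as the binsize shrinks.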
All reference files should be fed into the reference creation tool. Move the reference samples into a separate folder and start the reference-build script:
python newref ./referenceFiles/*.npz ./dataFiles/reference.npz -binsize 250000
The binsize specified here is 250kb. Not specifying this will assume all files provided are at the desired resolution already. This value can only be a multiple of the binsize used for any sample in the reference data.
This is the resolution samples will be scaled to during testing as well. For example, converting a sample at 50kb bins and testing it using a reference at 1mb will provide correct results at 1mb resolution.
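The requirement that the reference binsize is a multiple of the sample binsize can be pictured as summing groups of adjacent small bins. The sketch below shows that idea under stated assumptions: rescale_bins is a hypothetical helper, and padding the last partial bin with zeros is a choice made here, not necessarily what WISECONDOR does.

```python
import numpy as np


def rescale_bins(counts, from_size, to_size):
    """Sum adjacent bins so data binned at `from_size` becomes `to_size`."""
    if to_size % from_size != 0:
        raise ValueError("to_size must be a multiple of from_size")
    factor = to_size // from_size
    counts = np.asarray(counts)
    # Pad with zeros so the last, possibly partial, large bin is kept
    pad = (-len(counts)) % factor
    padded = np.concatenate([counts, np.zeros(pad, dtype=counts.dtype)])
    # Each row holds the small bins that make up one large bin
    return padded.reshape(-1, factor).sum(axis=1)


# Twelve hypothetical 50 kb bins merged into 250 kb bins (five per bin)
small = np.arange(12)
large = rescale_bins(small, 50000, 250000)
print(large)
```

Going the other way, from large bins to small ones, is not possible because the finer read distribution is no longer available.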
You need to feed a single converted file (npz) as explained in 'PREPARING FOR TESTING, File conversion', the path/filename where to save the output data, and the reference as explained and created in 'PREPARING FOR TESTING, Reference creation':
python test ./testSamples/sample.npz ./testSamples/sample_out.npz ./dataFiles/reference.npz
This creates a new npz file, in this example at ./testSamples/sample_out.npz. This file contains the results but is not directly readable by the user. Instead it contains a lot of information that may or may not be required for reports, as well as information that should not be shown at all in diagnostics.
This step is optional in the sense that you are expected to write your own report functionality. In the legacy version people needed to parse the stdout text to obtain results, making integration with other systems difficult. This version provides a simplified way to allow extracting output for third party applications.
If you want a textual report, use the report function and provide both the npz created by conversion and the output npz files:
python report ./testSamples/sample.npz ./testSamples/sample_out.npz
The file created by the test step can be used as input for the plot tool. This tool turns the prepared data into a visualization and can be replaced to suit personal preferences without changing the original algorithm.
python plot ./testSamples/sample_out.npz ./testSamples/sample_plot
Be aware that things have changed since previous versions:
The plot tool can add a simple cytoband near the bottom of every plot as a visual aid. The file used for Hg19 can be found at UCSC. Unpack the archive and feed the text file to the plot tool using -cytofile cytoBand.txt. As long as you keep the same format you can use another cytoband file that matches your reference genome.
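For reference, the UCSC cytoBand.txt format is tab-separated with five columns: chromosome, start, end, band name and stain. The sketch below parses that layout so you can check a replacement file before passing it to the plot tool. The names CytoBand and read_cytobands are made up for this example, the two input lines are illustrative, and this is not how the plot tool itself reads the file.

```python
import collections
import io

CytoBand = collections.namedtuple("CytoBand", "chrom start end name stain")


def read_cytobands(handle):
    """Parse tab-separated cytoband lines: chrom, start, end, name, stain."""
    bands = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        chrom, start, end, name, stain = fields[:5]
        bands.append(CytoBand(chrom, int(start), int(end), name, stain))
    return bands


# Two illustrative lines in the same layout as cytoBand.txt
example = io.StringIO(
    u"chr1\t0\t2300000\tp36.33\tgneg\n"
    u"chr1\t2300000\t5400000\tp36.32\tgpos25\n"
)
bands = read_cytobands(example)
print("%s %d" % (bands[1].name, bands[1].end - bands[1].start))
```

To check a real file, open it in text mode and pass the file handle to read_cytobands instead of the in-memory example.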
Do not use reference data from one lab to test samples from another. Every reference file, laboratory and sequencing machine has its own effect on how read depth per bin behaves. Any results obtained by combining files from different origins are unreliable.
To improve your results you probably want to change a few parameters. Most of the values used in WISECONDOR can be altered using arguments. To find out what arguments can be passed into any script, try running it with -h as argument, for example:
python -h
python convert -h
Add -cpus 4 to newref to enable multi core reference creation using 4 cores. Replace 4 with any integer matching the number of cores you have available for a drastic speed up.
Submit reference creation parts as jobs on a compute cluster using three steps instead of newref. These steps consist of a single-core preparation step, a multi-core step, and finally another single-core step to combine data again. While overkill for 1mb bins, and likely overdone for 250kb bins, this approach becomes more useful when increasing resolution even further (e.g. 50kb).
newrefprep ./refSamples/*.npz ./dataFiles/refprep.npz
newrefpart ./dataFiles/refprep.npz ./dataFiles/refpart 1 4
newrefpost ./dataFiles/refprep.npz ./dataFiles/refpart 4 ./dataFiles/reference.npz
The single core approach is currently implemented to call all steps in the multicore process in sequential order. The -cpus option in newref simply applies the cluster approach by itself. Therefore, all approaches use the same code and differences between single and multi core reference creation should be non-existent.
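The prepare, part and combine pattern can be sketched as follows. How newrefpart actually divides the work is not described here, so the contiguous split in part_bounds is an assumption, and process_part is only a stand-in for the real per-part computation. The point is that running one part or several parts through the same code yields the same combined result.

```python
def part_bounds(n_items, part, n_parts):
    """Return the [start, end) slice handled by 1-based `part` of `n_parts`."""
    start = (part - 1) * n_items // n_parts
    end = part * n_items // n_parts
    return start, end


def process_part(values, part, n_parts):
    # Stand-in for the per-part work; here it just squares each value
    start, end = part_bounds(len(values), part, n_parts)
    return [v * v for v in values[start:end]]


def combine(parts):
    # Concatenate the part results in order, like a final post step
    return [item for part in parts for item in part]


values = list(range(10))
# One part covering everything versus four parts that could run as jobs
single = combine([process_part(values, 1, 1)])
split = combine([process_part(values, p, 4) for p in range(1, 5)])
print(single == split)
```

Because every part only needs the prepared input and its own part number, each call to process_part could be submitted as an independent cluster job.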
If you only want to see aberrations on chromosomes 13, 18 and 21 in the plots, as some medical centers appear to prefer over viewing the whole genome, you can add the following options to the plot tool:
-chromosomes 13 18 21
Force only showing these three chromosomes. Replace at will with any selection of autosomal chromosomes you want.
-columns 1
Use the whole page width for chromosomes rather than two columns
If you open the script and find def toolReport(args): you can see how all information can be obtained from the npz files directly. However, some of the arguments saved in an npz file are functions referenced directly, and these are saved in the npz as a reference rather than as plain data. If you load one of these files without importing the functions first you may run into errors. Either fill the missing function variables with empty stubs or import the actual functions: from wisetools import *
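A minimal sketch of inspecting an npz file with numpy is shown below. It writes a small stand-in file first so the example runs on its own. The keys binsize and results are invented for illustration and are not the keys WISECONDOR writes, so list data.files on your own output to see what is actually stored.

```python
import os
import tempfile

import numpy as np

# Write a small stand-in file; a real WISECONDOR output has other keys
path = os.path.join(tempfile.mkdtemp(), "example_out.npz")
np.savez(path, binsize=np.array(250000), results=np.array([0.1, -0.2, 1.7]))

# allow_pickle=True is needed when a file holds pickled Python objects;
# anything those objects reference must be importable at load time
data = np.load(path, allow_pickle=True)
for key in sorted(data.files):
    print("%s %s" % (key, data[key].shape))
binsize = int(data["binsize"])
data.close()
```

Only use allow_pickle=True on files you created yourself, since unpickling can execute arbitrary code.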