CartwrightLab / dawg

Simulating Sequence Evolution
GNU General Public License v2.0
11 stars 3 forks source link

DAWG VERSION 2-CURRENT

Copyright (c) (2004-2010) Reed A. Cartwright - All rights reserved.

DESCRIPTION

Dawg is an application that will simulate sequence evolution with gaps.

ABSTRACT

DNA Assembly with Gaps (Dawg) is an application designed to simulate the evolution of recombinant DNA sequences in continuous time based on the robust general time reversible model with gamma and invariant rate heterogeneity and a novel length-dependent model of gap formation. The application accepts phylogenies in Newick format and can return the sequence of any node, allowing for the exact evolutionary history to be recorded at the discretion of users. Dawg records the gap history of every lineage to produce the true alignment in the output. Many options are available to allow users to customize their simulations and results.

Many tools and procedures exist for reconstructing alignments and phylogenies and estimating evolutionary parameters from extant data. True phylogenies and alignments are known in very rare instances. In the absence of known data with true phylogenies, we are left with using simulations to test the accuracy of such procedures. Proper simulation of sequence evolution should involve both nucleotide substitution and indel formation. However, existing tools for simulating sequence evolution either do not include indels, like Seq-gen or evolver, or include a rather inexact model of indel formation, like Rose. I developed Dawg to fill in these gaps.

CONTACT

racartwright@uh.edu or reed@scit.us

REFERENCE

Cartwright, R.A. (2005) DNA Assembly With Gaps (Dawg): Simulating Sequence Evolution. Bioinformatics 21 (Suppl. 3): iii31-iii38

LICENSE

See copying.txt for license information.

DOWNLOAD

Dawg can be downloaded from http://scit.us/projects/dawg/.

PREREQUISITES

If installing from source you, need to ensure that the development libraries of Boost http://www.boost.org/ and GSL http://www.gnu.org/software/gsl/ are installed and findable on your machine. Dawg can compile with older versions of Boost, but requires a recent version of Boost's Spirit library. If the version of Boost on your machine doesn't comes with a recent version of Spirit, just download a recent version of the boost library and copy the header files for spirit into [dawg source]/src/include/boost/spirit. That way they will be found by the compiler before older versions of spirit.

If you have to install Boost locally on a Unix machine, the following works for me:

cd boost_1_44_0 ./bootstrap.sh --prefix=$HOME ./bjam install cd ../dawg-current cmake -DBOOST_ROOT=$HOME . make

Change as needed.

INSTALLATION

See Dawg's website for binary packages for Windows, Mac OSX, and other systems. Alternatively, you can compile Dawg from the source. Dawg requires CMake 2.6 (http://www.cmake.org/) to build it from sources. Many Unix-like operating systems can install CMake through their package systems. Extract the Dawg source code and issue the following commands in the extracted directory:

cmake . make make install

The '-G' option to cmake is used to specify different build systems, e.g. Unix Makefiles versus KDevelop3 project. The '-D' option to cmake can be used to set different cmake variables from the command line:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr . make make install

This will build an optimized version of Dawg and install it to '/usr/bin'. To specify your own build flags you need to set the environment variables CFLAGS and LDFLAGS as necessary. Then specify

cmake -DCMAKE_BUILD_TYPE= .

See CMake's manual for additional information.

If you would prefer to run the command line version, then open up a command console through the Visual Studio tools shortcut (or similar shortcut). This will add the required compiler programs to your command console environment. After changing to the source code directory issue the following commands:

cmake -G "NMake Makefiles" . nmake

If successful, you should find dawg.exe in the "src" directory.

If you are trying to compile Dawg on a UNIX machine that does not have CMake installed, and you can't install it from a package, then you may need to install it locally. After downloading and extracting CMake in your home directory, change to its directory and issue the following commands.

./configure --prefix=$HOME make make install

If "make" fails, try using "gmake" instead.

COMMAND LINE USAGE

dawg [options] trick.dawg

Use "dawg --help" for help information. Dawg will read stdin if filename is "-".

Use "dawg --help-trick" for help regarding input files.

DESCRIPTION OF SIMULATION

Dawg splits the simulation of a single replicate into jobs based on the tree and sequence section. Different areas of the tree and different areas of the sequence can have different evolutionary models. By varying the phylogeny in different parts of the sequence you can produce recombinant sequences.

At the first level, the root sequence is split into a series of "segments", which evolve independently, except for deletions which may span many segments. Next the phylogeny of each segment is split into multiple sections allowing different evolutionary models for different parts of the tree. This all can be easily controlled from the input file format.

INPUT FILE FORMAT

Dawg is controlled by a series of input files, referred to as tricks. Tricks contain a series of Sections which define different models for different sequence or tree regions:

[[SectionA]]

Parameter.A = valueA Parameter.B = valueB, valueC

[[SectionB]]

Parameter.C = valueZ Parameter.B = valueD, valueDD

Double square brackets define a new section. At the start of a trick file there is an implied [[initial]] section header, allowing you to skip specifying one if it is unneeded.

By default sections inherit the values of the section above it. Section initial inherits from an implied default section. In the example above, SectionA inherits the values from initial, and SectionB from SectionA. This makes it easy to specify a new section without having to write out everything. For SectionB, Parameter.A has valueA, just like in SectionA.

You can also change the default behavior of inheriting from the previous section:

[[SectionC = SectionA]]

Here SectionC will inherit the values from SectionA, not SectionB. If you use a blank header, [[]], the name of the section will be generated automatically.

Another shortcut is to use parameter headers:

[[SectionA]]

Foo.Bar = A Foo.Par = B

[[SectionB]]

[Foo] Bar = A Par = B

Both sections produce identical results. Parameters in Dawg are named such that the trick files can be simplified using headers. Some parameter headers are also special:

[] # clears the current header [.part] # add a new part to the current header [..part] # replace last part of the header [....part] # replace last two parts

If 'part' is blank it simply deletes the last part of the current header.

Each parameter line in a trick file contains a parameter id, equals, and a list of strings, separated by commas. Ids can contain one or more numbers, letters, dashes, underscores, and periods: e.g. [A-Za-z0-9._-]+. There are four different ways to specify strings.

Bare Strings contain a series of non-space characters except, excluding ,#"[]=()

Tree Strings lack spaces and start with '(' and end with ';'.

Quoted Strings are a series of printable characters between double quotation marks. Newlines are not acceptable

Triple Quoted Strings can contain any character between two sets of three double quotes.

An example of all four string types:

AList = Bare_String, (Tree:0.1,String:0.1);, "Quoted String", """Double Quoted String"""

Comments start with '#' and go to the end of the line.

OUTPUT FILE

Dawg can automatically detect the format of the output file based on its extension. Supported formats and their extensions are:

Clustal: aln Fasta: fasta, fas, fsa Nexus: nexus, nex Phylip: phylip, phy Poo: poo

Dawg also supports the filename format of "ext:file" to output to "file" with the format specified by extension "ext". That way one can use "nex:-" to output to stdout in Nexus format. Partial matches of extensions are allowed.

If the --split option is on, the each replicate will be saved to its own file, based on the filename in the output option.

NOTES

The meaning of the "Params" vector is different for each substitution model.

GTR: Substitution rates A-C, A-G, A-T, C-G, C-T, G-T JC: Ignored K2P: Transition rate, Transversion rate K3P: Alpha (Transitions), Beta (A-T & G-C), Gamma (A-C & G-T) HKY: Transition rate, Transversion rate F81: Ignored F84: Kappa TN: Alpha1 (A-G), Alpha2 (C-T), Beta (Transversions)

Parameter "Freqs" is ignored by the models "JC", "K2P", and "K3P".

If "Lambda" is a single value, then it specifies the rate of indel formation, e.g. "Lambda = 0.1" is the same as "Lambda = {0.05, 0.05}". The first parameter is the insertion rate and the second parameter is the deletion rate.

The first parameter of "GapModel" specifies the distribution model of insertion sizes. The second parameter specifies the distribution model of deletion sizes. If only one parameter is given it is the model for both insertions and deletions.

The first parameter of "GapParams" is a vector specifying the parameters for the gap model of insertions. Likewise the second parameter is a vector specifying the parameters for the gap model of deletions. If "GapParams" is not a vector of vectors, then it specifies the vector of parameters for both insertions and deletions.

The meaning of the GapParams vector is different for each gap model. US: The distribution of gap sizes. NB: The number of failures (r), the probability of success (q). PL: The rate parameter (a), the maximum gap size.

To create a recombinant tree, you may need to specifically describe and label the inner nodes at which the recombination events occur.

Gamma takes precedence over Alpha.

Sequence takes precedence over Length.

If Out.Block.* is the name of a file, the code is read from that file.

The following vector parameters have a size of "Width": "Scale", "Alpha", "Gamma", and "Iota". If their size is less than width then the first value in the vector will be used to fill in the rest of the values, e.g. "Scale = 1.0" is the same as "Scale = {1.0,1.0,1.0}" when "Width = 3".