art_modern
Modernized ART that is parallelized and modularized using modern C++.
WARNING: Largely under development in internal Git hosting. The Git repository on GitHub may not reflect the latest status.
ART is excellent software for simulating reads from a reference genome. However, it comes with the following limitations:
So, we developed art_modern with the following ideas:
htslib parser allows on-disk random access of enormous genomes without reading them into memory, while the stream parser allows streaming of FASTA files.
Build the project using:
mkdir -p build_release
env -C build_release cmake -DCMAKE_BUILD_TYPE=Release -DCEU_CM_SHOULD_ENABLE_TEST=FALSE ..
env -C build_release make
The project binary will be available at build_release/art_modern.
stacktrace_backtrace: See here for details.
CMake variables should be set when invoking cmake. For example, cmake -DBUILD_SHARED_LIBS=ON sets BUILD_SHARED_LIBS to ON.
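Several variables can be combined in one cmake invocation. The following is a minimal sketch, not a recommended configuration: it only prints the commands for an optimized build with debug symbols and with tests explicitly disabled. The directory name build_relwithdebinfo and the helper print_build_commands are hypothetical names chosen for this example; the variables themselves are the ones documented in this section.

```shell
# Sketch only: prints the configure/build commands instead of running them.
# build_dir is a hypothetical directory name chosen for this example.
build_dir=build_relwithdebinfo

# RelWithDebInfo turns tests on when CEU_CM_SHOULD_ENABLE_TEST is unset,
# so the variable is set explicitly here to override that default.
cmake_args="-DCMAKE_BUILD_TYPE=RelWithDebInfo -DCEU_CM_SHOULD_ENABLE_TEST=OFF"

print_build_commands() {
    echo "mkdir -p $build_dir"
    echo "env -C $build_dir cmake $cmake_args .."
    echo "env -C $build_dir make"
}

print_build_commands
```

To actually run the build, the printed commands can be piped to a shell from the source root, for example with `print_build_commands | sh`.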
- BUILD_SHARED_LIBS: Whether to build shared libraries.
  - ON (DEFAULT): Will search for shared libraries and use dynamic linking.
  - OFF: Will search for static libraries and use static linking.
- USE_HTSLIB: Which HTSLib implementation to use.
  - hts: Will use the HTSLib found in the system.
- CEU_CM_SHOULD_ENABLE_TEST: Whether tests should be enabled.
  - If unset: Determined by CMAKE_BUILD_TYPE.
  - OFF: Will disable tests.
  - ON: Will enable tests.
- CEU_CM_SHOULD_USE_NATIVE: Whether to build the binaries using -mtune=native, if possible. This results in a faster executable but impaired portability.
  - OFF (DEFAULT): Will not build native executables/libraries.
  - ON: Will build native executables/libraries.
- CMAKE_BUILD_TYPE: The CMake build type.
  - Debug (DEFAULT): For developers with debugging needs. If CEU_CM_SHOULD_ENABLE_TEST is unset, it will be set to TRUE.
  - Release: Optimized executables/libraries without debug symbols. If CEU_CM_SHOULD_ENABLE_TEST is unset, it will be set to FALSE.
  - RelWithDebInfo: Optimized executables/libraries with debug symbols. If CEU_CM_SHOULD_ENABLE_TEST is unset, it will be set to TRUE.

The parallelization strategies of different modes and input parsers are as follows:
Parser \ Mode | wgs | trans | templ |
---|---|---|---|
memory | Coverage | Batch | Batch |
htslib | Coverage | ERROR | ERROR |
stream | ERROR | Batch | Batch |
Currently, we support input in FASTA and PBSim3 transcripts format.
FOR FASTA FORMAT: For read names, only characters before blank space are read.
A compatibility matrix is as follows:
Parser \ Mode | wgs | trans | templ |
---|---|---|---|
memory | FASTA | FASTA, PBSim3 Transcripts | FASTA, PBSim3 Transcripts |
htslib | FASTA | FASTA | FASTA |
stream | FASTA | FASTA, PBSim3 Transcripts | FASTA, PBSim3 Transcripts |
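The FASTA naming rule above (only characters before a blank space are read) can be previewed on an input file before simulation. The following is a rough sketch using standard Unix tools, not the parser used by art_modern: the helper name fasta_names and the two example records are made up, and the sketch assumes that the name ends at the first space character. Whether tabs are treated the same way is not stated here, so tabs are left untouched.

```shell
# Print the name of every FASTA record read from standard input,
# keeping only the characters before the first space.
fasta_names() {
    grep '^>' | sed 's/^>//' | cut -d ' ' -f 1
}

# Example with two made-up records; the description after the space is dropped.
printf '>chr1 example description\nACGT\n>chr2\nTTGA\n' | fasta_names
```

Running `fasta_names < reference.fa | sort | uniq -d` (with reference.fa standing in for a real input file) would list names that become duplicated after truncation.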
Changes on software function:
- Modes: wgs, trans and templ, similar to pbsim3.
- Parsers: memory, htslib and stream.
- se, pe and mp.
- aln output format was dropped.

Changes on software engineering stuff:
This simulator is based on the works of Weichun Huang <whduke@gmail.com> et al., under the GNU GPL v3 license. The software was originally distributed here with the following reference:
The bundled HTSLib library uses the MIT License, with the following reference:
- Refactor output-related argument parser to OutputDispatcher.
- Design and implement a job scheduling system using boost::lockfree and boost::signals2.
- Make it faster.
- Update the HTSLib CMake routine so that macros like HAVE_LIBBZ2 are set correctly.
- Support running under Microsoft Windows.
- Support Illumina Complete Long Read?
- Support MAF output format?
Splitting an interleaved paired-end FASTQ file into separate read 1 and read 2 files can be done through seqtk. For example, to split tmp/test_small_pe/NC_001416.1.fq:
# Read 1
seqtk seq tmp/test_small_pe/NC_001416.1.fq -1 > tmp/test_small_pe/NC_001416.1_1.fq
# Read 2
seqtk seq tmp/test_small_pe/NC_001416.1.fq -2 > tmp/test_small_pe/NC_001416.1_2.fq
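After splitting, a quick sanity check is to confirm that both files hold the same number of reads. The sketch below assumes plain 4-line FASTQ records (no wrapped sequence lines); count_fastq_records is a hypothetical helper written for this example, not part of seqtk or art_modern.

```shell
# Count FASTQ records in a file, assuming each record spans exactly 4 lines.
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}

r1=tmp/test_small_pe/NC_001416.1_1.fq
r2=tmp/test_small_pe/NC_001416.1_2.fq

# Only compare when both split files exist.
if [ -f "$r1" ] && [ -f "$r2" ]; then
    if [ "$(count_fastq_records "$r1")" = "$(count_fastq_records "$r2")" ]; then
        echo "read counts match"
    else
        echo "read counts differ" >&2
    fi
fi
```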