YU-Zhejian / art_modern

A modernized ART for Illumina read simulation.
GNU General Public License v3.0

Readme for art_modern

Modernized ART that is parallelized and modularized using modern C++.

WARNING: This project is developed largely on internal Git hosting. The Git repository on GitHub may not reflect the latest status.

Motivation

ART is an excellent tool for simulating reads from a reference genome. However, it comes with the following limitations:

So, we developed art_modern with the following ideas:

Quick Start

Build the project using:

mkdir -p build_release
env -C build_release cmake -DCMAKE_BUILD_TYPE=Release -DCEU_CM_SHOULD_ENABLE_TEST=FALSE ..
env -C build_release make

The project binary will be available at build_release/art_modern.
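
The commands above rely on `env -C`, which is only available in GNU coreutils 8.28 and later. On systems without it, the same build can be sketched with CMake's own `-S`/`-B` options. This assumes CMake 3.13 or newer and that the function is called from the repository root; `build_art_modern` is a hypothetical helper name, not something shipped with the project.

```shell
# Sketch of the same release build without `env -C`.
# Assumes CMake >= 3.13 (for -S/-B) and the repository root as working directory.
# `build_art_modern` is a hypothetical helper name.
build_art_modern() {
    # Configure into build_release with the same variables as above,
    # then build with whichever generator CMake selected.
    cmake -S . -B build_release \
        -DCMAKE_BUILD_TYPE=Release \
        -DCEU_CM_SHOULD_ENABLE_TEST=FALSE &&
        cmake --build build_release
}
```

Calling `build_art_modern` from the repository root should leave the binary at the same location, build_release/art_modern.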

Installation

Dependencies

CMake Variables

CMake variables should be set when invoking cmake. For example,

cmake -DBUILD_SHARED_LIBS=ON

sets BUILD_SHARED_LIBS to ON.

Usage

Mode

The parallelization strategy for each combination of mode and input parser is as follows:

Parser \ Mode  wgs       trans  templ
memory         Coverage  Batch  Batch
htslib         Coverage  ERROR  ERROR
stream         ERROR     Batch  Batch
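
A wrapper script can encode this table as a small lookup and reject unsupported combinations before launching a run. The following is only a sketch: `parallel_strategy` is a hypothetical helper, not part of art_modern, and its entries are transcribed from the table above.

```shell
# Hypothetical helper, not part of art_modern.
# Usage: parallel_strategy <parser> <mode>
# Prints Coverage, Batch, or ERROR as listed in the table above.
parallel_strategy() {
    case "$1:$2" in
        memory:wgs | htslib:wgs)
            echo "Coverage" ;;
        memory:trans | memory:templ | stream:trans | stream:templ)
            echo "Batch" ;;
        htslib:trans | htslib:templ | stream:wgs)
            # Combination marked ERROR in the table: report it and fail.
            echo "ERROR"
            return 1 ;;
        *)
            echo "unknown parser/mode: $1/$2" >&2
            return 2 ;;
    esac
}
```

For example, `parallel_strategy htslib trans || exit 1` stops a script early for a combination the table marks as ERROR.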

Input Formats

Currently, we support input in FASTA and PBSim3 transcripts formats.

For FASTA format: in read names, only the characters before the first blank space are read.
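
To preview which names would result from this truncation, the headers of a FASTA file can be listed with awk. This is a sketch under the assumption that "blank space" means a space or a tab; `fasta_names` is a hypothetical helper and does not call art_modern itself.

```shell
# Hypothetical helper: print, for each FASTA header line, the text after
# '>' up to the first space or tab (awk's default field separator).
# Reads the given files, or standard input when no file is given.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$@"
}

# Example with a made-up record: only "seq1" is printed.
printf '>seq1 example description\nACGT\n' | fasta_names
```

Checking the output for duplicates (for instance with `sort | uniq -d`) can reveal headers that differ only after the first blank and would therefore collapse to the same name.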

A compatibility matrix is as follows:

Parser \ Mode  wgs    trans                       templ
memory         FASTA  FASTA | PBSim3 Transcripts  FASTA | PBSim3 Transcripts
htslib         FASTA  FASTA                       FASTA
stream         FASTA  FASTA | PBSim3 Transcripts  FASTA | PBSim3 Transcripts

Library Construction Methods

FASTA Parsers

Changes Compared to Official ART Implementation

Changes in software functionality:

Changes in software engineering:

Acknowledgements

This simulator is based on the work of Weichun Huang <whduke@gmail.com> et al., released under the GNU GPL v3 license. The software was originally distributed here, with the following reference:

The bundled HTSLib library is distributed under the MIT License, with the following reference:

TODO

FAQ

How do I split the produced paired-end/mate-pair sequencing results into two files?

This can be done with seqtk. For example, to split tmp/test_small_pe/NC_001416.1.fq:

# Read 1
seqtk seq -1 tmp/test_small_pe/NC_001416.1.fq > tmp/test_small_pe/NC_001416.1_1.fq
# Read 2
seqtk seq -2 tmp/test_small_pe/NC_001416.1.fq > tmp/test_small_pe/NC_001416.1_2.fq
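
If seqtk is not installed, a similar split can be sketched with awk alone. This assumes that the file is interleaved (read 1, read 2, read 1, ...) and that every FASTQ record spans exactly four lines; `split_interleaved_fastq` is a hypothetical helper name, not part of art_modern or seqtk.

```shell
# Hypothetical helper: split an interleaved FASTQ file into two files.
# Usage: split_interleaved_fastq <interleaved.fq> <read1.fq> <read2.fq>
# Assumes four lines per record and strictly alternating mates.
split_interleaved_fastq() {
    awk -v r1="$2" -v r2="$3" '
        { pos = (NR - 1) % 8 }          # line position within one read pair (0..7)
        pos < 4 { print > r1; next }    # first four lines of the pair: read 1
        { print > r2 }                  # remaining four lines: read 2
    ' "$1"
}
```

For the file above, this would be called as `split_interleaved_fastq tmp/test_small_pe/NC_001416.1.fq tmp/test_small_pe/NC_001416.1_1.fq tmp/test_small_pe/NC_001416.1_2.fq`. Both output files should then contain the same number of lines.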