ZekunYin / RabbitMash

RabbitMash: an efficient highly optimized implementation of Mash.

RabbitMash

RabbitMash is an efficient, highly optimized implementation of Mash that takes full advantage of modern hardware features, including multi-threading, vectorization, and fast I/O.

Installation

The easiest way to use RabbitMash is to download a binary release from the project's releases page. Currently, RabbitMash supports only x86_64 Linux. Four builds are provided for different CPU instruction sets:

Version       CPU flags
mash_nosimd   no requirement
mash_sse4     sse4_1
mash_avx2     avx2
mash_avx512   avx512f avx512bw avx512vl

Run lscpu to check your CPU flags, then select the corresponding binary.
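As a rough sketch of this selection step, the shell function below maps a list of CPU flags to the binary names in the table above. The names has_flag and select_mash_binary are hypothetical helpers, not part of RabbitMash, and the fallback to /proc/cpuinfo assumes a Linux system that lists its flags there.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with RabbitMash): choose a binary name
# from a list of CPU flags, preferring the widest SIMD variant available.

# Succeeds if flag $2 appears as a whole word in the flag list $1.
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: select_mash_binary [FLAG...]
# With no arguments, the flags are read from /proc/cpuinfo (Linux only).
select_mash_binary() {
    local flags
    if [ "$#" -gt 0 ]; then
        flags="$*"
    else
        flags="$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
    fi
    # mash_avx512 needs all three AVX-512 flags listed in the table.
    if has_flag "$flags" avx512f && has_flag "$flags" avx512bw \
        && has_flag "$flags" avx512vl; then
        echo mash_avx512
    elif has_flag "$flags" avx2; then
        echo mash_avx2
    elif has_flag "$flags" sse4_1; then
        echo mash_sse4
    else
        echo mash_nosimd
    fi
}
```

Calling select_mash_binary with no arguments prints a suggestion for the current machine; passing flags explicitly (for example, select_mash_binary sse4_1 avx2) is useful when choosing a binary for a different host.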

All binaries have been tested on both Intel and AMD CPUs, as shown below.

CPU                    OS             mash_nosimd   mash_sse4   mash_avx2       mash_avx512
Intel Xeon Gold 6148   CentOS 8.1     pass          pass        pass            pass
AMD EPYC 7452          CentOS 7.6     pass          pass        pass            not supported
Intel Core i5-3470     Ubuntu 18.04   pass          pass        not supported   not supported

The binaries should also work on other Linux distributions. If you have problems running them, feel free to open an issue.

Notes

Process gzipped files

RabbitMash supports plain and gzipped FASTQ/FASTA files. When processing gzipped files, the performance of the sketch -i and screen operations is limited by decompression speed. Instead of processing gzipped files directly, a more efficient strategy is to work in two steps: (i) decompress the gzipped files to plain FASTQ/FASTA with libdeflate or pugz, and (ii) process the plain files with RabbitMash.

However, when you sketch a large dataset file by file (that is, without -i), processing gzipped files directly incurs little performance penalty.
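The two-step strategy can be sketched as a small shell wrapper. This is only an illustration under stated assumptions: decompress_then_run is a hypothetical helper, and the DECOMPRESS variable defaults to plain gzip; whether a faster tool accepts the same -dc arguments should be checked before swapping it in.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RabbitMash): decompress INPUT.gz to a
# temporary plain file, then run a command with that file as last argument.
# DECOMPRESS is an assumption: any program that writes decompressed data
# to stdout when called as "PROGRAM -dc FILE" (gzip by default).
DECOMPRESS="${DECOMPRESS:-gzip}"

# Usage: decompress_then_run INPUT.gz COMMAND [ARGS...]
decompress_then_run() {
    local gz_file="$1"
    shift
    local tmp_dir plain_file status
    tmp_dir="$(mktemp -d)" || return 1
    # Keep the original base name (minus .gz) for the plain file.
    plain_file="$tmp_dir/$(basename "${gz_file%.gz}")"
    # Step (i): decompress once, up front.
    if ! "$DECOMPRESS" -dc "$gz_file" > "$plain_file"; then
        rm -rf "$tmp_dir"
        return 1
    fi
    # Step (ii): run the command on the plain file.
    "$@" "$plain_file"
    status=$?
    rm -rf "$tmp_dir"
    return "$status"
}
```

For example, decompress_then_run test/reads1.fastq.gz ./mash sketch -i -p 8 -o reads1 would decompress first and then sketch the plain file; the input path, thread count, and output name here are placeholders.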

Build

Dependencies:

Build:

git clone https://github.com/ZekunYin/RabbitMash.git
cd RabbitMash
./bootstrap.sh
./configure [--prefix=...] [--with-capnp=...] [--with-gsl=...] \
            [--enable-simd=yes/no]
make -j4
#optional
make install
#optional
make test

Build dependency-free binary:

git clone https://github.com/ZekunYin/RabbitMash.git
cd RabbitMash
./bootstrap.sh
./configure [--prefix=...] [--with-capnp=...] [--with-gsl=...] \
            [--enable-simd=yes/no] [--enable-static-gsl=yes]     \
            [--enable-static-cpp=yes]
make -j4
#optional
make install
#optional
make test

You can also run ./configure -h to see the available configure options.

Install the dependencies on CentOS 8.1 (requires root privileges):

sudo dnf install capnproto capnproto-devel gsl gsl-devel

If you do not have root privileges, you need to build the dependencies from source.

Simple Usage

sketch:

./mash sketch test/genome1.fna -p nthreads -o test/genome1.fna.msh
./mash sketch test/genome2.fna -p nthreads -o test/genome2.fna.msh

dist:

./mash dist test/genome1.fna.msh test/genome2.fna.msh -p nthreads -o dist.bin
#optional
./mash dumpdist test/genome1.fna.msh test/genome2.fna.msh dist.bin -o dist.txt

triangle:

./mash triangle test/genome1.fna.msh -p nthreads -o tri.bin
#optional
./mash dumptri test/genome1.fna.msh tri.bin -o tri.txt

screen:

./mash screen test/genome1.fna.msh test/reads1.fastq -p nthreads > scr.out
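To show how the sketch, dist, and dumpdist commands above chain together, here is a minimal dry-run sketch that only prints the commands for two genomes. The function print_dist_pipeline and the MASH variable are hypothetical; the thread count defaults to the output of nproc, and file names containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print (do not execute) the commands that
# sketch two genomes, compute their distance, and dump it as text.
# Usage: print_dist_pipeline GENOME_A GENOME_B [THREADS]
print_dist_pipeline() {
    local a="$1" b="$2"
    # Default thread count: number of available cores, or 1 if unknown.
    local threads="${3:-$(nproc 2>/dev/null || echo 1)}"
    # Assumed location of the RabbitMash binary; override with MASH=...
    local mash="${MASH:-./mash}"
    printf '%s sketch %s -p %s -o %s.msh\n' "$mash" "$a" "$threads" "$a"
    printf '%s sketch %s -p %s -o %s.msh\n' "$mash" "$b" "$threads" "$b"
    printf '%s dist %s.msh %s.msh -p %s -o dist.bin\n' \
        "$mash" "$a" "$b" "$threads"
    printf '%s dumpdist %s.msh %s.msh dist.bin -o dist.txt\n' \
        "$mash" "$a" "$b"
}
```

Printing the commands first makes it easy to review them; piping the output to sh would then run them, for example print_dist_pipeline test/genome1.fna test/genome2.fna | sh.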

Documentation

RabbitMash is based on Mash. All functions and most parameters of RabbitMash are the same as in Mash. Type mash for a list of commands and mash <command_name> for help on a specific command.

See the Mash documentation (http://mash.readthedocs.org) for more information.

Commands and Parameters That Differ from Mash

New Parameters

sketch:

-fw #Create multiple msh files to keep the memory footprint low when sketching massive sequences.

dist:

-o <text> #Create a binary-format result file for better performance. If -o is not specified, text results are written to stdout.

triangle:

-o <text> #Create a binary-format result file for better performance. If -o is not specified, text results are written to stdout.

New Commands

mash dumpdist #Convert binary dist results to human-readable text.
mash dumptri  #Convert binary triangle results to human-readable text.

Bug Report

All bug reports, comments and suggestions are welcome.

Feel free to open a new issue. I can normally respond within one day if I'm not on vacation.

Cite

Zekun Yin, Xiaoming Xu, Jinxiao Zhang, Yanjie Wei, Bertil Schmidt, Weiguo Liu. RabbitMash: Accelerating hash-based genome analysis on modern multi-core architectures. Bioinformatics, btaa754. https://doi.org/10.1093/bioinformatics/btaa754

Limitations