ekg / edyeet

base-accurate DNA sequence alignments using edlib and mashmap2
MIT License

edyeet

edyeet is a fork of MashMap that implements base-level alignment using edlib, via the wflign tiled wavefront global alignment algorithm. It completes MashMap with a high-performance alignment module capable of computing base-level alignments for very large sequences.

process

Each query sequence is broken into non-overlapping pieces whose length is set by -s[N], --segment-length=[N]. These segments are then mapped using MashMap's sliding minhash mapping algorithm and subsequent filtering steps. To reduce memory usage, the initial mappings are stored in a temporary file. Each mapping location is then used as a target for alignment using edlib.
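
As a rough illustration of the segmentation step, the sketch below lists the non-overlapping intervals that a given segment length would cut from a single sequence. The helper name segment_intervals is hypothetical. Keeping the final short piece as its own segment is an assumption, since how edyeet treats a tail shorter than the segment length is not described here.

```shell
# Hypothetical helper (not part of edyeet): print 0-based, half-open
# [start, end) intervals that tile a sequence without overlap.
# Usage: segment_intervals SEQUENCE_LENGTH SEGMENT_LENGTH
segment_intervals() {
    awk -v len="$1" -v seg="$2" 'BEGIN {
        if (seg <= 0) exit 1               # refuse a non-positive segment length
        for (start = 0; start < len; start += seg) {
            end = start + seg
            if (end > len) end = len       # assumption: the short tail is kept as its own piece
            printf "%d\t%d\n", start, end
        }
    }'
}

# A 12,000 bp sequence cut with a 5,000 bp segment length
segment_intervals 12000 5000
```

For this input the sketch reports three pieces: two full 5,000 bp segments and a 2,000 bp tail.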

The resulting alignments always contain extended CIGARs in the cg:Z:* tag. Approximate mapping (equivalent to MashMap) can be obtained with -m, --approx-map.
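
Because each alignment record carries its extended CIGAR in a cg:Z: tag, downstream scripts can pull it out with standard tools. This is a minimal sketch, assuming the usual PAF layout of 12 mandatory tab-separated columns followed by optional tags. The function name paf_cigars is made up for illustration.

```shell
# Hypothetical helper: for each PAF record, print the query name, the target
# name, and the extended CIGAR string taken from the cg:Z: tag.
# Reads the files given as arguments, or standard input if none are given.
paf_cigars() {
    awk -F '\t' '{
        # Optional tags start at column 13 in standard PAF
        for (i = 13; i <= NF; i++) {
            if ($i ~ /^cg:Z:/) {
                # Strip the 5-character "cg:Z:" prefix
                print $1 "\t" $6 "\t" substr($i, 6)
                break
            }
        }
    }' "$@"
}

# Example: paf_cigars aln.paf | head
```

Records without a cg:Z: tag produce no output in this sketch.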

Merging of mappings is disabled by default, because aligning merged approximate mappings with edlib under reasonable identity bounds can lead to very long runtimes. However, merging can be useful in some settings and is enabled with -M, --merge-mappings.

Sketching, mapping, and alignment are all run in parallel using a configurable number of threads. The number of threads must be set manually, using -t, and defaults to 1.

usage

edyeet has been developed to accelerate the alignment step in variation graph induction (the first step in the seqwish / smoothxg pipeline). Suitable default settings are provided for this purpose.

Four parameters shape the length, number, and identity of the resulting mappings:

Together, these settings allow us to precisely define an alignment space to consider. During all-to-all mapping, -X can additionally help us by removing self mappings from the reported set, and -Y extends this capability to prevent mapping between sequences with the same name prefix.
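
The effect of these two filters can be pictured as a post-hoc filter over PAF records. The sketch below is not how edyeet implements -X or -Y. The function names are hypothetical, and the prefix rule (everything before the first occurrence of a delimiter such as #) is an assumption, since how the prefix is determined is not described here.

```shell
# Hypothetical helper: drop PAF records whose query name (column 1) and
# target name (column 6) are identical. Names are compared as strings.
drop_self_mappings() {
    awk -F '\t' '($1 "") != ($6 "")' "$@"
}

# Hypothetical helper: drop records whose query and target names share a prefix.
# Assumption: the prefix is the part of the name before the first DELIM.
# Usage: drop_same_prefix DELIM [paf files...]
drop_same_prefix() {
    delim="$1"; shift
    awk -F '\t' -v d="$delim" '{
        q = $1; t = $6
        qi = index(q, d); ti = index(t, d)
        if (qi > 0) q = substr(q, 1, qi - 1)   # keep text before the delimiter
        if (ti > 0) t = substr(t, 1, ti - 1)
        if ((q "") != (t "")) print
    }' "$@"
}

# Example: drop_self_mappings aln.paf | drop_same_prefix '#' > filtered.paf
```

Filtering inside the mapper avoids computing alignments that would only be discarded, so a post-hoc filter like this one is useful mainly for inspecting existing output.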

examples

Map a set of query sequences against a reference genome:

edyeet reference.fa query.fa >aln.paf

Set a longer segment length to reduce spurious alignments:

edyeet -s 50000 reference.fa query.fa >aln.paf

Map a set of sequences against itself, using -X to exclude self mappings from the output:

edyeet -X query.fa query.fa >aln.paf

sequence indexing

edyeet provides a progress log that estimates time to completion. This depends on determining the total query sequence length. To prevent lags when starting a mapping process, users should apply samtools faidx to index query and target FASTA sequences. The resulting .fai indexes are then used to quickly compute the sum of query lengths.
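
The total query length can be read directly from such an index, because the second column of a .fai file holds each sequence's length. A minimal sketch of that sum follows. The helper total_length_from_fai is hypothetical and not part of edyeet, and it assumes the index already exists (for example, samtools faidx query.fa writes query.fa.fai).

```shell
# Hypothetical helper: sum the sequence lengths stored in column 2 of a
# FASTA index (.fai). Usage: total_length_from_fai query.fa.fai
total_length_from_fai() {
    # %.0f keeps genome-scale totals intact on awk builds with 32-bit %d
    awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# Example, after indexing with: samtools faidx query.fa
#   total_length_from_fai query.fa.fai
```

Reading the index is much faster than scanning the FASTA itself, which is why a missing index shows up as a delay at startup.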

installation

The build is orchestrated with cmake:

cmake -H. -Bbuild && cmake --build build -- -j 16

The edyeet binary will be in build/bin. To clean up, just remove the build directory.

publications