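A genome assembly correction and scaffolding pipeline using long reads, consisting of up to three steps: misassembly correction with Tigmint-long, scaffolding with ntLink, and optional further scaffolding with ARKS-long (ARCS in kmer mode).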
LongStitch was developed and designed by Lauren Coombe, Janet Li, Theodora Lo and Rene Warren.
If you use LongStitch in your research, please cite:
Coombe L, Li JX, Lo T, Wong J, Nikolic V, Warren RL and Birol I. LongStitch: high-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics 22, 534 (2021). https://doi.org/10.1186/s12859-021-04451-7
LongStitch is available from conda:
conda install -c bioconda -c conda-forge longstitch
All dependencies for LongStitch are also available from Homebrew:
brew tap brewsci/bio
brew install tigmint ntlink arcs
Alternatively, use the latest release tarball:
wget https://github.com/bcgsc/LongStitch/releases/download/v1.0.5/longstitch-1.0.5.tar.gz
For example, to run the default pipeline on a draft assembly draft-assembly.fa with the reads reads.fa.gz and a genome size of gsize:
longstitch run draft=draft-assembly reads=reads G=gsize
Note that specifying G is required when span=auto for Tigmint-long, and that all input sequence files should be in single-line fasta/fastq format.
The output scaffolds can be found in soft-links with the suffix longstitch-scaffolds.fa
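If a draft assembly was written with wrapped sequence lines, it needs to be unwrapped to meet the single-line requirement noted above. The following is a minimal sketch using awk, not part of LongStitch; the function name `unwrap_fasta` and the file names in the comments are hypothetical. It handles FASTA only, since FASTQ records are normally already single-line.

```shell
# Hypothetical helper (not shipped with LongStitch): join wrapped sequence
# lines so that each FASTA record is one header line plus one sequence line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }  # flush previous record, print header
        { seq = seq $0 }                                          # accumulate wrapped sequence lines
        END { if (seq != "") print seq }                          # flush the last record
    '
}

# Example invocations (file names are placeholders):
#   unwrap_fasta < draft-wrapped.fa > draft-assembly.fa
#   gzip -dc reads-wrapped.fa.gz | unwrap_fasta | gzip > reads.fa.gz
```

The function reads standard input and writes standard output, so it can sit in a pipe with gzip for compressed reads.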
To test your LongStitch installation and see examples of how to run the pipeline, see tests/run_longstitch_demo.sh
To run the demo script, ensure all dependencies are in your PATH, and run the bash script:
cd tests
./run_longstitch_demo.sh
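Before running the demo, it can help to confirm that the dependencies really are on your PATH. The sketch below is a generic check and not part of LongStitch; the function name `check_deps` is hypothetical, and the executable names in the usage comment are assumptions about what your installation provides, so adjust them as needed.

```shell
# Hypothetical helper: report which of the named executables are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing from PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (tool names are assumptions; edit to match your installation):
#   check_deps tigmint-make ntLink arcs minimap2 && ./run_longstitch_demo.sh
```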
To run the LongStitch pipeline, you can use the Makefile driver script longstitch.
Usage: ./longstitch [COMMAND] [OPTION=VALUE]…
Commands:
  run                  run default LongStitch pipeline: Tigmint, then ntLink
  tigmint-ntLink-arks  run full LongStitch pipeline: Tigmint, ntLink, then ARCS in kmer mode
  tigmint-ntLink       run Tigmint, then ntLink (Same as 'run' target)
  ntLink-arks          run ntLink, then run ARCS in kmer mode
General options (required):
  draft       draft name [draft]. File must have .fa extension
  reads       read name [reads]. The reads file can be uncompressed or gzipped.
              Accepted read file extensions: .fq, .fq.gz, .fastq, .fastq.gz, .fa, .fa.gz, .fasta, .fasta.gz
General options (optional):
  t           number of threads [8]
  z           minimum size of contig (bp) to scaffold [1000]
  out_prefix  if supplied, final scaffolds will be soft-linked to <out_prefix>.scaffolds.fa
Tigmint options:
  span     min number of spanning molecules to be considered correctly assembled [auto]
  dist     maximum distance between alignments to be considered the same molecule [auto]
  G        haploid genome size (bp) for calculating span parameter (e.g. '3e9' for human genome). Required when span=auto [0]
  longmap  long read technology - used for minimap2 preset. 'ont' for nanopore, 'pb' for pacbio, 'hifi' for pacbio HiFi reads [ont]
ntLink options:
  k_ntLink  k-mer size for minimizers [32]
  w         window size for minimizers [100]
  gap_fill  use gap-filling feature [False]
  rounds    number of ntLink rounds [1]
ARCS+LINKS options:
  j       minimum fraction of read kmers matching a contigId (used in kmer mode) [0.05]
  k_arks  size of a k-mer (used in kmer mode) [20]
  c       minimum aligned read pairs per molecule [4]
  l       minimum number of links to compute scaffold [4]
  a       maximum link ratio between two best contig pairs [0.3]
Notes:
- by default, span is automatically calculated as 1/4 of the sequence coverage of the input long reads
- G (genome size) must be specified if span=auto
- by default, dist is automatically calculated as p5 of the input long read lengths
- Ensure that all input files are in the current working directory, making soft-links if needed
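To make the automatic span and dist defaults in the notes above concrete, here is a rough sketch of the arithmetic only. It is not Tigmint's implementation: the function name `estimate_span_dist` is hypothetical, "p5" is read here as the 5th percentile using the nearest-rank method, and any rounding or minimum value the pipeline applies is not reproduced. Input is one read length per line on standard input; the genome size is the first argument.

```shell
# Hypothetical helper: illustrate span = (sequence coverage) / 4 and
# dist = 5th percentile of read lengths. Not the pipeline's own code.
# Usage: estimate_span_dist GENOME_SIZE < read_lengths.txt
estimate_span_dist() {
    sort -n | awk -v G="$1" '
        { len[NR] = $1; total += $1 }            # lengths arrive sorted ascending
        END {
            if (NR == 0 || G <= 0) exit 1        # need reads and a positive genome size
            cov = total / G                      # sequence coverage = total bases / genome size
            rank = int((NR * 5 + 99) / 100)      # nearest-rank index: ceil(0.05 * NR)
            printf "coverage=%.2f span=%.2f dist=%d\n", cov, cov / 4, len[rank]
        }
    '
}

# Read lengths can be extracted from a single-line FASTA with, for example:
#   gzip -dc reads.fa.gz | awk '!/^>/ { print length($0) }' | estimate_span_dist 3000000000
```

For instance, reads totalling 10,000 bp against a 500 bp genome give 20x coverage and a span of 5. This also shows why G is required when span=auto: without a genome size, coverage cannot be computed.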
- The default k (k_ntLink) and w (w) values for ntLink generally work well, but (depending on your input data) you may get better results by tuning these parameters. Suggested ranges:
  - k_ntLink (k-mer size): 24-40
  - w (window size): 100-500
- To run the full pipeline, specify tigmint-ntLink-arks as the target in your command.
- The default pipeline (run, Tigmint-long + ntLink) is recommended. However, if you want to maximize scaffolding and contiguity, running the additional ARKS-long step (tigmint-ntLink-arks) is often valuable.
- minimap2 is used for mapping reads in the Tigmint step.
  - To change the minimap2 (-x) preset used for mapping, specify longmap=<mode>.
  - For nanopore reads use longmap=ont (default), or for PacBio use longmap=pb.
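- To run in a directory other than the current working directory, use the -C dir option with the longstitch command.
  - The input files must be present in the directory where longstitch runs, for example as soft-links - these can either be created manually or using the longstitch make_links command.
  - The make_links command requires reads_path and draft_path to be set - full paths to the reads file and draft fasta file, respectively.
LongStitch Copyright (c) 2020 British Columbia Cancer Agency Branch. All rights reserved.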
LongStitch is released under the GNU General Public License v3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
For commercial licensing options, please contact Patrick Rebstein (prebstein@bccancer.bc.ca).