Scaffolding draft assemblies using reference assemblies and minimizer graphs
ntJoin takes a target assembly and one or more 'reference' assemblies as input, and uses information from the reference(s) to scaffold the target assembly. The 'reference' assemblies can be true reference assembly builds or other draft genome assemblies.
Instead of costly alignments, ntJoin takes a more lightweight approach, using minimizer graphs to yield a mapping between the input assemblies.
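To make the idea of minimizers concrete, here is a minimal Python sketch. It is not ntJoin's implementation: it assumes a window of w consecutive k-mers, uses zlib.crc32 as a stand-in hash function, and the names minimizers and shared_minimizers are hypothetical. ntJoin specifies w in bp, so the tool's windowing may differ from this illustration.

```python
import zlib


def minimizers(seq, k, w):
    """Return ordered (hash, position) pairs, one per window of w consecutive k-mers."""
    # Hash every k-mer; crc32 is only a stand-in for a real k-mer hash.
    hashes = [zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Leftmost k-mer with the smallest hash in this window.
        offset = min(range(w), key=window.__getitem__)
        entry = (window[offset], start + offset)
        # Adjacent windows often share a minimizer; record it once.
        if not picked or picked[-1] != entry:
            picked.append(entry)
    return picked


def shared_minimizers(seq_a, seq_b, k, w):
    """Minimizer hashes found in both sequences: anchors for an alignment-free mapping."""
    hashes_a = {h for h, _ in minimizers(seq_a, k, w)}
    hashes_b = {h for h, _ in minimizers(seq_b, k, w)}
    return hashes_a & hashes_b


target = "ACGTTGCATGCCGATAGCTAGGCTA"
reference = "TTTT" + target  # same sequence with extra leading bases
print(len(shared_minimizers(target, reference, k=5, w=4)))
```

Because every window of the target also occurs in the longer reference, all of the target's minimizers are recovered in the reference without aligning the two sequences.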
Main steps in the algorithm:
Original concept: Rene Warren
Design and implementation: Lauren Coombe
Thank you for using, developing and promoting this free software!
If you use ntJoin in your research, please cite:
Lauren Coombe, Vladimir Nikolic, Justin Chu, Inanc Birol, Rene L Warren: ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs. Bioinformatics (2020) doi: https://doi.org/10.1093/bioinformatics/btaa253.
Usage: ntJoin assemble target=<target scaffolds> references='List of reference assemblies' reference_weights='List of weights per reference assembly'
Options:
target              Target assembly to be scaffolded in fasta format
references          List of reference files (separated by a space, in fasta format)
target_weight       Weight of target assembly [1]
reference_weights   List of weights of reference assemblies
prefix              Prefix of intermediate output files [out.k<k>.w<w>.n<n>]
t                   Number of threads [4]
k                   K-mer size for minimizers [32]
w                   Window size for minimizers (bp) [1000]
n                   Minimum graph edge weight [1]
g                   Minimum gap size (bp) [20]
G                   Maximum gap size (bp) (0 if no maximum) [0]
m                   Minimum percentage of increasing/decreasing minimizer positions to orient contig [90]
mkt                 If True, use Mann-Kendall Test to predict contig orientation (computationally-intensive, overrides 'm') [False]
agp                 If True, output AGP file describing output scaffolds [False]
no_cut              If True, will not cut contigs at putative misassemblies [False]
overlap             If True, attempts to detect and trim overlaps between joined sequences [True]
time                If True, will log the time for each step [False]
reference_config    Config file with reference assemblies and reference weights as comma-separated values (See README for example)
                    This is optional, and will override the 'references' and 'reference_weights' variables if specified
Notes:
- Ensure the lists of reference assemblies and weights are in the same order, and that both are space-separated
- Ensure all assembly files are in the current working directory
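The m option above sets a threshold on the percentage of increasing or decreasing minimizer positions used to orient a contig. The sketch below is one possible reading of that rule, not ntJoin's code: the function name orient_contig, the comparison of adjacent positions, and the '+', '-' and None return values are all assumptions made for illustration.

```python
def orient_contig(positions, m=90):
    """Guess a contig's orientation from its minimizer positions.

    `positions` are the minimizer positions on the contig, listed in the
    order the minimizers appear along the reference-guided path.
    Returns '+', '-', or None when neither direction reaches m percent.
    """
    steps = list(zip(positions, positions[1:]))
    if not steps:
        return None  # a single minimizer carries no direction
    increasing = 100 * sum(b > a for a, b in steps) / len(steps)
    decreasing = 100 * sum(b < a for a, b in steps) / len(steps)
    if increasing >= m:
        return "+"
    if decreasing >= m:
        return "-"
    return None


print(orient_contig([100, 350, 900, 1400]))  # steadily increasing positions
print(orient_contig([1400, 900, 350, 100]))  # steadily decreasing positions
```

With the default m=90, a contig whose positions mostly increase is kept in forward orientation, one whose positions mostly decrease is reverse-complemented, and a mixed signal leaves the orientation undetermined.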
Running ntJoin help prints the help documentation.
ntJoin assemble target=my_scaffolds.fa target_weight=1 references='assembly_ref1.fa' reference_weights='2' k=32 w=500
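When ntJoin is driven from a script, the command above can be assembled programmatically so that the reference and weight lists stay in the same order. This is a hedged sketch: build_ntjoin_command is a hypothetical helper, and it only uses parameters listed in the usage text (target, target_weight, references, reference_weights, k, w).

```python
import shlex


def build_ntjoin_command(target, references, reference_weights,
                         target_weight=1, k=32, w=1000):
    """Build an 'ntJoin assemble' argument list from paired references and weights."""
    if len(references) != len(reference_weights):
        raise ValueError("each reference assembly needs exactly one weight")
    return [
        "ntJoin", "assemble",
        f"target={target}",
        f"target_weight={target_weight}",
        # Both lists are space-separated and must be in the same order.
        "references=" + " ".join(references),
        "reference_weights=" + " ".join(str(x) for x in reference_weights),
        f"k={k}",
        f"w={w}",
    ]


cmd = build_ntjoin_command("my_scaffolds.fa", ["assembly_ref1.fa"], [2], w=500)
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

The resulting list can be passed to subprocess.run as-is; passing a list rather than a single string avoids shell quoting problems when several references are given.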
The reference config file lists one reference assembly and its weight per line, as comma-separated values:
reference1.fa,reference1_weight
reference2.fa,reference2_weight
reference_config can be used for determining the reference(s) and reference weight(s) instead of references and reference_weights.
If both reference_config and the references variables are specified, reference_config will override the other variables.
Config files can be found in the tests directory: test_config_single.csv, test_config_multiple.csv
ntJoin assemble target=my_scaffolds.fa target_weight=1 reference_config=config_file.csv k=32 w=500
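A config file in the comma-separated format described above can be generated with Python's csv module. The sketch below is illustrative: the helper names and the weights 2 and 1 are made-up examples, and the file is written to a temporary directory only so that the snippet runs anywhere.

```python
import csv
import os
import tempfile


def write_reference_config(path, references_with_weights):
    """Write one 'reference,weight' line per reference assembly."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for reference, weight in references_with_weights:
            writer.writerow([reference, weight])


def read_reference_config(path):
    """Read the config back as (reference, weight) string pairs."""
    with open(path, newline="") as handle:
        return [(name, weight) for name, weight in csv.reader(handle)]


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config_file.csv")
    # Hypothetical weights, chosen only for the example.
    write_reference_config(config_path, [("reference1.fa", 2), ("reference2.fa", 1)])
    with open(config_path) as handle:
        config_text = handle.read()
    round_trip = read_reference_config(config_path)

print(config_text, end="")
```

Keeping each reference next to its weight in one file removes the need to keep two space-separated lists in the same order.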
<target assembly>.k<k>.w<w>.n<n>.all.scaffolds.fa
<prefix>.path
<prefix>.mx.dot
<prefix>.agp
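The file name patterns above can be expanded for a given run with a few lines of Python. This is a sketch under stated assumptions: expected_outputs is a hypothetical helper, <target assembly> is taken to be the target file name as passed on the command line, the default prefix follows the out.k<k>.w<w>.n<n> pattern from the options, and the AGP file is listed only when agp=True.

```python
def expected_outputs(target, k=32, w=1000, n=1, prefix=None, agp=False):
    """List the output file names suggested by the documented patterns."""
    params = f"k{k}.w{w}.n{n}"
    if prefix is None:
        prefix = f"out.{params}"  # documented default prefix
    outputs = [
        f"{target}.{params}.all.scaffolds.fa",
        f"{prefix}.path",
        f"{prefix}.mx.dot",
    ]
    if agp:
        outputs.append(f"{prefix}.agp")
    return outputs


for name in expected_outputs("my_scaffolds.fa", w=500, agp=True):
    print(name)
```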
n=2, otherwise use the default n=1
Detection and trimming of overlaps between joined sequences is controlled by the overlap parameter, and is on (overlap=True) by default. To turn this behaviour off, specify overlap=False.
conda install -c bioconda -c conda-forge ntjoin=1.1.5
ntJoin can be installed using Homebrew on macOS or Linuxbrew on Linux:
brew install brewsci/bio/ntjoin
curl -L --output ntJoin-1.1.5.tar.gz https://github.com/bcgsc/ntJoin/releases/download/v1.1.5/ntJoin-1.1.5.tar.gz && tar xvzf ntJoin-1.1.5.tar.gz
Python dependencies can be installed with:
pip3 install -r requirements.txt
See tests/test_installation.sh to test your ntJoin installation and see an example command.
ntJoin Copyright (c) 2020 British Columbia Cancer Agency Branch. All rights reserved.
ntJoin is released under the GNU General Public License v3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca