# Make sure you have installed Rust >= 1.72.0-nightly & minimap2 >= 2.23
cargo --version
minimap2 --version
# Install JTK
git clone https://github.com/ban-m/jtk.git
cd jtk
cargo build --release
./target/release/jtk --help
# Run JTK on a test ONT ultra-long read dataset
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.fastq.gz
gunzip COX_PGF.fastq.gz
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.toml
./target/release/jtk pipeline -p COX_PGF.toml 2> test.log
See the Installation section and How to run JTK section for more details.
jtk
JTK is a targeted diploid genome assembler aimed at haplotype-resolved sequence reconstruction of medically important, difficult-to-assemble regions such as the HLA and LILR+KIR regions of the human genome. JTK accurately assembles a pair of (near-)complete haplotype sequences of a specified genomic region de novo, typically from noisy ONT ultra-long reads (and optionally from any other type of long-read dataset).
[adapted from Masutani et al., Bioinformatics, 2023]
First, check the version of the Rust language and minimap2 and update them if necessary.
cargo --version
If the Rust version is older than 1.72.0-nightly, run `rustup update` to update Rust.
minimap2 --version
If the minimap2 version is older than 2.23, or minimap2 is not installed, install a newer version of minimap2 from its GitHub repository.
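The version comparison can also be scripted. The following is a minimal sketch, not part of JTK: `version_ge` is a hypothetical helper name, it assumes a `sort` that supports `-V` (GNU coreutils), and it assumes that `minimap2 --version` prints something like `2.24-r1122`.

```shell
# version_ge A B: succeed when version A is the same as or newer than version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed output format of `minimap2 --version`: "2.24-r1122".
# The "-r..." suffix is stripped before comparing.
if command -v minimap2 >/dev/null 2>&1; then
    mm2_version=$(minimap2 --version)
    if version_ge "${mm2_version%%-*}" 2.23; then
        echo "minimap2 ${mm2_version} is recent enough"
    else
        echo "minimap2 ${mm2_version} is older than 2.23; please update" >&2
    fi
else
    echo "minimap2 not found in PATH" >&2
fi
```

The same helper could be applied to the numeric part of the `cargo --version` output.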
Then, compile JTK.
git clone https://github.com/ban-m/jtk.git
cd jtk
cargo build --release
./target/release/jtk --version
`./target/release/jtk` is the resulting binary executable of JTK.
[Optional] Lastly, move the executable, `./target/release/jtk`, to any location included in the `$PATH` variable.
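One possible way to do the optional step is sketched below. `install_jtk` is a hypothetical helper, and `$HOME/.local/bin` is only an example destination; any directory already listed in `$PATH` works equally well.

```shell
# install_jtk SRC DEST_DIR: copy the built binary into DEST_DIR
# with executable permissions, creating DEST_DIR if needed.
install_jtk() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" && install -m 755 "$src" "$dest_dir/"
}

# Hypothetical usage, run from the root of the cloned repository:
#   install_jtk ./target/release/jtk "$HOME/.local/bin"
#   export PATH="$HOME/.local/bin:$PATH"   # only if the directory is not in PATH yet
#   jtk --version
```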
jtk
JTK has many subcommands corresponding to each specific step, but the following command does everything and is sufficient for most cases:
jtk pipeline -p <config-toml-file>
How to write the TOML-formatted config file, `<config-toml-file>`, is described in detail in the sections below: How to run JTK and How to tune JTK.
The full description of all the subcommands of JTK can be viewed with `jtk --help`:
USAGE:
jtk [SUBCOMMAND]
OPTIONS:
-h, --help Print help information
-V, --version Print version information
SUBCOMMANDS:
assemble Assemble reads.
correct_clustering Correct local clustering by EM algorithm.
correct_deletion Correct deletions of chunks inside the reads.
encode Encode reads by alignments (Internally invoke `minimap2` tools).
encode_densely Encoding homologoud diplotig in densely.
entry Entry point. It encodes a fasta file into JSON file.
estimate_multiplicity Determine multiplicities of chunks.
extract Extract all the information in the packed file into one tsv
help Print this message or the help of the given subcommand(s)
mask_repeats Mask Repeat(i.e., frequent k-mer)
partition_local Clustering reads. (Local)
pick_components Take top n largest components, discarding the rest and empty reads.
pipeline Run pipeline based on the given TOML file.
polish Polish contigs.
polish_encoding Remove nodes from reads.
purge_diverged Purge diverged clusters
select_chunks Pick subsequence from raw reads.
squish Squish erroneous clusters
stats Write stats to the specified file.
In this section, we assume that the following shell variables are defined appropriately for your input data and environment:
Input Data | Bash variable name in this README |
---|---|
Path to the FASTA file of reads (here we assume 60x ONT ultra-long reads) | `$READS` |
Path to the FASTA file of reference genome sequences (e.g. `chm13v2.0.fa` of T2T-CHM13) | `$REFERENCE` |
Chromosome range of the target genomic region (e.g. `chr1:10000000-15000000`) | `$REGION` |
Path to the config file for JTK (a template file is provided as described below) | `$CONFIG` |
Number of threads | `$THREADS` |
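As an illustration, the variables could be defined as follows. The file names `all_reads.fasta` and `my_region.toml` and the thread count are placeholders, not values prescribed by JTK; the reference name and region are the examples from the table.

```shell
# Placeholder values: replace every path with one that exists on your system.
READS=all_reads.fasta            # hypothetical file name
REFERENCE=chm13v2.0.fa           # e.g. T2T-CHM13
REGION=chr1:10000000-15000000    # example range from the table
CONFIG=my_region.toml            # hypothetical file name
THREADS=8                        # assumption; adjust to your machine

# Abort early with a message if any variable is empty or unset.
: "${READS:?set READS}" "${REFERENCE:?set REFERENCE}" "${REGION:?set REGION}"
: "${CONFIG:?set CONFIG}" "${THREADS:?set THREADS}"
```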
NOTE:
The reference genome sequences, `$REFERENCE`, are used only for extracting reads derived from the target genomic region, `$REGION`, and not for assembly itself.
The target region, `$REGION`, should be smaller than 10 Mbp and should not start or end within a segmental duplication region.
First of all, you need to extract the reads originating from the target region, which will be the input reads for JTK.
This can be done by mapping the reads to the reference with `minimap2` and then using `samtools` with the specified chromosome range of the target genomic region:
minimap2 -x map-ont -t "$THREADS" --secondary=no -a "$REFERENCE" "$READS" |
samtools sort -@"$THREADS" -OBAM > aln.bam
samtools index aln.bam
samtools view -OBAM aln.bam "$REGION" |
samtools fasta > reads.fasta
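After the extraction, it can be worth checking how much data ended up in `reads.fasta`, since this README assumes roughly 60x coverage. The sketch below is not part of JTK; `fasta_stats` is a hypothetical helper that counts reads and bases with `awk`.

```shell
# fasta_stats FILE: print "<number of reads> <total bases>" for a FASTA file.
fasta_stats() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "%d %d\n", n, bases }' "$1"
}

# Hypothetical usage: estimate coverage over a region of known length.
#   set -- $(fasta_stats reads.fasta)
#   echo "reads: $1, approx. coverage: $(( $2 / 5000000 ))x"   # 5000000 = region length in bp
```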
The resulting file, `reads.fasta`, will be the input file of ONT reads for JTK, i.e. `$READS`.
Then, create a config file for JTK.
The file `example.toml` in the root of this GitHub repository is a template for the config file. Users are expected to copy and modify this file to create their own config file, `$CONFIG`. The contents of `example.toml` are as follows:
# example.toml
### The input file. Fasta and FASTQ is supported. Compressed files are not supported.
input_file = "input.fa"
### The sequencing platform. ONT, CCS, or CLR.
read_type = "ONT"
### The size of the target region, should be <10M. It is OK to use SI suffix, such as M or K.
region_size = "5M"
### Output directory
out_dir = "./"
### Output prefix. The final assembly would be `out_dir/prefix.gfa`.
prefix = "temp"
...
`input_file` and `region_size` will likely need to be modified.
`input_file` must be the same as `$READS` prepared in the previous step.
`region_size` must be calculated from the value of `$REGION` (i.e. end position minus start position).
`sed` is useful for generating a config file from the template without manual edits. For example, the following command assigns the value of `$READS` as the name of the input read file.
sed -e "/^input_file/c input_file = \"$READS\"" example.toml > "$CONFIG"
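Both fields can be filled in from the shell in one step. The following is a sketch under stated assumptions: `region_size` and `make_config` are hypothetical helper names, both keys are assumed to start a line in the template (as in the excerpt of `example.toml`), and a plain number of base pairs is assumed to be accepted for `region_size` in place of an SI-suffixed value. Paths containing `|`, `&`, or backslashes would need escaping.

```shell
# region_size REGION: print end minus start for a "chrom:start-end" string.
region_size() {
    range=${1##*:}        # e.g. "10000000-15000000"
    start=${range%%-*}    # e.g. "10000000"
    end=${range##*-}      # e.g. "15000000"
    echo $(( end - start ))
}

# make_config TEMPLATE READS REGION: write a config to stdout with
# input_file and region_size replaced; all other lines are kept as they are.
make_config() {
    size=$(region_size "$3")
    sed -e "s|^input_file *=.*|input_file = \"$2\"|" \
        -e "s|^region_size *=.*|region_size = \"$size\"|" "$1"
}

# Hypothetical usage:
#   make_config example.toml "$READS" "$REGION" > "$CONFIG"
```

The substitution form (`s|...|...|`) is used here instead of the `c` command because the one-line `c` syntax is a GNU sed extension.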
Finally, run JTK with the config file.
jtk pipeline -p $CONFIG
out.gfa
Additionally, JTK outputs the following files that are useful for downstream analyses:
`<prefix>.sam` (`prefix` is defined in the config file)
`<prefix>.coverage.tsv`
# Check that the version of minimap2 is 2.23 or later
minimap2 --version
# The version of JTK should be greater than 0.1
jtk --version
# Download the test input ONT reads (only of the HLA region of ~5Mbp)
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.fastq.gz
gunzip COX_PGF.fastq.gz
# Download the config file prepared for the test ONT dataset
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.toml
# Run JTK with the test dataset and its associated config file
jtk pipeline -p COX_PGF.toml 2> test.log
After running the commands above, there should exist `./cox_pgf/temp.gfa`, the resulting assembly graph file containing consensus contig sequences.
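To inspect the contigs outside a graph viewer, the segment records of the GFA can be converted with standard tools. This sketch relies only on the GFA 1 layout (tab-separated `S` lines holding a name and a sequence); the helper names are hypothetical and nothing here is specific to JTK.

```shell
# gfa_to_fasta FILE: write every segment (S line) that carries a sequence
# as a FASTA record.
gfa_to_fasta() {
    awk -F '\t' '$1 == "S" && $3 != "*" { print ">" $2; print $3 }' "$1"
}

# gfa_segment_lengths FILE: print "<segment name><TAB><sequence length>" per segment.
gfa_segment_lengths() {
    awk -F '\t' '$1 == "S" && $3 != "*" { print $2 "\t" length($3) }' "$1"
}

# Hypothetical usage with the test output:
#   gfa_segment_lengths ./cox_pgf/temp.gfa
#   gfa_to_fasta ./cox_pgf/temp.gfa > temp.contigs.fasta
```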
The config file (whose template is `example.toml`) offers several tunable parameters that influence the final assembly result:
`purge_copy_num`: JTK discards every chunk whose estimated multiplicity in the underlying genome is greater than this value. Therefore, increasing this value could improve the assembly when the graph is fragmented or when the coverage of edges in the GFA (such as `cv:i:12`) is low. However, accurately clustering a chunk whose multiplicity is, say, 12 is quite challenging, so increasing this value too much can worsen the assembly. The default value, 8, is typically the sweet spot of this trade-off.
`min_span`: Any repeat in the assembly graph is resolved if the repeat is spanned by at least this number of reads. This value controls how aggressively JTK resolves repeats (smaller is more aggressive).
The following parameter also has an impact on the assembly result, but it is not recommended to "tune" it:
`seed`: Seed value for the random number generator.
Bansho Masutani banmasutani@gmail.com
Masutani et al., Bioinformatics, 2023