ban-m / jtk

JTK -- a regional diploid genome assembler
MIT License
23 stars 2 forks source link
bioinfomatics genome-assembly

JKT -- Targeted Diploid Genome Assembler

GHAPassing

Getting Started

# Make sure you have installed Rust >= 1.72.0-nightly & minimap2 >= 2.23
cargo --version
minimap2 --verion

# Install JTK
git clone https://github.com/ban-m/jtk.git
cd jtk
cargo build --release
./target/release/jtk --help

# Run JTK on a test ONT ultra-long read dataset
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.fastq.gz
gunzip COX_PGF.fastq.gz
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.toml
./target/release/jtk pipeline -p COX_PGF.toml 2> test.log

See the Installation section and How to run JTK section for more details.

Table of Contents

Introduction

JTK is a targeted diploid genome assembler aimed for haplotype-resolved sequence reconstruction of medically important, difficult-to-assemble regions such as HLA and LILR+KIR regions in a human genome. JTK accurately assembles a pair of two (near-)complete haplotype sequences of a specified genomic region de novo typically from noisy ONT ultra-long reads (and optionally from any other types of long read datasets).

[adapted from Masutani et al., Bioinformatics, 2023]

Features (for general users)

Algorithmic Features (for developers)

Installation

Requirements

Step-by-step Instruction

  1. First, check the version of the Rust language and minimap2 and update them if necessary.

    cargo --version

    If the version of Rust is smaller than 1.72.0-nightly, run $ rustup update to update Rust.

    minimap2 --verion

    If the version of minimap2 is smaller than 2.23 or minimap2 is not installed, install a newer version of minimap2 from its GitHub repository.

  2. Then, compile JTK.

    git clone https://github.com/ban-m/jtk.git
    cd jtk
    cargo build --release
    ./target/release/jtk --version

    ./target/release/jtk is the resulting binary executable of JTK.

  3. [Optional] Lastly, move the executable, ./target/release/jtk, to any location included in the $PATH variable.

The Command: jtk

JTK has many subcommands corresponding to each specific step, but the following command does everything and is sufficient for most cases:

jtk pipeline -p <config-toml-file>

How to write the TOML-formatted config file, <config-toml-file>, is described in detail in the sections below: How to run JTK and How to tune JTK.

The full description of all the subcommands of JTK can be viewed with $ jtk --help:

USAGE:
    jtk [SUBCOMMAND]

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    assemble                 Assemble reads.
    correct_clustering       Correct local clustering by EM algorithm.
    correct_deletion         Correct deletions of chunks inside the reads.
    encode                   Encode reads by alignments (Internally invoke `minimap2` tools).
    encode_densely           Encoding homologoud diplotig in densely.
    entry                    Entry point. It encodes a fasta file into JSON file.
    estimate_multiplicity    Determine multiplicities of chunks.
    extract                  Extract all the information in the packed file into one tsv
    help                     Print this message or the help of the given subcommand(s)
    mask_repeats             Mask Repeat(i.e., frequent k-mer)
    partition_local          Clustering reads. (Local)
    pick_components          Take top n largest components, discarding the rest and empty reads.
    pipeline                 Run pipeline based on the given TOML file.
    polish                   Polish contigs.
    polish_encoding          Remove nodes from reads.
    purge_diverged           Purge diverged clusters
    select_chunks            Pick subsequence from raw reads.
    squish                   Squish erroneous clusters
    stats                    Write stats to the specified file.

How to Run JTK

Input

In this section, we assume we have the following shell variables with values defined appropriately based on your input data and environment:

Input Data Bash variable name in this README
Path to the FASTA file of reads
(Here we assume 60x ONT ultra-long reads)
$READS
Path to the FASTA file of reference genome sequences
(e.g. chm13v2.0.fa of T2T-CHM13)
$REFERENCE
Chromosome range of the target genomic region
(e.g. chr1:10000000-15000000)
$REGION
Path to the config file for JTK
(Template file is provided as described below)
$CONFIG
Number of threads $THREADS

NOTE:

Step-by-step Usage

  1. First of all, you need to extract reads originated from the target region, which will be the input reads for JTK.

    • This can be done by mapping all the reads to the reference genome with minimap2 and by using samtools with the specified chromosome range of the target genomic region:
    minimap2 -x map-ont -t $THREADS --secondary=no -a $REFERENCE $READS |
        samtools sort -@$THREADS -OBAM > aln.bam
    samtools index aln.bam
    samtools view -OBAM aln.bam $REGION |
        samtools fasta > reads.fasta
    • Here the resulting file, reads.fasta, will be the input file of ONT reads for JTK, i.e. $READS.
  2. Then, create a config file for JTK.

    • There is a file named example.toml in the root of this GitHub repository, which is a template for the config file. Users are assumed to copy and modify this file to create their own config file, $CONFIG. The contents of example.toml are as follows:
    # example.toml
    
    ### The input file. Fasta and FASTQ is supported. Compressed files are not supported.
    input_file = "input.fa"
    ### The sequencing platform. ONT, CCS, or CLR.
    read_type = "ONT"
    ### The size of the target region, should be <10M. It is OK to use SI suffix, such as M or K.
    region_size = "5M"
    
    ### Output directory
    out_dir = "./"
    ### Output prefix. The final assembly would be `out_dir/prefix.gfa`.
    prefix = "temp"
    
    ...
    • In many cases, at least input_file and region_size will likely need to be modified.
      • The value of input_file must be the same as $READS prepared in the previous step.
      • The value of region_size must be calculated from the value of $REGION (i.e. end position minus start position).
    • Other options and parameters are explained in detail in the next section: How to Tune JTK.
    • sed is useful for generating a config file from the template without manual edits. For example, the following command assigns the value of $READS as the name of the input read file.
    cat example.toml |
        sed -e "/^input_file/c input_file = \""$READS"\"" > $CONFIG
    • Specifying absolute paths is recommended for the input read file and the output directory.
  3. Finally, run JTK with the config file.

    jtk pipeline -p $CONFIG
    • This creates several JSON files and GFA files of assmbly graphs as described in detail below.
    • JTK creates temporary files at the current directory. Please execute it at the location where you have a write permission.

Output Files

Additionally, JTK outputs the following files that are useful for downstream analyses:

Running JTK on a Test Data Set

# Check if the version of minimap2 is greater than 2.23
minimap2 --version
# The version of JTK should be greater than 0.1
jtk --version
# Download the test input ONT reads (only of the HLA region of ~5Mbp)
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.fastq.gz
gunzip COX_PGF.fastq.gz
# Download the config file prepared for the test ONT dataset
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.toml
# Run JTK with the test dataset and its associated config file
jtk pipeline -p COX_PGF.toml 2> test.log

After running the commands above, there should exist ./cox_pgf/temp.gfa, the resulting assembly graph file containing consensus contig sequences.

How to Tune JTK

The config file (whose template is example.toml) offers several tunable parameters that influences the final assembly result:

The following parameter also has an impact on the assembly result, but it is not recommended to "tune" it:

Limitation

Contact

Bansho Masutani banmasutani@gmail.com

Citation

Masutani et al., Bioinformatics, 2023

TODO for this README