cschin / peregrine-2021

Other
58 stars 6 forks source link

Peregrine-2021: A faster and minimum genome assembler

Peregrine-2021 is an genome assembler designed for long-reads that have good enough accuracy. It is written with the Rust language. The main method used in the genome assembler is described in Human Genome Assembly in 100 Minutes.

PeregrineLogo

System requirement:

A modern Linux workstation or compute node with enough disk, CPUs, and RAM. It is better to have a good number of CPUs (my testing system has 20 cores) and a good amount of RAM (~ total 1.5x of the reads data set). For example, for 100G sequences, it is probably good to have at least 150G RAM. A smaller amount, e.g., 32G, works, but you will need some manual setup for effective computation.

Some Ballpark Performance Summary

With a proper hardware (e.g. ~1Tb RAM), Peregrine-2021 had successful assembled a total 30G diploid genome (2n = 30G) with a contig N50 = 55.2Mb for a large diploid genome. (For who might want to know more details, I ran it as a unpaid service so I don't have much other infomation. Link to the the graph of the assembly.)

For a typical human-size assembly, a much cheaper compute instance with from 128G to 512G RAM can work well. (see this blog Accelerating genome assembly with AWS Graviton2), and it takes only 2 to 3 hours wall o'clock time to get an assembly. (We also provide a "fast" mode eliminating one error correct stage for perfect reads as input.)

On the accuracy side, here, we only have some rough numbers from the earlier version compared to other assemblers. Ironically, it could take more effort and resources to do a comprehensive benchmark for a publication than writing the assembler itself. Unfortunately, benchmarking work is a luxury that I currently do not have. However, suppose if anyone is interested in taking a shot running the code fine-tuning the parameters to evaluate the results correctly, in that case, I might help a bit from time to time.

Usage:

  1. Put the paths to the sequence read files (fasta / fasta.gz / fastq / fastq.gz, compressed with the standard gzip but not bgzip) in a file, e.g. reads.lst, so the Peregrin-2021 assembler can find the read data. For example, this shows the content of a reads.lst file:

    $ cat seq.lst 
    /wd/CHM13_20k/m64062_190803_042216.fastq.gz
    /wd/CHM13_20k/m64062_190804_172951.fastq.gz
    /wd/CHM13_20k/m64062_190806_063919.fastq.gz
    /wd/CHM13_20k/m64062_190807_194840.fastq.gz
  2. Make sure your have enough disk (preferably SSD storage or high performance network filesystem) for a working directory. Let’s call the working directory asm_out.

  3. Execute: pg_asm reads.lst asm_out from command line / shell, some potentially useful intermediate files and the assembled contigs will be in the directory asm_out/

$ pg_asm seq.lst asm_out >& log &
  1. There are a number of options that you can try to tune for optimizing the assembly results. Here is the full usage information of pg_asm.
❯ pg_asm --help
pg_asm peregrine-2021 0.4.9 (arm_config:58e666e+, release build, linux [x86_64] [rustc 1.58.0 (02072b482 2022-01-11)])
Jason Chin <jason@omnibio.ai>

Peregrine-2021 genome assembler,
LICENSE: http://creativecommons.org/licenses/by-nc-sa/4.0/

USAGE:
    pg_asm [FLAGS] [OPTIONS] <input_reads> <work_dir> [ARGS]

FLAGS:
        --fast          run the assembler in the fast mode
    -h, --help          Prints help information
        --keep          keep intermediate files
        --no_resolve    disable resolving repeats / dups at the end
    -V, --version       Prints version information

OPTIONS:
    -b, --bestn <bestn>              number of best overlaps for initial graph [default: 6]
    -k <k>                           Kmer size [default: 56]
    -l <layout_method>               layout version [default: 2]
        --log <log>                  log level: DBBUG or INFO (default)
    -c, --min_ec_cov <min_ec_cov>    Minimum error coverage [default: 1]
    -r <r>                           Reduction factor [default: 6]
    -t, --tol <tol>                  Alignment tolerance [default: 0.01]
    -w <w>                           Window size [default: 80]

ARGS:
    <input_reads>    Path to a file that contains the list of reads in .fa .fa.gz .fastq or fastq.gz formats
    <work_dir>       The path to a work directory for intermediate files and the results
    <NTHREADS>       Number of threads
    <NCHUNKS>        Number of partition

You can reduce the Reduction factor and Window Size to increase the sensitivity to detect overlaps. I found -r 4 -w 64 may be better for human assembly than the default parameters.

Example Results: (v0.2.0, main:6d91294, release build, linux [x86_64])

Input data set: Human CHM13 dataset

Run Peregrine-2021 in Fast-mode (without read level error correction)

CPU time for the fast-mode:

usr: 152025s = 42.5 cpu hours, 

sys: 900s = 0.25 cpu hours, 

wall clock time = 2:45:39

(In contrast, it takes HiCanu > 4000 CPU hour to get an assembly)

Assembly Summary Statistics with the fast-mode:

total size: 3,034,243,471
max size: 201,005,110
N50 size: 81,361,265
N90 size: 17,301,175
Number of Contigs: 237
Number of Contigs > 100kb: 151

CHM13 BAC evaluation result with the fast-mode setting:

******************* BAC SUMMARY ******************
 TOTAL    : 341
 BP       : 51532183
************** Statistics for: _asm_p_ctg-2.fa ****************
BACs closed: 327 (95.8944); BACs attempted: 338 %good = 96.7456; BASES 49454907 (95.969)
Median:          99.9844
MedianQV:        38.06875
Mean:            99.93617
MeanQV:          31.94969
***** STATS IGNORING INDELS ********************
Median:          100
MedianQV:        Inf
Mean:            99.99177
MeanQV:          40.84457
**********************************************

(This HiCanu paper preprint reported resolving only 326 out of the 341 BAC BACs.)

CHM13 BAC evaluation result with the fast-mode setting but without final contig level consensus:

(This pretty much just reflects the quality of the input reads.)

******************* BAC SUMMARY ******************
 TOTAL    : 341
 BP       : 51532183
************** Statistics for: asm-p_cns-nocns_fast.fa ****************
BACs closed: 327 (95.8944); BACs attempted: 339 %good = 96.4602; BASES 49454907 (95.969)
Median:          99.7163 
MedianQV:        25.47141 
Mean:            99.67459 
MeanQV:          24.87566 
***** STATS IGNORING INDELS ********************
Median:          99.9782 
MedianQV:        36.61544 
Mean:            99.97005 
MeanQV:          35.23557 
**********************************************

Run Peregrine-2021 in Standard-mode (one round of the read level error correction)

CPU time for the standard setting which involves one round of read level error correction:

usr: 355257s = 98.7 cpu hours,  

sys: 2052s = 0.6 cpu hours, 

wall clock time = 5:47:23

(In contrast, it takes HiCanu > 4000 CPU hour to get an assembly)

Assembly Summary Statistics of the standard Setting:

total size: 3,039,592,838
max size: 142,204,433
N50 size: 83,143,579
N90 size: 16,250,509
Number of Contigs: 227
Number of Contigs > 100kb: 157

CHM13 BAC evaluation result:

******************* BAC SUMMARY ******************
 TOTAL    : 341
 BP       : 51532183
************** Statistics for: _asm_p_ctg.fa ****************
BACs closed: 330 (96.7742); BACs attempted: 341 %good = 96.7742; BASES 49874385 (96.783)
Median:          99.9911
MedianQV:        40.5061
Mean:            99.94317
MeanQV:          32.45434
***** STATS IGNORING INDELS ********************
Median:          100
MedianQV:        Inf
Mean:            99.99405
MeanQV:          42.25218
**********************************************

CHM13 BAC evaluation result without final contig level consensus:

******************* BAC SUMMARY ******************
 TOTAL    : 341
 BP       : 51532183
************** Statistics for: asm-p_cns-nocns.fa ****************
BACs closed: 330 (96.7742); BACs attempted: 341 %good = 96.7742; BASES 49874385 (96.783)
Median:          99.9832
MedianQV:        37.74691
Mean:            99.93411
MeanQV:          31.81188
***** STATS IGNORING INDELS ********************
Median:          99.9994
MedianQV:        52.21849
Mean:            99.9927
MeanQV:          41.36713
**********************************************

Some FAQs:

Peregrine-2021 is still just another OLC assembler.

  1. It uses the SHIMMER index for finding overlap candidates as described in the Peregrine preprint but it is more aggressive for overlapping repetitive reads rather than filtering them out.

  2. The overlaper performs analysis to identify read overlaps within the same haplotype. We did not use a de-Bruijn graph approach for this. We think our method is likely more computational efficient than using a de-Bruijn graph to separate haplotypes.

  3. It adapts the techniques using partial homopolymer compression for separating the reads from different haplotype.

Build For X86_64

  1. Check Rust Installation

  2. Run cargo install or cargo build. Make sure you set up the environment variable PATH to the directory of the built binaries or you can run the excutable pg_asm with full path. If you use cargo build, make sure you compile it with the --release option for optimization.

  3. pg_asm will run the assembly pipeline end-to-end. If it fails, it does not re-use the existing data when one runs pg_asm again. The assemblers is much faster than other assemblers, so it is less important to re-use intermidate data. That has been said, the built will contains executables (e.g. pg_build_idx, pg_ovlp, etc.) for each assembly steps which one can chain them together with their favorite workflow engine for re-using and re-starting an assembly pipeline.

  4. A Dockerfile is provided for creating a Docker image. It also provide information to build the assembler from a clean environment.

  5. To compile in aarch64, it will need some configuration changes to get the best performance. The memory alloctor package needs to be patched for aarch64. See https://github.com/cschin/mimalloc_rust/tree/aarch64_build.

Other utility command line tools

pg_build_sdb  # convert fasta/fastq/fasta.gz/fastq.gz read data into a simple binary data base for the assembler to fetch the reads. 

pg_build_idx  # build the SHIMMER index from the reads for overlapping

pg_build_sdb  # build the sequence database

pg_dedup      # perform all contigs to all contigs alignment to remove duplicates 

pg_dp_graph   # take overlap data file as input to generate the layout file using a polyploid aware layout algorithm

pg_getreads   # generate fasta file for a subset of reads from the sequence database

pg_graph      # (obsoleted) convert the overlap information between the reads into an assembly group

pg_layout     # convert the assembly graph to paths and generate the contig fasta file

pg_ovlp_ec    # perform error correction from the haplotype specific overlaps

pg_ovlp       # generate haplotype specific overlaps between the reads

pg_resolve    # this tool aligns all contigs to themselve to identify haplotype-related contigs

-- Jason Chin (twitter: @infoecho)

first version: Nov. 16, 2020

current version: Fab. 5, 2022