CMU-SAFARI / RawHash

RawHash can accurately and efficiently map raw nanopore signals to reference genomes of varying sizes (e.g., from viral to human genomes) in real time without basecalling. It is described by Firtina et al. (published at https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440).
GNU General Public License v3.0

RawHash and Rawsamble Overview

RawHash (and RawHash2) is a hash-based mechanism to map raw nanopore signals to a reference genome in real time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., the human genome).

Rawsamble is a mechanism that finds overlaps between raw signals without a reference genome (all-vs-all overlapping). The overlap information is written in PAF format and can be used by assemblers such as miniasm to construct de novo assemblies.

The figure below gives an overview of the steps that RawHash takes to find matching regions between a reference genome and a raw nanopore signal.

To efficiently identify similarities between a reference genome and reads, RawHash has two steps, similar to regular read mapping tools, 1) indexing and 2) mapping. The indexing step generates hash values from the expected signal representation of a reference genome and stores them in a hash table. In the mapping step, RawHash generates the hash values from raw signals and queries the hash table generated in the indexing step to find seed matches. To map the raw signal to a reference genome, RawHash performs chaining over the seed matches.
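The two steps can be illustrated with a toy sketch. This is not RawHash's implementation: it assumes the signals are already quantized into small integers, uses the k joined values directly as the table key instead of a real hash function, and replaces chaining with a simple vote over diagonals (reference position minus read position). The function name toy_seed_vote and the two-line input layout are hypothetical.

```shell
# Toy illustration of hash-based seeding (NOT RawHash's actual code).
# Assumptions: each input line is a space-separated list of quantized
# signal values; line 1 is the reference, line 2 is the read.
toy_seed_vote() {
    # $1 = number of consecutive values per seed (k)
    awk -v k="$1" '
    # Join k consecutive values starting at index i into one key.
    function key(arr, i,    j, s) {
        s = arr[i]
        for (j = 1; j < k; j++) s = s "," arr[i + j]
        return s
    }
    NR == 1 {                        # indexing step: fill the table
        n = split($0, ref, " ")
        for (i = 1; i + k - 1 <= n; i++) {
            h = key(ref, i)
            if (h in table) table[h] = table[h] " " i
            else table[h] = i
        }
    }
    NR == 2 {                        # mapping step: query the table
        m = split($0, read, " ")
        for (i = 1; i + k - 1 <= m; i++) {
            h = key(read, i)
            if (!(h in table)) continue
            c = split(table[h], pos, " ")
            # Every hit is a seed match; vote for its diagonal.
            for (p = 1; p <= c; p++) votes[pos[p] - i]++
        }
        best = ""; bestn = 0
        for (d in votes) if (votes[d] > bestn) { best = d; bestn = votes[d] }
        # Print the winning diagonal and its number of seed matches.
        if (bestn > 0) print best, bestn
    }'
}
```

For example, `printf '%s\n' "1 2 3 4 5 6 7 8" "4 5 6 7" | toy_seed_vote 3` prints `3 2`: two seeds of the read match the reference at an offset of 3 positions. A read with no seed match prints nothing.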

RawHash can be used to map reads from FAST5, POD5, SLOW5, or BLOW5 files to a reference genome in sequence format.

RawHash performs real-time mapping of raw nanopore signals. When the prefix of a read can be mapped to a reference genome, RawHash will stop mapping and provide the mapping information in PAF format. We follow a PAF template similar to the one used in UNCALLED and Sigmap to report the mapping information.

Recent changes

Installation

git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash2
cd rawhash2 && make

If the compilation is successful, the path to the binary will be bin/rawhash2.

Compiling with HDF5, SLOW5, and POD5

We are aware that some of the pre-compiled libraries (e.g., POD5) may not work on your system, in which case you may need to compile them from scratch. You may also prefer not to compile the HDF5, SLOW5, or POD5 libraries at all if you are not going to use them. RawHash2 provides a flexible Makefile to enable custom compilation of these libraries.

#Provide the path to all of the HDF5/SLOW5/POD5 include and lib directories during compilation
make HDF5_INCLUDE_DIR=/path/to/hdf5/include HDF5_LIB_DIR=/path/to/hdf5/lib \
     SLOW5_INCLUDE_DIR=/path/to/slow5/include SLOW5_LIB_DIR=/path/to/slow5/lib \
     POD5_INCLUDE_DIR=/path/to/pod5/include POD5_LIB_DIR=/path/to/pod5/lib

#Provide the path to only POD5 include and lib directories during compilation
make POD5_INCLUDE_DIR=/path/to/pod5/include POD5_LIB_DIR=/path/to/pod5/lib

#Disables compiling HDF5
make NOHDF5=1

#Disables compiling SLOW5 and POD5
make NOSLOW5=1 NOPOD5=1

Usage

Getting help

You can print the help message to learn how to use rawhash2:

rawhash2

or

rawhash2 -h

Indexing

Indexing is similar to minimap2's usage. We additionally include the pore models located under ./extern.

Below is an example that generates an index file ref.ind for the reference genome ref.fasta using a certain k-mer model located under extern and 32 threads.

rawhash2 -d ref.ind -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 ref.fasta

Note that you can jump directly to mapping without creating the index because RawHash2 is able to generate the index relatively quickly on-the-fly within the mapping step. However, a real-time genome analysis application may still prefer generating the index before the mapping step. Thus, we suggest creating the index before the mapping step.
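One way to follow this suggestion in a script is to build the index only when it does not exist yet. The sketch below is a hypothetical helper (the name ensure_index, its argument order, and the RAWHASH2 variable are assumptions); the rawhash2 flags are the ones from the indexing example above.

```shell
# Path to the rawhash2 binary; override if it is not on your PATH.
RAWHASH2="${RAWHASH2:-rawhash2}"

# Build the index only if it is missing or empty.
# $1 = index file, $2 = pore model, $3 = reference FASTA, $4 = threads
ensure_index() {
    if [ -s "$1" ]; then
        echo "reusing $1"
    else
        "$RAWHASH2" -d "$1" -p "$2" -t "${4:-1}" "$3"
    fi
}

# Example (not run here):
#   ensure_index ref.ind /path/to/pore.model ref.fasta 32
```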

Mapping

It is possible to provide inputs as FAST5 files from multiple directories. It is also possible to provide a list of files matching a certain pattern such as test/data/contamination/fast5_files/Min*.fast5.

rawhash2 -t 32 ref.ind test/data/contamination/fast5_files/Min*.fast5 test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 -o mapping.paf ref.ind test/data/d1_sars-cov-2_r94/fast5_files

IMPORTANT: If there are many FAST5 files that rawhash2 needs to process (e.g., thousands of them), we suggest that you specify only the directories that contain these FAST5 files.
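If the files are spread over many subdirectories, the list of directories can be collected with standard tools. The helper below is a sketch (the name fast5_dirs is hypothetical); it lists every directory that directly contains a .fast5 file, so it does not rely on any assumption about whether rawhash2 descends into subdirectories.

```shell
# Print the unique directories under a root that contain .fast5 files.
# $1 = root directory to search
fast5_dirs() {
    find "$1" -type f -name '*.fast5' -exec dirname {} \; | sort -u
}

# Example (not run here; assumes directory names contain no spaces):
#   rawhash2 -t 32 -o mapping.paf ref.ind $(fast5_dirs test/data)
```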

RawHash2 also provides a set of default parameters that can be preset automatically.

rawhash2 -t 32 -x viral ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 -x sensitive ref.ind test/data/d4_green_algae_r94/fast5_files > mapping.paf
rawhash2 -t 32 -x fast ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf

RawHash2 provides another set of default parameters that can be used for very large metagenomic samples (>10G). To search efficiently, this preset uses minimizer seeding, which is slightly less accurate than the non-minimizer mode but much faster (around 3x).

rawhash2 -t 32 -x faster ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf

The output will be saved to mapping.paf in a modified PAF format used by UNCALLED.
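A quick way to inspect the result is to count how many records report a target. The sketch below assumes the standard PAF column order (column 6 is the target name) and assumes that unmapped reads carry "*" in that column; check both against your own output, since the modified format may differ. The function name paf_summary is hypothetical.

```shell
# Count PAF records with and without a reported target.
# Assumption: column 6 is the target name and is "*" for unmapped reads.
# $1 = PAF file
paf_summary() {
    awk -F'\t' '
        NF < 6     { next }                 # skip malformed lines
        $6 == "*"  { unmapped++; next }
                   { mapped++ }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}

# Example (not run here):
#   paf_summary mapping.paf
```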

Rawsamble (for overlapping and assembly construction)

Our new overlapping mechanism, Rawsamble, is now integrated in RawHash. To create overlaps, first construct the index from signals:

rawhash2 -x ava-small -p ../../rawhash2/extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -d ava.ind -t32 test/data/d3_yeast_r94/fast5_files/

Then perform overlapping using this index:

rawhash2 -x ava-small -t32 ava.ind test/data/d3_yeast_r94/fast5_files/ > ava.paf

We provide the following presets for Rawsamble to enable the overlapping mode (shown in the help message):

Rawsamble Presets:
                 - ava      All-vs-all overlapping mode (default for Rawsamble)
                 - ava-sensitive        More sensitive all-vs-all overlapping mode. Can be slightly slower than ava but likely to generate longer unitigs in downstream assembly
                 - ava-viral        All-vs-all overlapping for very small genomes such as viral genomes.
                 - ava-large        All-vs-all overlapping for large genomes of size > 10Gb

Potential issues you may encounter during mapping

Your reads in FAST5 files may be compressed with the VBZ compression from Nanopore. In that case, you have to download the proper HDF5 plugin and make sure it can be found by your HDF5 library:

export HDF5_PLUGIN_PATH=/path/to/hdf5/plugins/lib

If you have conda, you can simply install the following package (ont_vbz_hdf_plugin) in your environment and use rawhash2 while the environment is active:

conda install ont_vbz_hdf_plugin
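Before mapping VBZ-compressed files, a check like the one below can confirm that HDF5_PLUGIN_PATH points at a directory that holds a plugin library. This is a sketch under two assumptions: the variable names a single directory, and the plugin file name starts with libvbz_hdf_plugin (adjust the pattern if your installation uses a different name). The function name check_vbz_plugin is hypothetical.

```shell
# Check that HDF5_PLUGIN_PATH points at a directory with a VBZ plugin.
# Assumption: the plugin file name starts with "libvbz_hdf_plugin".
check_vbz_plugin() {
    dir="${HDF5_PLUGIN_PATH:-}"
    if [ -z "$dir" ]; then
        echo "HDF5_PLUGIN_PATH is not set"
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "$dir is not a directory"
        return 1
    fi
    for f in "$dir"/libvbz_hdf_plugin*; do
        if [ -e "$f" ]; then
            echo "found $f"
            return 0
        fi
    done
    echo "no VBZ plugin found in $dir"
    return 1
}

# Example (not run here):
#   check_vbz_plugin && rawhash2 -t 32 -o mapping.paf ref.ind /path/to/fast5_files
```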

Reproducing the results

Please follow the instructions in the README file in test.

Upcoming Features

Citing RawHash, RawHash2, Rawsamble, and RawAlign

If you use RawHash (or RawHash2) in your work, please consider citing the following papers:

@article{firtina_rawhash_2023,
    title = {{RawHash}: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes},
    author = {Firtina, Can and Mansouri Ghiasi, Nika and Lindegger, Joel and Singh, Gagandeep and Cavlak, Meryem Banu and Mao, Haiyu and Mutlu, Onur},
    journal = {Bioinformatics},
    volume = {39},
    number = {Supplement_1},
    pages = {i297-i307},
    month = jun,
    year = {2023},
    doi = {10.1093/bioinformatics/btad272},
    issn = {1367-4811},
    url = {https://doi.org/10.1093/bioinformatics/btad272},
}

@article{firtina_rawhash2_2024,
    title = {{RawHash2}: mapping raw nanopore signals using hash-based seeding and adaptive quantization},
    volume = {40},
    issn = {1367-4811},
    url = {https://doi.org/10.1093/bioinformatics/btae478},
    doi = {10.1093/bioinformatics/btae478},
    number = {8},
    journal = {Bioinformatics},
    author = {Firtina, Can and Soysal, Melina and Lindegger, Joël and Mutlu, Onur},
    month = aug,
    year = {2024},
    pages = {btae478},
}

If you use Rawsamble (i.e., the all-vs-all overlapping functionality integrated in RawHash2), please consider citing the following work along with RawHash and RawHash2:

@article{firtina_rawsamble_2024,
  title = {{Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism}},
  author = {Firtina, Can and Mordig, Maximilian and Mustafa, Harun and Goswami, Sayan and Mansouri Ghiasi, Nika and Mercogliano, Stefano and Eris, Furkan and Lindegger, Joël and Kahles, Andre and Mutlu, Onur},
  journal = {arXiv},
  year = {2024},
  month = oct,
  doi = {10.48550/arXiv.2410.17801},
  url = {https://doi.org/10.48550/arXiv.2410.17801},
}

If you use RawAlign (i.e., the alignment functionality integrated in RawHash2), please consider citing the following work along with RawHash and RawHash2:

@article{lindegger_rawalign_2023,
    title = {{RawAlign}: {Accurate, Fast, and Scalable Raw Nanopore Signal Mapping via Combining Seeding and Alignment}},
    author = {Lindegger, Joël and Firtina, Can and Ghiasi, Nika Mansouri and Sadrosadati, Mohammad and Alser, Mohammed and Mutlu, Onur},
    journal = {arXiv},
    year = {2023},
    month = oct,
    doi = {10.48550/arXiv.2310.05037},
    url = {https://doi.org/10.48550/arXiv.2310.05037},
}

Acknowledgement

RawHash2 uses klib, some code snippets from Minimap2 (e.g., pipelining, hash table usage, DP and RMQ-based chaining) and the R9.4 segmentation parameters from Sigmap. RawHash2 uses the DTW integration as proposed in RawAlign (please see the citation details above).

We thank Melina Soysal and Marie-Louise Dugua for their feedback to improve the RawHash implementation and test scripts.