OndrejSladky / fmsi

FMSI is an exact f-masked-superstrings-based index for membership queries and set operations on k-mer sets.
MIT License
9 stars 2 forks source link

FMSI

FMSI test

Introduction

FMSI is a highly memory-efficient tool for performing membership queries on single $k$-mer sets. FMSI uses masked superstrings for storing the $k$-mer sets to ensure high compressibility for a wide range of different $k$-mer sets, and implements FMS-index, a simplification of the FM-index. It supports both streaming and single queries. The functionality implemented in FMSI is based on the following papers:

The memory consumption for FMSI are (w/o kLCP which is additional 1bit/superstring char):

[1] Ondřej Sladký, Pavel Veselý, and Karel Břinda: FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT). to appear on bioRxiv, 2024.

[2] Ondřej Sladký, Pavel Veselý, and Karel Břinda: Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets. bioRxiv 2024.03.06.583483, 2024. https://doi.org/10.1101/2024.03.06.583483

[3] Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717

image

To construct an index (the fmsi index subcommand), FMSI accepts as input (see the -p parameter) a masked superstring of the $k$-mer set. The masked superstring can be computed by KmerCamel🐫. It then stores the index file in files with the same prefix and the .fmsi.[component] extension. There are these components:

To query the index (the fmsi query subcommand), FMSI accepts a text file with $k$-mers on separate lines to query (see the -q parameter). In a very near future, we will however update this to accept a FASTA file and allow for streaming queries.

Additionally, FMSI experimentally supports set operations using the $f$-masked superstrings framework. However, this feature is currently only experimental and requires rather significant time and memory.

Installation

Prerequisities:

Installation: FMSI can be installed using bioconda:

conda install bioconda::fmsi

Alternatively, FMSI can be built directly from source:

git clone --recursive https://github.com/OndrejSladky/fmsi
cd fmsi && make -j
export PATH="$PATH:$(pwd)"

Usage

k-mer queries (stable)

  1. Compute an optimized masked superstring for all k-mers from a given FASTA file by KmerCamel🐫.

    kmercamel -p input_file.fa -k 31 -c -o ms-no-opt.fa
    kmercamel optimize -p ms-no-opt.fa -k 31 -c -o ms.fa
  2. Create an FMS index from the masked superstring:

    fmsi index -p ms.fa -k 31
  3. Query the index from a given query FASTA file with k-mers:

    fmsi query -p ms.fa -q query.fa -k 31 -O

Specific-case usage

If you do not need support for streaming queries, use the -s flag when querying for additional memory savings.

If your mask superstring does not maximizes the number of ones in the mask, omit the -O optimization flag for query as otherwise you might get incorrect results. We, however, recommend to optimize the mask using kmercamel optimize.

k-mer set operations (experimental)

Basic: k-mer set operations managed by FMSI

If you wish to perform set operations or answer membership queries without the need to understand the details of the $f$-masked superstring framework, FMSI can manage the details for you.

This can be done in a few simple steps:

  1. Compute a masked superstring.
    • This can be done by KmerCamel🐫; simply run kmercamel -c -k 31 -p fasta.fa -o ms.fa (with appropriate values for -k and -p).
    • If you obtained the masked superstring in a different way, make sure it minimizes the number of ones; if you're unsure, you can use kmercamel optimize -c -a zeros -k 31 -p ms_more_ones.fa -o ms.fa. No need to optimize superstrings directly computed by KmerCamel🐫.
  2. Index the masked superstring with fmsi index -p ms.fa.
  3. Perform the set operations you wish with fmsi [operation] -k 31 -p ms1.fa -p ms2.fa -r ms.fa. Possible operations (use the names instead of [operation]) are:
    • union to compute the union.
    • inter to compute the intersection
    • diff to compute the set difference
    • symdiff to compute the symmetric difference
  4. Query the index with fmsi query -p ms.fa -q query.fa -k 31.
  5. To get back the underlying masked superstring, use fmsi export -p ms.fa.

If you use FMSI this way, it ensures that operations in any order and queries to any index are computed correctly, while keeping the memory usage for queries as low as possible. Furthermore, exported $f$-masked superstrings are always, or-masked superstrings, which are the default masked superstrings.

The only downside to this approach is that each set operation uses compaction, which is the most time- and memory- consuming part of the process, which in some use cases might cause slowdowns which are not necessary. If this is your case, you probably want to stick to the advanced usage, managing the functions and building block methods yourself.

Advanced: externally managed k-mer set operations

If you wish to manage the operations yourself, the workflow is quite similar to the basic usage, with the following changes:

Commands overview

To run the tool, run fmsi [command]

The recognized commands are:

Each command has its own set of arguments, which can be displayed by running fmsi [command] -h.

How it works?

FMSI builds on the representation of k-mer sets via masked superstrings, which ensures a generally compact representation of k-mers, requiring typically 1-1.4 characters per k-mer for genomes and pan-genomes.

FMSI then builds a simplified FM-index, which omits the sampled suffix array, which is the key to the memory efficiency. It instead computes the SA-transformed mask, which makes it possible to determine the presence of a k-mer without the need to compute the original coordinates.

Additionally, FMSI optionally constructs the kLCP array (a binary version of the truncated LCP array) to allow O(1) time for a k-mer in streaming queries.

The memory consumption of the index is split as follows:

The construction of the index requires the full suffix array to be computed and requires up to 17 bytes per superstring character.

Testing

To run the associated tests, simply run make test.

Issues

Please use GitHub issues.

Changelog

See Releases.

Licence

MIT

Contact

Ondrej Sladky \ondra.sladky@gmail.com\\