FMSI is a highly memory-efficient tool for performing membership queries on single $k$-mer sets. FMSI uses masked superstrings for storing the $k$-mer sets to ensure high compressibility for a wide range of different $k$-mer sets, and implements FMS-index, a simplification of the FM-index. It supports both streaming and single queries. The functionality implemented in FMSI is based on the following papers:
The memory consumption for FMSI are (w/o kLCP which is additional 1bit/superstring char):
[1] Ondřej Sladký, Pavel Veselý, and Karel Břinda: FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT). to appear on bioRxiv, 2024.
[2] Ondřej Sladký, Pavel Veselý, and Karel Břinda: Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets. bioRxiv 2024.03.06.583483, 2024. https://doi.org/10.1101/2024.03.06.583483
[3] Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717
To construct an index (the fmsi index
subcommand), FMSI accepts as input (see the -p
parameter) a masked superstring of the $k$-mer set.
The masked superstring can be computed by KmerCamel🐫.
It then stores the index file in files with the same prefix and the .fmsi.[component]
extension.
There are these components:
.ac_gt
, .ac
and .gt
for storing the nodes of the wavelet tree of BWT.mask
for storing the RRR-compressed SA-transformed mask.klcp
for storing the kLCP array (optional).misc
for storing counts, dollar position and kTo query the index (the fmsi query
subcommand), FMSI accepts a text file with $k$-mers on separate lines to query (see the -q
parameter).
In a very near future, we will however update this to accept a FASTA file and allow for streaming queries.
Additionally, FMSI experimentally supports set operations using the $f$-masked superstrings framework. However, this feature is currently only experimental and requires rather significant time and memory.
Prerequisities:
Installation: FMSI can be installed using bioconda:
conda install bioconda::fmsi
Alternatively, FMSI can be built directly from source:
git clone --recursive https://github.com/OndrejSladky/fmsi
cd fmsi && make -j
export PATH="$PATH:$(pwd)"
Compute an optimized masked superstring for all k-mers from a given FASTA file by KmerCamel🐫.
kmercamel -p input_file.fa -k 31 -c -o ms-no-opt.fa
kmercamel optimize -p ms-no-opt.fa -k 31 -c -o ms.fa
Create an FMS index from the masked superstring:
fmsi index -p ms.fa -k 31
Query the index from a given query FASTA file with k-mers:
fmsi query -p ms.fa -q query.fa -k 31 -O
If you do not need support for streaming queries, use the -s
flag when querying for additional memory savings.
If your mask superstring does not maximizes the number of ones in the mask, omit the -O
optimization flag for query as otherwise you might get incorrect results.
We, however, recommend to optimize the mask using kmercamel optimize
.
If you wish to perform set operations or answer membership queries without the need to understand the details of the $f$-masked superstring framework, FMSI can manage the details for you.
This can be done in a few simple steps:
kmercamel -c -k 31 -p fasta.fa -o ms.fa
(with appropriate values for -k
and -p
).kmercamel optimize -c -a zeros -k 31 -p ms_more_ones.fa -o ms.fa
. No need to optimize superstrings directly computed by KmerCamel🐫.fmsi index -p ms.fa
.fmsi [operation] -k 31 -p ms1.fa -p ms2.fa -r ms.fa
. Possible operations (use the names instead of [operation]
) are:
union
to compute the union.inter
to compute the intersectiondiff
to compute the set differencesymdiff
to compute the symmetric differencefmsi query -p ms.fa -q query.fa -k 31
.fmsi export -p ms.fa
.If you use FMSI this way, it ensures that operations in any order and queries to any index are computed correctly, while keeping the memory usage for queries as low as possible. Furthermore, exported $f$-masked superstrings are always, or-masked superstrings, which are the default masked superstrings.
The only downside to this approach is that each set operation uses compaction, which is the most time- and memory- consuming part of the process, which in some use cases might cause slowdowns which are not necessary. If this is your case, you probably want to stick to the advanced usage, managing the functions and building block methods yourself.
If you wish to manage the operations yourself, the workflow is quite similar to the basic usage, with the following changes:
fmsi merge
. Details on which $f$ should be used is described in Chapter 4 of the paper.fmsi compact
.-f
argument. The same applies to compaction.To run the tool, run fmsi [command]
The recognized commands are:
index
Creates a BWT based index of the given masked superstring.query
Queries a $k$-mer against an index.export
Export the underlying masked superstring.union
Performs union on two indexes. Expects or-MS.inter
Performs intersection on two indexes. Expects or-MS.diff
Performs set difference on two indexes. Expects or-MS.symdiff
Performs symmetric difference on two indexes. Expects or-MS.merge
Merge several indexes.compact
Compacts an index.Each command has its own set of arguments, which can be displayed by running fmsi [command] -h
.
FMSI builds on the representation of k-mer sets via masked superstrings, which ensures a generally compact representation of k-mers, requiring typically 1-1.4 characters per k-mer for genomes and pan-genomes.
FMSI then builds a simplified FM-index, which omits the sampled suffix array, which is the key to the memory efficiency. It instead computes the SA-transformed mask, which makes it possible to determine the presence of a k-mer without the need to compute the original coordinates.
Additionally, FMSI optionally constructs the kLCP array (a binary version of the truncated LCP array) to allow O(1) time for a k-mer in streaming queries.
The memory consumption of the index is split as follows:
$2.125$ bits per superstring character to store the BWT ($2.125 - 3$ bits per k-mer for typical genomes and pan-genomes).
Typically $0 - 0.8$ bits per k-mer to store the SA-transformed mask.
Optionally $1$ bit per superstring character to store the kLCP array.
$1$ byte per character of the largest entry in the queried fasta file (which is typically negligible).
The construction of the index requires the full suffix array to be computed and requires up to 17 bytes per superstring character.
To run the associated tests, simply run make test
.
Please use GitHub issues.
See Releases.