Kmer-File-Format / kff-reference

Contains the description of a file format to store kmers and associated values
GNU Affero General Public License v3.0
31 stars 4 forks source link

KFF 1 - Kmer File Format v1

This repository defines a universal file format to store DNA kmers and their associated data. The associated data for a kmer can be anything as long as the size is constant across kmers. For example, an integer count between 0 and 255 can be stored in a byte associated to each kmer.

For I/O APIs and tools, please have a look at the KFF organization repositories

Overview

A KFF file is composed of two parts (and 2 signatures):

For example, it contains the file format version and the binary encoding of nucleotides.

As described in the following parts, there are multiple types of sections and most of them are representing sequences in different ways. Each representation has its own set of advantages. For instance, overlapping kmers can be represented as sequences to save space. For example with k=3, a sequence ACTTG will represent the set of kmers {ACT, CTT, TTG}.

Byte alignment

For faster processing, all values are stored on multiples of 8 bits to match a byte. So, if a DNA sequence needs 12 bits to be represented, 16 bits will be used and the 4 most significant bits can be set arbitrarily. All the values (sequences and integers) are stored as big endian.

Convention

In the diagrams below, we will use the following graphical elements:

     +---+
     |   |   one byte
     +---+

     +--+--+
     |     |  two bytes
     +--+--+

     +--+--+--+--+
     |           |  four bytes (etc..)
     +--+--+--+--+

     +==============+
     |              |  a variable number of bytes
     +==============+

     +=====================+
     | ascii(Hello world!) |  the ascii representation of a string (here, "Hello world!")
     +=====================+

     +===============+
     | c-string(KFF) |  the ascii representation of a string (here, "Hello world!") plus a null character
     +===============+

     +=============+
     | 2bit(ACCTG) |   a DNA sequence (here, ACCTG) in 2 bit encoding, big endian byte order.
     +=============+

Header

The file header defines values that are valid for the whole file. Some variables, such as k, are not defined in the header but in a section (called 'values declarations'). That way, if multiple k values are used, the file can redefine k on the fly between sections.

Header structure (required elements):

+--------+---------------+---------------+----------+---+---+--+--+--+--+============+
| Marker | Major version | Minor version | Encoding | U | C | Free size | Free block |
+--------+---------------+---------------+----------+---+---+--+--+--+--+============+

Example:

+------------+----+----+----+---+---+--+--+--+--+=====================+
| ascii(KFF) | 02 | 04 | 2d | 1 | 1 | 0000000d  | ascii(Hello world!) |
+------------+----+----+----+---+---+--+--+--+--+=====================+

Sections

The main part of the file is a succession of sections. These sections can be of different types. The most important two are the 'values declarations' and the 'raw data' sections. Other sections are used in particular contexts to store sequences more efficiently than raw data sections or to augment the file with more information (like an index).

The first byte of each section defines its type.

Section: values declarations ('v')

A 'v' section can be seen as a scope where a set of values are defined. The scope ends as soon as we encounter another 'v' section. This means that all values defined in a scope are valid for that scope only.

The rationale for 'v' section is to define some variables that will apply to the following sequence sections (the value of k for example). In fact, some variables will be required to be defined. Those requirements will be explained later in this specification (in the 'values requirement' parts of the sequence sections 'r' and 'm'). Each value declaration is a (name,value) pair where a name is a ASCII text ending with a '\0' character, and a value is a 64 bits field. After the end of the scope, each value is reset to undefined.

Section contents:

+-------+--+--+--+--+--+--+--+--+========+--+--+--+--+--+--+--+--+===+=========+--+--+--+--+--+--+--+--+
| type  |         nb_vars       | name 1 |         value 1       | … | name X |         value X        |
+-------+--+--+--+--+--+--+--+--+========+--+--+--+--+--+--+--+--+===+=========+--+--+--+--+--+--+--+--+

Example:

+----------+--+--+--+--+--+--+--+--+=============+--+--+--+--+--+--+--+--+===============+--+--+--+--+--+--+--+--+=====================+--+--+--+--+--+--+--+--+
| ascii(v) |            3          | c-string(k) |          A            | c-string(max) |           ff          | c-string(data_size) |             1         |
+----------+--+--+--+--+--+--+--+--+=============+--+--+--+--+--+--+--+--+===============+--+--+--+--+--+--+--+--+=====================+--+--+--+--+--+--+--+--+

Section: raw sequences ('r')

This section is composed of a list of sequence/data pairs. A sequence S of size s represents a list of n kmers where n=s-k+1. The data linked to S is of size data_size * n. We call each of these pairs a block. The sequences are represented in a compacted way with 2 bits per nucleotide.

Values requirement:

Section contents:

Plain text example (k=10, max=255, data_size=1 (counts), A=0, C=2, G=3, T=1):

ACTAAACTGATT [32, 47, 1] : sequence translation 0b001001000000100111000101 or 0x2409c5
AAACTGATCG [12]          : sequence translation 0b00000010011100011011 or 0x0271b
CTAAACTGATT [1, 47]      : sequence translation 0b1001000000100111000101 or 0x2409c5

Same example translated as a raw sequence section:

+----------+--+--+--+--+--+--+--+--+---+===+====================+========+---+====================+====+---+===================+===========+
| ascii(r) |            3              | 3 | 2bit(ACTAAACTGATT) | 202f01 | 1 | 2bit(AAACTGATCG)   | 0c | 2 | 2bit(CTAAACTGATT) |    012f   |
+----------+--+--+--+--+--+--+--+--+---+===+====================+========+---+====================+====+---+===================+===========+

Section: sequences with minimizers ('m')

In many applications, kmers are bins using minimizer technics. It means that a succession of kmers or superkmers share a common substring. In this file format, in the case where you know sets of sequences that share this kind of substring, you can use minimizer sections. The common minimizer is stored at the beginning of the section and deleted in the sequences. A index is joined to the sequences to recall the minimize position and be able to reconstruct all the kmers.

Values needed:

Section contents:

Plain text example (k=10, m=8, max=255, data_size=1 (counts), A=0, C=2, G=3, T=1):

ACTAAACTGATT [32, 47, 1] : sequence translation 0b001001000000100111000101 or 0x2409c5
AAACTGATCG [12]          : sequence translation 0b00000010011100011011 or 0x0271b
CTAAACTGATT [1, 47]      : sequence translation 0b1001000000100111000101 or 0x2409c5

Same example translated as a minimizer sequence section:

+----------+================+--+--+--+--+--+--+--+--+--+===+===+============+========+===+===+==========+====+===+===+===========+======+
| ascii(m) | 2bit(AAACTGAT) |             3            | 3 | 3 | 2bit(ACTT) | 202f01 | 1 | 0 | 2bit(CG) | 0c | 2 | 2 | 2bit(CTT) | 012f |
+----------+================+--+--+--+--+--+--+--+--+--+===+===+============+========+===+===+==========+====+===+===+===========+======+

This small example have been reduced in size from 23 bytes to 22 bytes using minimizer blocks instead of raw blocks.

Section: index

Many applications need fast access to slices of files. The index section is a collection of relative pointers to the sections of a file. By reading first the index, one can directly jump to any section. It is useful for e.g. reading the KFF file in parallel, or accessing a section corresponding to a certain minimizer. The index section has been designed to store the start address of some or all the sections present in the file. Combined with a normalized footer (see Good practices section), random access to the sections is possible. Index sections can be chained: the last value of the I section is the position of the next index section (0 if last).

A side note regarding fast extraction of k-mer data: the Index section does not provide fast random access at the level of k-mers, but only at the level of entire sections. Yet, when k-mers are sorted within a section, then fast random access can be achieved by binary search (as done in KMC). Supporting o(log(n)) random accesses k-mers is on our roadmap, but is not currently a feature of KFF.

Values needed: None.

Section contents:

Example:

+----------+--+--+--+--+--+--+--+--+----------+--+--+--+--+--+--+--+--+----------+--+--+--+--+--+--+--+--+----------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ascii(i) |           3           | ascii(v) |           0           | ascii(m) |          243          | ascii(r) |         -1463         |          3265         |
+----------+--+--+--+--+--+--+--+--+----------+--+--+--+--+--+--+--+--+----------+--+--+--+--+--+--+--+--+----------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Good practices

Here we introduce good practices that are recommended but not mandatory.

Footer

When we are looking for kmer sets, we often need some global statistics and we would like to have them without parsing a whole KFF file. For example, we can need the number of distinct kmer inside of the file. We recommend to add such statistics in a footer 'v' section. To know where this footer starts, the last value must be "footer_size" and its corresponding value.

To read the statistics, go to 23 Bytes before the end of the file [3 (signature) + 8 (value size) + 12 (footer_size string size)]. Then read the 12 bytes name of the value and if it corresponds to "footer_size", read the 8 bytes following value. Finally go back of this value + 3 (signature) from the end of the file. You can now read the footer.

First index block

If your KFF file is fully index a good practice is to have the first section of the file that is a index section. If it is not possible or not practical in your case, you also can indicate its absolute position in the file declaring a value "first_index" in a footer.