Kmer-File-Format / kff-reference

Contains the description of a file format to store kmers and associated values
GNU Affero General Public License v3.0
31 stars 4 forks source link

Byte-oriented format causes lots of misaligned access #12

Open asl opened 3 years ago

asl commented 3 years ago

I'm having a major concern about the design decision to use only byte alignment. IMO, this is quite bad in many ways and would certainly affect and slowdown many applications:

Why should we bother with data alignment when we're just talking about storage format? The quick answer is: memory mapped files. Proper padding and alignment would allow great use of memory mapped files that might be used for many things, but mostly – for reducing I/O overheads and even for async I/O.

I'm proposing to think about two additions / changes:

rchikhi commented 3 years ago

very interesting, thanks much for the input. I'm curious if the slowdown would occur when doing random access to sections, or just all-around slowdown even for sequentially reading the whole file?

asl commented 3 years ago

Well, usually memory-mapped filed allows one to skip additional copies and even process the data in-place. Usually they are much better for random access when one needs to process the whole file as data is loaded in pagesize chunks (certainly, it does not make much sense for mmap'ed I/O when you need only couple of values)

rchikhi commented 3 years ago

I'm trying to pre-empt some of the pushback I expect from Yoann, such as: KFF isn't designed to be an index and in the contrary to be as small as possible, so if padding makes the file 1.5x bigger it might be a net loss. Curious about your thoughts on that aspect

asl commented 3 years ago

I think then the intention of KFF format needs to be clearly specified. If it is meant for general-purpose k-mer storage format, then it should assume different usages, e.g. streaming, random access, memory mappings, etc.

One way to ensure this is to allow some bits of flexibility in the format and allow e.g. sections to start from arbitrary offsets, and / or encode the k-mer width in the metadata.

If the intention is to make the file as small as possible, then this needs to be clearly specified. And more bit-packing tricks should be used :)

rchikhi commented 3 years ago

hehe this is a great comment which points that we overlooked saying our priorities explicitly because we were focusing on the technicals.

Initially I (and likely Yoann too) saw KFF through the prism of succinctness, to demonstrate there were interesting space gains left on the table for existing ways to store kmers. Not quite full-on compression though, nor through engineering tricks (e.g. bit tricks) but through a non-trivial yet economical reorganization of the data. That said, you're right to point out that anyone seeing KFF today will think of it to be general-purpose.

I see a few possible directions, and curious to hear opinions of all involved (@yoann-dufresne @asl @natir @marekkokot @sebastiandeorowicz & others).

There are pros and cons for either: 1) Focus KFF only on space optimization and leave it as that (closing this PR) 2) Have KFFv1 stay as-is and add this flexibility in KFFv2 (losing compatibility with KFFv1) 3) Right now allow some flexibility as @asl suggests, i.e. add kmer width and section padding to the format.

I'll leave the final word to Yoann though.

marekkokot commented 3 years ago

I agree with @asl that the intention of the format should be clearly specified.

I talked with @sebastiandeorowicz and here are our thoughts. We think that the main use case of the format is to read it in a streaming way (eventually reading many sections "at once", but each of them sequentially). If one needs to have for example fast queries, they should load KFF content to a data structure appropriate for the application. In fact, I am not sure how exactly the find query could be done efficiently directly on KFF (there is no info in which section specific k-mer may be present, k-mer may be in a block of larger size). The other aspect is the format simplicity. On the other hand, I am not very experienced with mmapped files, so maybe I am missing something.

In summary, we think that option 2 is reasonable for this moment, although the potential loss of compatibility is a concern.

natir commented 3 years ago

When I started working with @yoann-dufresne on the format I understood that the main objective was to have a storage format that had no assumptions about how it would be used. And if we have to make a trade-off between speed and space, we would prefer space.

If one needs to have for example fast queries, they should load KFF content to a data structure appropriate for the application.

For me KFF is a storage format and not at all a query or index format, so I agree with this sentence.

I'm sure that having 64-bit aligned data is faster than having 8-bit aligned data for currently available CPUs. But I'm not convinced that this time saving is important enough to justify the loss of space. But I'd like to be convinced of the advantages of 64-bit alignment.

It seems to me that option 2 is the most consensual. From my point of view, breaking compatibility is not a problem, as long as users are informed of this (with a major version number change) and a conversion tool is provided.

yoann-dufresne commented 3 years ago

As @rchikhi said, at the beginning we where motivated by the creation of a non compressed format that takes advantage of kmer set reorganization to reduce space. I wrote slow API code to demonstrate the size reduction possibilities. Now, after many discussions with a lot of different people, on top of the previous requirements, I would like the format to be compatible with fast streaming access, in order to be a compact and practical exchange format between tools. But as @natir said, the file format is designed for storage not indexing. So, it will only support log time "find" queries if the sections and kmer/superkmers/sequences are sorted.

So, for now, I think that I don't want to add padding for sequences/data alignment due to file size inflation (But I still can be convinced). Maybe in the next iteration of the format by carefully looking at the space/loading time trade-off? In this version, maybe we can align sections if it is useful. I'm also not mmap expert, so here is a naive question: Can't we just mmap a section that is not page-aligned? I mean, by mmap from the beginning of the page where the section start lays? (We can know the page start address if there is an index in the kff and if we know the page size).