Faster Kff_reader and CMake version

Kmer-File-Format / kff-cpp-api

A C++ API to read and write kff files

GNU Affero General Public License v3.0

Faster Kff_reader and CMake version #14

Closed (jltsiren closed 1 year ago)

jltsiren commented 1 year ago

I started using this API in vg (vgteam/vg#3844), and I noticed a couple of issues.

First, reading a KFF file with Kff_reader can be quite slow. For example, KMC outputs each kmer as a separate block by default. Even if you read the kmers block-by-block, you apparently end up doing two read() calls for each kmer. Buffering should solve the issue.
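
To illustrate the buffering idea, here is a minimal sketch. `BufferedSource` and `count_refills` are hypothetical names for this example and are not part of kff-cpp-api. The sketch assumes a plain `std::istream` as the underlying source and simply counts how often that stream is touched when many small records are read through one larger buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of kff-cpp-api): serves many small reads
// from one in-memory buffer that is refilled from the stream in big chunks.
class BufferedSource {
public:
    BufferedSource(std::istream& in, std::size_t capacity)
        : in_(in), buffer_(capacity), pos_(0), end_(0), refills_(0) {
        assert(capacity > 0);
    }

    // Copies up to `n` bytes into `out` and returns the number of bytes copied.
    // A short count means the input ended.
    std::size_t read(char* out, std::size_t n) {
        std::size_t copied = 0;
        while (copied < n) {
            if (pos_ == end_ && !refill()) break;  // buffer empty and no more input
            const std::size_t chunk = std::min(n - copied, end_ - pos_);
            std::memcpy(out + copied, buffer_.data() + pos_, chunk);
            pos_ += chunk;
            copied += chunk;
        }
        return copied;
    }

    // Number of read calls issued to the underlying stream so far.
    std::size_t refills() const { return refills_; }

private:
    // Refills the buffer; returns false when the stream has no more data.
    bool refill() {
        in_.read(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        ++refills_;
        pos_ = 0;
        end_ = static_cast<std::size_t>(in_.gcount());
        return end_ > 0;
    }

    std::istream& in_;
    std::vector<char> buffer_;
    std::size_t pos_;
    std::size_t end_;
    std::size_t refills_;
};

// Reads `data` in fixed-size records until the input ends and reports how
// many times the underlying stream was read (including the final empty read).
std::size_t count_refills(const std::string& data, std::size_t record_size,
                          std::size_t capacity) {
    assert(record_size > 0);
    std::istringstream in(data);
    BufferedSource source(in, capacity);
    std::vector<char> record(record_size);
    while (source.read(record.data(), record_size) == record_size) {
    }
    return source.refills();
}
```

With small records and a buffer of a few kilobytes or more, the number of reads on the underlying stream depends on the file size and buffer size rather than on the number of kmers. Whether this matches the actual bottleneck in `Kff_reader` would need to be confirmed by profiling.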

Second, CMakeLists.txt says it requires CMake version 3.19, which is fairly new. For example, Ubuntu 20.04 LTS ships with CMake 3.16, which means that this API does not compile with the default tools found on many servers. The file seems fairly basic, so I believe it should work with older versions as well.

I could try doing a PR myself, but I'm not confident I understand how the minimizer sections work, so I could accidentally end up breaking something.

yoann-dufresne commented 1 year ago

Hi @jltsiren,

Yes, the KFF API can be slow. I will look into the performance issue you point out here. My goal is to improve the API's speed while I or someone else is using it in projects that require performance, so that I can run speed tests on real use cases.

For CMake, I will have a look at the differences between versions 3.16 and 3.19.

yoann-dufresne commented 1 year ago

I rolled back the CMake version to 3.16. I'll support it until the end of Ubuntu 20.04 support (April 2025). Then I will move to 3.22 if I do not need advanced features.
I started investigating the file reading speed. As you suggested, I replaced the two read calls with a single one, and I now get a speedup of 1.5x. It would be nice if you could check the speedup on your side using the branch "read_contiguous". Be aware that the branch is not fully safe yet: I still have to run tests on real data to check that the output did not change.
Be careful when using the get_var function. Because it relies on unordered_map, it is very slow when called repeatedly. You can access k, data_size and max directly as properties of the reader object. In future versions, I will replace unordered_map with external libraries.
For further speedup I will have to rewrite my file handling code. I'll do it in a few weeks, to get a clean buffered file with page reading instead of file chunks.
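
The get_var advice can be sketched as follows. `ToyReader` is a hypothetical stand-in that stores variables in a string-keyed `std::unordered_map`; it is not the real `Kff_reader` class, and its `get_var` signature is an assumption made for this example. The arithmetic (a block of n kmers of length k spans n + k - 1 nucleotides) is only there to give the loop something to compute.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a reader with string-keyed variables.
// It is NOT the Kff_reader class from kff-cpp-api.
struct ToyReader {
    std::unordered_map<std::string, std::uint64_t> vars;

    // Looks a variable up by name; hashes the string on every call.
    std::uint64_t get_var(const std::string& name) const {
        auto it = vars.find(name);
        assert(it != vars.end());
        return it->second;
    }
};

// Slow pattern: builds a std::string and does a hash lookup for every block.
std::uint64_t total_bases_slow(const ToyReader& reader,
                               const std::vector<std::uint64_t>& kmers_per_block) {
    std::uint64_t total = 0;
    for (std::uint64_t n : kmers_per_block) {
        total += n + reader.get_var("k") - 1;
    }
    return total;
}

// Faster pattern: look the value up once and reuse the local copy.
std::uint64_t total_bases_fast(const ToyReader& reader,
                               const std::vector<std::uint64_t>& kmers_per_block) {
    const std::uint64_t k = reader.get_var("k");  // single lookup, hoisted out of the loop
    std::uint64_t total = 0;
    for (std::uint64_t n : kmers_per_block) {
        total += n + k - 1;
    }
    return total;
}
```

Both functions return the same result; the second one just avoids repeating the lookup. Reading k directly from the reader object, as suggested above, achieves the same effect without any map access.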

I hope that answers your questions. If you need other features, do not hesitate to contact me.

jltsiren commented 1 year ago

Thank you! Now that CMake version 3.16 works, I got the CI tests to run.

Branch read_contiguous helped a bit, but the difference was pretty small. I guess the real issue is how the high-level API nudges towards writing strictly sequential code that focuses on the general case. I have some ideas for optimizing the kmer handling code on our side. I'll return to the topic once I have a better understanding of the bottlenecks.

yoann-dufresne commented 1 year ago

The optimization is now available on the main and dev branches.