Kmer-File-Format / kff-cpp-api

A C++ API to read and write kff files
GNU Affero General Public License v3.0

Faster Kff_reader and CMake version #14

Closed (jltsiren closed this issue 1 year ago)

jltsiren commented 1 year ago

I started using this API in vg (vgteam/vg#3844), and I noticed a couple of issues.

First, reading a KFF file with Kff_reader can be quite slow. For example, KMC outputs each kmer as a separate block by default. Even if you read the kmers block-by-block, you apparently end up doing two read() calls for each kmer. Buffering should solve the issue.

Second, CMakeLists.txt says it requires CMake version 3.19, which is fairly new. For example, Ubuntu 20.04 LTS ships with CMake 3.16, which means that this API does not compile with the default tools found on many servers. The file seems fairly basic, so I believe it should work with older versions as well.

I could try doing a PR myself, but I'm not confident I understand how the minimizer sections work, so I could accidentally end up breaking something.

yoann-dufresne commented 1 year ago

Hi @jltsiren,

Yes, the KFF API can be slow. I will look into the performance issue you point out here. My goal is to improve the API speed while I or someone else is using it in projects that require performance, so that I can run speed tests on real use cases.

As for CMake, I will have a look at what differs between versions 3.16 and 3.19.

yoann-dufresne commented 1 year ago

I hope that answers your questions. If you need other features, do not hesitate to contact me.

jltsiren commented 1 year ago

Thank you! Now that CMake version 3.16 works, I got the CI tests to run.

Branch read_contiguous helped a bit, but the difference was pretty small. I guess the real issue is how the high-level API nudges users towards writing strictly sequential code that focuses on the general case. I have some ideas for optimizing the kmer handling code on our side. I'll return to the topic once I have a better understanding of the bottlenecks.

yoann-dufresne commented 1 year ago

The optimization is now available on the main and dev branches.