CERT-Polska / ursadb

Trigram database written in C++, suited for malware indexing
BSD 3-Clause "New" or "Revised" License

[META] Ursadb performance improvements #190

Open msm-code opened 1 year ago

msm-code commented 1 year ago

The problem

I'm trying to create a public instance of mquery again. After setting up a mid-size instance (a few TBs, on HDD) I've noticed that some ursadb queries run much slower than I would expect. I suspect that I've introduced a few performance regressions over the last 1.5 years (I didn't have a sufficiently large dataset to test on). I would like to find and fix all of them, and hopefully make the performance better than ever before.

The solution

This is a metaissue to track ideas and work being done. I will create separate issues later.

Early tests suggest that the biggest problem (on HDD) is slow disk read and seek times. We should limit the number of read operations in this case. This isn't as big a problem on SSD, but disk IO is still at least 50% of query time there, so reducing it would still be worthwhile.
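To make "limit the number of read operations" concrete, here is a minimal sketch of one possible approach: merging adjacent or nearly adjacent byte ranges before issuing reads. It is not taken from ursadb. The function name `coalesce_reads`, the `(offset, length)` representation, and the `max_gap` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation of a pending read: (offset, length) in bytes.
Range = Tuple[int, int]


def coalesce_reads(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge byte ranges that overlap or are separated by at most max_gap bytes.

    Fewer, larger reads trade a little wasted bandwidth (the gap bytes)
    for fewer seeks, which is the expensive part on an HDD.
    """
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset <= last_end + max_gap:
                # Extend the previous range instead of issuing a new read.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


requests = [(0, 100), (100, 50), (4096, 10), (160, 20)]
print(coalesce_reads(requests))              # [(0, 150), (160, 20), (4096, 10)]
print(coalesce_reads(requests, max_gap=16))  # [(0, 180), (4096, 10)]
```

A larger `max_gap` means fewer reads but more bytes that are read and thrown away. The right value would have to come out of benchmarks on the actual hardware.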

Issues related to scientific testing and a benchmark suite

Issues related to things that can be fixed (all changes here should be benchmarked before merging to master). I haven't thought all of this through yet, so some of the ideas probably don't make sense.

Caveats

That's all I have for now. Most of the issues here revolve around reducing the number of reads and the amount of disk IO. There may be other areas for improvement (for example, how to cut down the number of "ors" or "ands"), but I haven't thought about that yet. I also haven't thought about optimising the "constant", i.e. making the individual and/or/minof operations faster. I actually think they're implemented in a fairly performant way, and there's not much we can do to make them much faster.
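For readers unfamiliar with the and/or/minof operations mentioned above, the sketch below shows their semantics on lists of file IDs. It is only an illustration, not how ursadb implements them. The function names are hypothetical, and it assumes "minof" means "keep files that match at least N of the sub-queries".

```python
from collections import Counter
from typing import List, Sequence


def query_min_of(n: int, lists: Sequence[Sequence[int]]) -> List[int]:
    """Return sorted file IDs that appear in at least n of the input lists."""
    counts: Counter = Counter()
    for ids in lists:
        # set() guards against a file ID being listed twice in one input.
        counts.update(set(ids))
    return sorted(fid for fid, seen in counts.items() if seen >= n)


def query_or(lists: Sequence[Sequence[int]]) -> List[int]:
    """OR is the special case 'at least 1 of'."""
    return query_min_of(1, lists)


def query_and(lists: Sequence[Sequence[int]]) -> List[int]:
    """AND is the special case 'all of'."""
    return query_min_of(len(lists), lists)


a, b, c = [1, 2, 3, 5], [2, 3, 4], [3, 5, 6]
print(query_and([a, b, c]))        # [3]
print(query_or([a, b, c]))         # [1, 2, 3, 4, 5, 6]
print(query_min_of(2, [a, b, c]))  # [2, 3, 5]
```

Every operand adds one more list that has to be fetched from disk before it can be combined, so cutting down the number of "ors" and "ands" in a query also cuts reads.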

msm-code commented 1 year ago

I've created an ursadb benchmarking utility as a separate repo: https://github.com/msm-code/ursa-bench/ (initial evaluations are in progress and will be linked in the appropriate PRs).