I'm trying to create a public instance of mquery again. After setting up a mid-size instance (a few TBs, on HDD) I've noticed that some ursadb queries run much slower than I would expect. I suspect that I've introduced a few performance regression over the last 1.5 years (I didn't have sufficiently large dataset to test. I would like to find and fix all performance regressions, and hopefully make the performance better than ever before.
The solution
This is a metaissue to track ideas and work being done. I will create separate issues later.
Early tests suggest that the biggest problem (on HDD) is slow disk read and seek times. We should limit the number of read operations in this case. This isn't as big problem on SSD, but disk IO is still at least 50% of query time - improving it will be nice.
Issues related to scientific testing and a benchmark suite
[x] Create a benchmarking utility to test performance improvements/regressions in a scientific way (this is possible thanks to counters introduced in ursadb 1.4) (I already have a tool for this, but I need to clean up the code and publish it somewhere)
[x] Create a benchmarking suite to test various ursadb versions in a scientific way. (I plan to use yara rules from signature-base)
[ ] Check how much disk speed influences ursadb speed. Evaluate potential disk speed/precision tradeoffs. Especially evaluate differences between performance on HDD and SSD (ideally including the cloud environments, for example gp2)
[ ] Write down and publish the benchmarking results somewhere. We can start with a simple static page. Final findings may be saved in the documentation or some separate pdf file.
Issues related to things that can be fixed (all changes here should be benchmarked before merging to master). I also didn't think this all through yet - some of the ideas probably don't make sense.
[x] Idea: cache primitive (string) queries. It's easy to write a yara rule, where a string is evaluated twice or more. We should cache string query results and reuse them (OnDiskDataset::query)
[x] Idea: rethink querygraphs (QueryGraph.cpp). They are a speed-precision tradeoff, favouring precision. This may not be the best idea - benchmark them on some real data.
[x] Idea: alternative to querygraph is a simple linear query. This mostly is how ursadb works now (because mquery is not aware of some more complex ursadb features). If we go that way, ensure that we don't do read requests unnecessarily for overlapping ranges - for example select "abcde" on indexes gram3/text3 will be split into ("abc" & "bcd" & "cde") & ("abcd" & "bcde") and this is 100% pointless and wasting CPU and disk time. This is a very obvious optimisation, but doesn't play with querygraphs in their current form nicely. #191
[x] Idea: cache raw index n-gram queries. This sounds simple, but doing this naively will eat up ungodly amounts of RAM. For example, we should only cache n-gram queries that will actually be reused. Maybe we can also optimise common subexpressions? For example "abcdefghijklm" | "abcdefghijkln" is, in theory, equivalent to "abcdefghijkl" & ("klm" | "kln") Can we do this elegantly?
[ ] Idea: rethink hash4 index. How much does it really help on big datasets? Especially compare this to the index size (it's as big as gram3 index). Write down the results in the documentation.
[ ] Idea: rethink index on disk storage format. Smaller storage = cheaper disks, better cache usage, and better bang for buck with SSDs. I don't think I can invent a better storage format myself, but maybe someone smart already published some research on that topic. This is a very speculative and low priority improvement idea.
Caveats
That's all I have for now. Most of the issues here circle around how to optimise the number of reads and disk IO. There may be other areas of improvment (for example, how to cut down the number of "ors" or "ands"), but I didn't think about that yet. I also didn't think about optimising the "constant", aka making the individual and/or//minof operations faster. I actually think they're implemented in a quite performant way, and there's not much we can do to make them much faster.
Ursadb benchmarking utility created as a separate repo: https://github.com/msm-code/ursa-bench/ . Initial evaluations are being worked on (they'll be linked in the appropriate PRs)
The problem
I'm trying to create a public instance of mquery again. After setting up a mid-size instance (a few TBs, on HDD) I've noticed that some ursadb queries run much slower than I would expect. I suspect that I've introduced a few performance regression over the last 1.5 years (I didn't have sufficiently large dataset to test. I would like to find and fix all performance regressions, and hopefully make the performance better than ever before.
The solution
This is a metaissue to track ideas and work being done. I will create separate issues later.
Early tests suggest that the biggest problem (on HDD) is slow disk read and seek times. We should limit the number of read operations in this case. This isn't as big problem on SSD, but disk IO is still at least 50% of query time - improving it will be nice.
Issues related to scientific testing and a benchmark suite
Issues related to things that can be fixed (all changes here should be benchmarked before merging to master). I also didn't think this all through yet - some of the ideas probably don't make sense.
select "abcde"
on indexes gram3/text3 will be split into("abc" & "bcd" & "cde") & ("abcd" & "bcde")
and this is 100% pointless and wasting CPU and disk time. This is a very obvious optimisation, but doesn't play with querygraphs in their current form nicely. #191"abcdefghijklm" | "abcdefghijkln"
is, in theory, equivalent to"abcdefghijkl" & ("klm" | "kln")
Can we do this elegantly?Caveats
That's all I have for now. Most of the issues here circle around how to optimise the number of reads and disk IO. There may be other areas of improvment (for example, how to cut down the number of "ors" or "ands"), but I didn't think about that yet. I also didn't think about optimising the "constant", aka making the individual and/or//minof operations faster. I actually think they're implemented in a quite performant way, and there's not much we can do to make them much faster.