Add support for optional runtime selection of byte-frequency table.

I propose a small change to src/memmem/rarebytes.rs that will allow users to optionally choose a different byte-frequency table at runtime.

I am writing a library that scans for byte patterns in the memory of an executing process. The algorithm used by memchr would be ideal for scanning byte patterns of non-aligned length (i.e. 13 bytes). However, the performance of the algorithm is highly dependent upon the prefilter, which itself is highly dependent on the quality of the byte frequency table used to implement the prefilter. While the table in memchr::memmem::byte_frequencies is very good for most cases, in the specific case of searching through the memory of a binary executable it is less than optimal.

You can see this immediately by just considering the \x00 byte. If you count the most frequent bytes found in binary executables, the `\x00' byte has a frequency that is almost an order of magnitude higher than the next most frequent byte ('H'). This makes sense, since integers use zero bytes for padding (e.g. 5u32 in bytes is [0, 0, 0, 5]).

However, in the default byte frequency table the \x00 byte is not even particularly frequent. This makes perfect sense for most inputs, but results in dramatically less throughput when scanning binary executables.

The main change is to store a static pointer to the 'current frequency table' and use the stored pointer when calculating the rank for a byte in memchr::memmem::rarebytes::rank. Here are some justifications (see comments for more details):

A static global is used to be as un-intrusive as possible on the rest of the API.
A pointer is used so that modifications can be done atomically with AtomicUsize.
AtomicUsize is used due to performance and thread safety.
AtomicPtr is avoided since the pointer that is stored is never mutable.

I have only added one benchmark to prove that it can be worth it to use a custom byte frequency table. Since the assembly for the actual search algorithm has not been changed, I think we just need a benchmark to prove the change does not cause a regression when constructing a search (already in the benchmark suite), and a benchmark to prove that the change can be beneficial (included in the pull request).

Finally, here are the commands I used for benchmarking: To check for no regression: Note: I tried to find the benchmark that spends the most time constructing a finder rather than searching since the search algorithm has not been touched, only the code that constructs a RareNeedleBytes object.

cargo bench -- 'memmem/krate/oneshot/teeny-en/never-john-watson' --save-baseline master
cargo bench -- 'memmem/krate/oneshot/teeny-en/never-john-watson' --save-baseline freq
critcmp freq master

To check for improved throughput:

cargo bench -- 'memmem/krate/binary/*' --save-baseline bin
critcmp bin -g 'memmem/krate/binary/(\w+)'

I am new to open-source collaboration, so any feedback will be much appreciated.

Thank you for the PR and the thorough write-up. But I think it would be better to open a new issue and discuss the use case and brainstorm solutions to it. Namely, introducing shared global mutable state as a public API doesn't sound like a good idea to me. There is a FinderBuilder after all, and that feels like the best place to introduce a new API like this (if at all).

I'd also like to consider options like "make \x00 be considered frequent by default."

Another thing I'm curious about is why \x00 alone being misclassified is a big problem. The prefilter doesn't just look for one rare byte but two, so as long as the second rare byte is somewhat accurate, the prefilter should be okay?

There are unfortunately bigger considerations here too. This particular byte frequency table is actually copied into a few different crates. But there is also no shared common dependency (and I'm not inclined to add one). Having two different copies of this table in active use is probably not a good idea. So this poses a quite thorny problem.

Thanks for responding so quickly! I will create an issue, however I do want to explain a few of the things you mentioned here just for clarity.

Firstly, I wanted to ask what you meant about the byte frequency table being in multiple crates? I am not sure I understand, the original table is untouched, I just added the ability to choose another (I probably am misunderstanding)? Also, everything I mention below is in the context of repeated scanning millions of small, similar needles in a binary executable and its allocated memory.

The \x00 byte. I used this in my explanations as I believe it illustrates the point I was trying to make clearly, but there are a few more things to mention. While the \x00 byte is the most frequent in all binary programs, the rest of the byte frequency distribution depends on the architecture of the machine code being scanned (or language of the bytecode!). This is something known in advance by the user, and likely different for every user. This knowledge can be exploited on a per-case basis for faster searches, which is why I think the entire byte frequency table should be specified by the user.

Even though the prefilter scans 2 bytes, you might only be scanning for 2 common bytes (e.g. a u16). For each byte in the frequency table that has an incorrect frequency the speed of a search is decreased proportional to the error and how frequently the prefilter will match that byte. Consider a user scanning for an arbitrary u32 value. There are some values of the u32 that will contain very rare bytes, and allow the search to finish much faster than scanning for other less rare bytes. The user might even be able to control whether these rare bytes appear in the program's memory or not. Which exact values of the u32 contain rare bytes can change on a per-program basis, and providing a general byte frequency table that can represent this seems impossible to me. If the user might reasonably scan for any valid integer value in a binary program, then I think that even small errors in the frequency distribution can quickly add up across all inputs.

Secondly, the global mutable state. I completely agree with it going in FinderBuilder, and that is actually where I first put it, but I was not able to completely remove the impact on the performance of constructing a Finder with the default frequency table. Also, by changing FinderBuilder (and consequently NeedleInfo, RareNeedleBytes and others), multiple breaking changes needed to be introduced to the public API. Considering how widely used memchr is, this made me quite uncomfortable (its even used by cargo haha). Since customizing the byte frequency table seems like a very niche use-case to me, this further discouraged me from introducing noise in the public API at all.

Finally, I think that the static pointer is icky, and the conversion of pointer to integer is even worse, but there are a few benefits I feel are worth mentioning:

0 breaking changes to the public API (only one public function was added, so all memory layouts are still the same).
Minimal instructions added to the default code path (just a load and a well predicted branch).
AtomicUsize is supported basically everywhere
The only code affected by the global mutable state is the construction of a Finder (details below).

The only case in which this global state might have an ambiguous value is if someone is creating Finders and calling set_byte_frequencies in multiple threads simultaneously (which seems like a bad idea). Since the original frequency table is already a global constant, I imagined that the user would only ever want to call this function once at the start of the program, kind of like saying 'I want to use this specific version of the memchr algorithm', and then proceed to use memchr how they normally would.

I really think you should just create an issue... All of this is stuff we should be talking about in an issue, not in a PR, because this PR is very likely not the solution that will get merged.

Firstly, I wanted to ask what you meant about the byte frequency table being in multiple crates?

It's in aho-corasick and will soon be in regex-syntax. It may actually be in more places too, but that's what I can think of off the top of my head.

All of these things play in concert with one another when it comes time to build prefilters inside the regex crate. If one crate has a different understanding of which bytes are "rare" or not, then its decisions may not mesh well with other crates. This is especially relevant given the implementation path chosen here: it's shared mutable state. So someone could swap out the table used inside of memchr, but without equivalent swaps in other crates, that might make regex behave sub-optimally in unpredictable ways in other places in the program.

Now if we didn't try to swap tables via shared mutable state and instead did it as part of constructing an individual searcher, then we wouldn't have to worry about different crates disagreeing about things too much. Because we could always be sure that the defaults are being used (which are in all sync by virtue of being maintained by the same person).

The \x00 byte. I used this in my explanations as I believe it illustrates the point I was trying to make clearly, but there are a few more things to mention. While the \x00 byte is the most frequent in all binary programs, the rest of the byte frequency distribution depends on the architecture of the machine code being scanned (or language of the bytecode!). This is something known in advance by the user, and likely different for every user. This knowledge can be exploited on a per-case basis for faster searches, which is why I think the entire byte frequency table should be specified by the user.

Oh yes, certainly, I understand the general idea here. The frequency table is indeed a heuristic and it was "trained" on certain corpora. It might do very poorly in other areas. But then again, so will substring search algorithms that pick bytes at fixed offsets too. That is, the use of a byte frequency table is in general no better or worse than other substring algorithms, but performance may of course differ in specific examples.

I am not opposed to making the frequency distribution configurable in some way. It has been on my mind since I first introduced the concept in the regex crate many years ago. But you are actually the first to specifically ask for it. :-) I just think it is quite a thorny issue to make this work. And I wanted to poke at the specific case presented. Just because I agree with the general idea of swapping out the frequency distribution doesn't mean your specific problem has to be solved that way.

Even though the prefilter scans 2 bytes, you might only be scanning for 2 common bytes (e.g. a u16). For each byte in the frequency table that has an incorrect frequency the speed of a search is decreased proportional to the error and how frequently the prefilter will match that byte. Consider a user scanning for an arbitrary u32 value. There are some values of the u32 that will contain very rare bytes, and allow the search to finish much faster than scanning for other less rare bytes. The user might even be able to control whether these rare bytes appear in the program's memory or not. Which exact values of the u32 contain rare bytes can change on a per-program basis, and providing a general byte frequency table that can represent this seems impossible to me. If the user might reasonably scan for any valid integer value in a binary program, then I think that even small errors in the frequency distribution can quickly add up across all inputs.

Yes, I understand that if we always get one of the rare bytes wrong, that increases the chances of the prefilter being ineffective generally. I guess I was more or less looking for a specific example that you're actually facing. (And not one that you've constructed hypothetically. I could of course construct such an example as well. And indeed, I believe there are benchmarks for it too.)

Secondly, the global mutable state. I completely agree with it going in FinderBuilder, and that is actually where I first put it, but I was not able to completely remove the impact on the performance of constructing a Finder with the default frequency table. Also, by changing FinderBuilder (and consequently NeedleInfo, RareNeedleBytes and others), multiple breaking changes needed to be introduced to the public API. Considering how widely used memchr is, this made me quite uncomfortable (its even used by cargo haha). Since customizing the byte frequency table seems like a very niche use-case to me, this further discouraged me from introducing noise in the public API at all.

OK this is really stuff that should be in an issue. Because I would want to know 1) why perf is decreased and by how much and 2) what breaking changes result from this. Like if perf decreased by 1% that seems fine. If it gets twice as slow, then that seems not fine, unless it's only twice as slow when you override the frequency table and stays the same otherwise. In which case, that seems fine.

As far as breaking changes, neither NeedleInfo nor RareNeedleBytes are part of the public API. Without digging into it, I would naively expect the only API change to be some kind of new method on FinderBuilder and that's it as far as public API changes are concerned.

So this is where this should become an issue and we should be talking about the specific API changes and a straw man implementation path.

I can say with near certainty that I'm not going to accept a change with shared mutable state.

Since the original frequency table is already a global constant, I imagined that the user would only ever want to call this function once at the start of the program, kind of like saying 'I want to use this specific version of the memchr algorithm', and then proceed to use memchr how they normally would.

Yes, but if we're going to expose changing the frequency distribution, then a single program might want to search multiple distinct corpora that each want their own frequency distributions. Hell, it might even make sense in some cases to compute the frequency distribution from a random sample during runtime, and then use it for subsequent searches.

I'm currently making an issue that more cleanly explains the stuff I write above, sorry about that. I just wanted to quickly respond to a few things while I take more time to make the issue more presentable.

I now understand completely what you meant by the frequency table being in other crates, and did not consider the impact of that at all! The shared mutable state is certainly much worse if it can interact with things outside of the crate it is defined in. I incorrectly assumed that this table was a private implementation detail.

And I totally agree with everything you stated above, and will address this in the issue I create. I really appreciate all the feedback by the way! :)

P.S. The perf difference I was getting with FinderBuilder was around 10% which I'm pretty sure was a performance bug but I couldn't figure out why, and due to how popular memchr is any perf difference seemed like a sin haha.

BurntSushi / memchr

Add support for optional runtime selection of byte-frequency table. #116