Closed sentrip closed 1 year ago
Thank you, this is a lovely issue! I definitely agree this is a problem we should solve, and I think it's solvable.
One thing that sticks out to me here is that the representation of "byte frequency table" as a [u8; 256] is perhaps itself just an implementation detail that doesn't necessarily need to be exposed. Namely, even internally, the table isn't really used directly. Instead, what's used is a ranking function.

Indeed, I think that all we actually need from the caller is something that takes any byte value as input and returns a rank (in the range 0..=255) as output. Ideally, the contract of the function would be as simple as this: a smaller rank represents bytes that are heuristically assumed to be less frequent. (The reality is that this may not quite be the full story. While the memchr crate doesn't currently do this, it is plausible that there could be checks like "if the rank of the rarest byte is above a certain value, then don't use a prefilter whose perf is tied to the frequency distribution of bytes at all." This means that it may not just be the relative rank that matters but also the absolute rank. Which is... kind of unfortunate because it makes the contract quite a bit more complicated.)
What this means is that we have a fair bit of flexibility in what we could accept. There are possibly more options, but I'll just list the ones I can think of off the top of my head quickly:

1. Accept a fn(u8) -> u8. This is kind of analogous to accepting a &'static [u8; 256] in that it doesn't permit much flexibility. But because it's a function pointer, the caller can technically insert some shenanigans to do various things, including using a frequency table that has been computed at runtime. (Such a thing would look quite similar to your PR implementation I imagine.)
2. Accept an Arc<dyn (FnMut(u8) -> u8) + Send + Sync + UnwindSafe + RefUnwindSafe + 'static>. This requires alloc unfortunately, but otherwise works like (1).
3. Define a trait, HeuristicFrequencyRank, that can be implemented by any type (including zero sized types). But it has the same problem as (2). In order to avoid the type parameter infecting the type signature (which I would want to avoid even if we were designing this API from scratch), we'd have to erase it. And the only way to erase it without introducing a lifetime parameter is with an allocation. So this approach would also require alloc.

It is quite tempting to go with (1) because it's a universal API. If we didn't care about the no-std no-alloc use case, then I think (2) would be my preferred route.
One possible riff on (3) is to add a default type parameter to Finder. It would default to a zero sized type representing the default byte frequency table, but could be overridden to be anything else. This wouldn't require alloc because we would be accepting the infection of the type parameter instead, but kind of hiding it via the default type parameter trick. I believe it is the case that adding a default type parameter to an existing type is a non-breaking change.
There are also more complex ideas, such as combining the options above. For example, we might expose (1) in all configurations but (2) in just configurations that have alloc. But I'm not a huge fan of that because of the complexity, and I do wonder whether it might have a perf impact because of the additional space required to store both things.
I think I like (3) with the default type parameter trick the best? Because it should literally compile down to absolutely zero overhead for any table that is truly static, since it would just be implemented via a zero-sized type. As long as the default type parameter is a non-breaking change (and again I believe it is), it is a lamentable API addition but one that I think I can live with because folks can legitimately completely ignore it and not be worse off. The addition of an entire trait also kind of sucks too, and I honestly hate adding traits to crates unless they're really well motivated. But because of the no-std no-alloc constraint here, our choices are quite limited.
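A minimal sketch of how option (3) combined with the default type parameter trick could hang together. All names here (Ranker, DefaultRank, the toy Finder) are hypothetical stand-ins, not the crate's actual API:

```rust
// Sketch of option (3) plus the default type parameter trick.
// Names are illustrative, not memchr's real API.

/// Smaller rank means "heuristically rarer byte".
trait Ranker {
    fn rank(&self, byte: u8) -> u8;
}

/// Zero-sized type standing in for the built-in frequency table,
/// so the default path carries no extra state or indirection.
#[derive(Clone, Copy, Default)]
struct DefaultRank;

impl Ranker for DefaultRank {
    fn rank(&self, byte: u8) -> u8 {
        byte // stand-in for the real 256-entry table lookup
    }
}

// Adding `R = DefaultRank` to an existing type is non-breaking:
// code that names `Finder` without a parameter keeps compiling.
struct Finder<R = DefaultRank> {
    ranker: R,
}

impl<R: Ranker> Finder<R> {
    fn with_ranker(ranker: R) -> Finder<R> {
        Finder { ranker }
    }

    /// Pick the heuristically rarest byte of the needle.
    fn rarest(&self, needle: &[u8]) -> Option<u8> {
        needle.iter().copied().min_by_key(|&b| self.ranker.rank(b))
    }
}

fn main() {
    let f: Finder = Finder::with_ranker(DefaultRank);
    assert_eq!(f.rarest(b"cab"), Some(b'a'));
}
```

Because DefaultRank is zero-sized and rank is statically dispatched, the default path can compile down to exactly the same code as a hardcoded table.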
(Note: There is currently only a std mode and a no-std no-alloc mode. There ought to be a no-std-but-with-alloc mode too, however it hasn't been added yet because most things in this crate don't benefit from it. The only exception to this currently is the Finder::into_owned API, which is currently only available when std is enabled, but should in theory only require alloc. Otherwise, generally speaking, the only impact std has is that it enables runtime CPU feature detection to unlock SIMD optimizations that utilize AVX2.)
Thanks for the quick and very detailed response!
I agree that HeuristicFrequencyRank (option 3 with the default type parameter) is the most flexible option, and the most likely not to regress performance, given that in the default case the assembly should be very similar or identical to the original implementation, as the zero-sized type should be optimized away (like you mentioned).
the representation of "byte frequency table" as a [u8; 256] is perhaps itself just an implementation detail that doesn't necessarily need to be exposed
I agree 100%. For some reason I completely ignored 'generic' implementations (like traits/functions) probably due to how small the input space is (0..=255) and the many restrictions that come along with no-std no-alloc considerations.
For example, we might expose (1) in all configurations but (2) in just configurations that have alloc. But I'm not a huge fan of that because of the complexity
I am also against that kind of complexity, especially when it results in a lot more cases that need to be tested which I believe is the case here. Also the added complexity is pretty much an implementation detail due to no-std no-alloc rather than being necessary to represent the fundamental idea (mapping u8 -> u8).
I believe it is the case that adding a default type parameter to an existing type is a non-breaking change.
I would imagine that ideally it wouldn't, but in practice it might affect the output, and I honestly do not know enough about how the Rust compiler works yet to make that kind of decision. However, since it seems worth investigating, I will look into it.
it is a lamentable API addition ... I honestly hate adding traits to crates unless they're really well motivated
I also agree with these, and the idea of changing the public API for people that are not interested in this feature makes me uncomfortable.
My first goal will be to implement HeuristicFrequencyRank and the default type parameter with as little impact as possible on the assembly generated for code that uses the default frequency table. I'm going to benchmark this with a specific benchmark that only measures the time to create a Finder and use the selected frequency table once (or as few times as possible).
If I can implement this without affecting the default code path, I will consider it a success. I am still going to investigate methods that do not affect the public API at all; however, if I am unsuccessful with that then I will upload the HeuristicFrequencyRank version as a PR.
It goes against my principles to change the public API (even if it still compiles) AND regress performance (even a small amount) for users that do not care about this feature. It seems to me that this would (rightly) encourage these users to use an older version of the library. Also in rust we like zero cost abstractions. I think I would only feel comfortable introducing something like this if EITHER the public API changes (but not how it is used) OR the default code path is minimally (1%-2%) slower, but not both.
Some more thoughts I had not directly related to what is above, but related to what you said:
Accept a fn(u8) -> u8. This is kind of analogous to accepting a &'static [u8; 256] in that it doesn't permit much flexibility

This is interesting, but I think it might be more problematic than it appears at first glance. If a user is going to do anything based on runtime info to generate the frequency table, then they have to include all of the data they need in some static struct that they use in the fn(u8) -> u8 that is passed along to memchr (similar to the original PR). This involves either slow or unsafe code that ideally the user should not write.
I also think it kind of defeats the purpose of a generic solution if the only way to comfortably use that generic solution is to do the exact same thing you could already do with the non-generic solution. In this case, the fn(u8) -> u8 would pretty much just end up being a wrapper for a static array, unless the user does some ugly stuff themselves. In that case, why not just use a static array directly?
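To illustrate the point, here is a sketch of what a runtime-generated table behind a fn(u8) -> u8 tends to look like. OnceLock stands in for the "slow or unsafe" static plumbing; all names are hypothetical:

```rust
// Sketch: a fn(u8) -> u8 cannot capture state, so a table computed at
// runtime has to live in a static. OnceLock is the safe-but-synchronized
// route; the alternative is a `static mut` and unsafe code.
use std::sync::OnceLock;

static TABLE: OnceLock<[u8; 256]> = OnceLock::new();

fn runtime_rank(byte: u8) -> u8 {
    // Fall back to an identity ranking if the table was never set.
    TABLE.get().map_or(byte, |t| t[byte as usize])
}

fn main() {
    let mut table = [128u8; 256];
    table[0xdd] = 0; // "learned" at runtime: 0xdd is rare
    TABLE.set(table).unwrap();

    // The only shape that fits the fn(u8) -> u8 interface ends up
    // being a thin wrapper over the static array.
    let ranker: fn(u8) -> u8 = runtime_rank;
    assert_eq!(ranker(0xdd), 0);
    assert_eq!(ranker(b'H'), 128);
}
```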
If memchr were to implement the static storage and the unsafe code for just the frequency table, then the user could avoid having any statics/unsafe in their code when generating frequency tables at runtime. However, all of what you mentioned in the original PR made me realize that implementing this inside of memchr is not a good idea, so I don't think I'll be investigating this. I would like to mention some thoughts I had on this for the sake of completeness (skip the next paragraph if that does not interest you):
If memchr had a thread-safe function to change the global byte frequency table, then the user could avoid all static/unsafe code. Furthermore, since the table is only queried during the construction of a Finder, that is the only time that the state of the global is relevant at all. If the purpose of a Finder is to be constructed ahead of time and then re-used for multiple searches, then the user is probably creating finders in a synchronous loop ahead of time, and then using those finders immutably from multiple threads. If the user could call something like set_byte_frequencies before constructing each Finder, which they are likely doing in a small simple loop, then they could arbitrarily modify the frequency table at runtime safely without touching the original code at all. In the original PR I only considered setting a new static byte frequency distribution, not generating arbitrary distributions at runtime, but I think the latter could also be achieved by storing the entire [u8; 256] in a static variable alongside the pointer that stores the frequency table and copying the runtime frequency table if necessary. Regardless, I am not going down this path anymore, but felt it might be helpful/interesting to someone reading this to write some of this down.
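The abandoned "global mutable table" idea could be sketched like this. set_byte_frequencies is hypothetical, not a real memchr API; a RwLock makes the swap thread-safe without any unsafe code on the user's side:

```rust
// Sketch of the global-table idea described above (hypothetical API).
use std::sync::RwLock;

static FREQ: RwLock<[u8; 256]> = RwLock::new([128; 256]);

/// Thread-safe setter the user would call before constructing Finders.
fn set_byte_frequencies(table: [u8; 256]) {
    *FREQ.write().unwrap() = table;
}

/// What Finder construction would consult; after construction, the
/// global's state no longer matters for that Finder.
fn rank(byte: u8) -> u8 {
    FREQ.read().unwrap()[byte as usize]
}

fn main() {
    let mut table = [128u8; 256];
    table[0xdd] = 0; // generated at runtime
    set_byte_frequencies(table);
    assert_eq!(rank(0xdd), 0);
    assert_eq!(rank(0x00), 128);
}
```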
Your approach SGTM.
It goes against my principles to change the public API (even if it still compiles) AND regress performance (even a small amount) for users that do not care about this feature. It seems to me that this would (rightly) encourage these users to use an older version of the library. Also in rust we like zero cost abstractions. I think I would only feel comfortable introducing something like this if EITHER the public API changes (but not how it is used) OR the default code path is minimally (1%-2%) slower, but not both.
Just to give you my feelings on the matter: unlike one-shot routines (such as memmem from libc), this crate provides a way to mitigate the cost of searcher construction. You can build a searcher once and execute searches with it many times. Now obviously, we still want searcher construction to be fast, so you don't have carte blanche here, but a small regression isn't the end of the world.

Basically, in my experience, setting the standard to "doesn't change codegen at all" actually winds up being quite difficult to live up to in practice. Compilers do all sorts of crazy things, and there are cliffs and thresholds everywhere such that even a slight perturbation somewhere has a domino effect that impacts codegen substantially. But that doesn't necessarily mean actual wall clock times are impacted substantially.
So in summary, you probably have a little more wiggle room than you think. And I would be surprised if this trait approach with a default type parameter resulted in a perf change big enough to negate the strategy entirely.
In this case, why not just use a static array directly?

To be clear, the reason is precisely because with a function pointer, the caller can do shenanigans that make it possible to swap out the frequency table for something else at runtime. Asking for a &'static [u8; 256] means that's not possible to do without either unsound APIs or leaking memory. Basically, the function pointer introduces indirection in a way that is compatible with the no-std no-alloc constraint.
The way something like this would be justified is the idea that needing to use a different ranking function than the default is itself quite niche. And thus, forcing the caller to jump through hoops---but hoops that are possible and sound---to achieve their use case is "okay" with me generally speaking. That includes needing to write unsafe, especially when this is probably on the "easier" end of the unsafe spectrum IMO. But in this particular case, if we can make the default type parameter approach work, then I think that's better enough to go that route rather than the more cumbersome fn(u8) -> u8 approach.
Thanks for the insight!
I agree with you about the regressions and was probably exaggerating a bit, but I would much rather have you tell me that than come in here with non-conservative ideas about regressing performance for all users :)
Basically, in my experience, setting the standard to "doesn't change codegen at all" actually winds up being quite difficult
Yeah, you are totally right, and I honestly don't usually stress out too much as long as the benchmarks are doing what I want them to. However, due to how popular memchr is, and how the whole point of the library is speed, I wanted to try everything before making modifications that regress performance.
I do think I've found a good solution. Here's a brief explanation of what I've done (in case you have some thoughts or are curious):
I realized HeuristicFrequencyRank does not need to be stored anywhere; it only exists for the duration of a call to Searcher::new. This allowed me to avoid touching the public API at all. Now there are 2 versions of Searcher::new, where one of them calls the other with a hardcoded value for HeuristicFrequencyRank.
There are only 2 additions to the public API, and no changes:

- the HeuristicFrequencyRank trait
- the FinderBuilder::build_heuristic generic method that accepts a HeuristicFrequencyRank

There are a few tweaks to the internal code but they are also minimal.
Finally, the default code path has (at least on my machine) identical assembly and no performance regression (there is a benchmark to prove this too). Also, the cost of using a custom HeuristicFrequencyRank when constructing a finder comes down to the performance difference between direct and indirect calls in assembly and pretty much nothing else.
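The two-constructor arrangement being described could be sketched like this, with illustrative names and a toy rarest-byte computation standing in for the crate's internals:

```rust
// Sketch of the delegation described above: the ranker is only used
// during construction and is never stored, so the stored type (and
// thus the public API) stays unchanged. Names are illustrative.
trait HeuristicFrequencyRank {
    fn rank(&self, byte: u8) -> u8;
}

/// Zero-sized stand-in for the built-in table.
struct DefaultFrequencyRank;

impl HeuristicFrequencyRank for DefaultFrequencyRank {
    fn rank(&self, byte: u8) -> u8 {
        byte // toy ranking; the real thing is a [u8; 256] lookup
    }
}

struct Searcher {
    rare_byte: u8, // decided at construction time; no ranker stored
}

impl Searcher {
    /// The existing public path: forwards a hardcoded default ranker.
    fn new(needle: &[u8]) -> Searcher {
        Searcher::with_ranker(DefaultFrequencyRank, needle)
    }

    /// The generic path a builder method like build_heuristic would call.
    fn with_ranker<R: HeuristicFrequencyRank>(r: R, needle: &[u8]) -> Searcher {
        let rare_byte = needle
            .iter()
            .copied()
            .min_by_key(|&b| r.rank(b))
            .unwrap_or(0);
        Searcher { rare_byte }
    }
}

fn main() {
    assert_eq!(Searcher::new(b"ba").rare_byte, b'a');
}
```

Because the generic parameter lives only on the constructor, the monomorphized default path can be identical to the pre-existing code.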
P.S.
I believe it is the case that adding a default type parameter to an existing type is a non-breaking change.
You were right about this, I was able to generate identical assembly to the original implementation with the default type parameter implementation.
Ah right, of course! The table is indeed only used during construction and I believe I generally expect that to remain true.
The second build_heuristic method is a little kludgy, but is maybe better than adding a type parameter to FinderBuilder. (Which I think would be required if there were a mutator method to "set" the table.)
The names might need some work too.
The second build_heuristic method is a little kludgy
Yeah, you're right. I was having an issue with incorrect codegen (indirect calls) when storing the trait object in the Searcher, so I kind of ignored the idea of storing the trait and moved on. I will change the implementation to use a mutator method; it shouldn't affect performance at all, I think, since the FinderBuilder can still call both the generic and regular versions of Searcher::new, which was the main problem.
So that only leaves the name HeuristicFrequencyRank. My original inclination was to call it something like ByteFrequencies, but honestly a properly descriptive name seems quite long however you do it. I would imagine the name of the mutator method should be the name of the trait in snake case or something similar, so I guess if that name is good then the method name will also make sense.
Edit: Just realized I misread your comment, apologies.
The reason I thought the second build_heuristic method was better is that it avoids changing the API, but I also agree that a mutator method would be more consistent with the FinderBuilder::prefilter method. Either way, I will try to think of some better names, especially for build_heuristic.
Sorry for the radio silence; I've had a lot of things to do this week, and since I implemented the HeuristicFrequencyRank version in a local fork, I have just been using that.
I have documented and commented the code and am ready to upload a pull request, but there are a few more things I wanted to mention first, and I'd also like to list some names I have thought of to see if we can choose the best one. However, if you feel we should continue the discussion in a PR, I will upload what I have now and we can continue there.
Firstly, when I implemented the mutator method with the default type parameter for setting a custom HeuristicFrequencyRank, I ran into what appears (to me at least) to be a fundamental problem.
The following line is in Finder::new, but is part of the public API and could be written by a user:
FinderBuilder::new().build_forward(needle)
When FinderBuilder is generic, the line above causes an error, regardless of the default type parameter (it requires ::<T>). The following fix illustrates the problem:

type T = FinderBuilder;
T::new().build_forward(needle)

I believe the reason is that in an expression, FinderBuilder is treated as a generic type whose parameter must be inferred (the default does not apply), whereas in a type position FinderBuilder is parsed based on what comes after, and the default fills in. As far as I can tell this is a breaking change to the public API, and I cannot think of any way to resolve this.
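A minimal reproduction of the problem, with illustrative names. In an expression, the default type parameter is not used for inference, so FinderBuilder::new() is ambiguous when nothing constrains the parameter; a type alias resolves the default in type position:

```rust
// Minimal repro of the default-type-parameter inference problem.
// Names are illustrative, not memchr's actual types.
use std::marker::PhantomData;

struct DefaultRank;

struct FinderBuilder<H = DefaultRank> {
    _ranker: PhantomData<H>,
}

impl<H> FinderBuilder<H> {
    fn new() -> FinderBuilder<H> {
        FinderBuilder { _ranker: PhantomData }
    }
    fn build_forward(&self, needle: &[u8]) -> usize {
        needle.len() // stand-in for real Finder construction
    }
}

fn main() {
    // This would NOT compile:
    //     FinderBuilder::new().build_forward(b"x");
    // it errors with "type annotations needed" because H is left to
    // inference in expression position and nothing constrains it.
    //
    // The workaround shown in the comment above:
    type T = FinderBuilder; // default applies in type position
    let n = T::new().build_forward(b"x");
    assert_eq!(n, 1);
}
```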
Also, there were a few other warts that made me feel worse about the default type parameter implementation (could just be me though). Off the top of my head:

- needing to implement Default manually for FinderBuilder<H> to avoid a HeuristicFrequencyRank: Default bound
- needing to change the FinderBuilder methods to move-only (i.e. fn x(self) { ... }) to avoid a HeuristicFrequencyRank: Copy bound
- needing a conversion when setting a custom ranker (FinderBuilder<DefaultH> -> FinderBuilder<CustomH>)

All in all I prefer the build_heuristic route due to its minimalism, but if you have some more suggestions I am happy to hear them.
That being said, after staring at my code for a few days, I have to say I am definitely not a fan of the names at the moment (both HeuristicFrequencyRank and build_heuristic).
One idea I had is to just call it Heuristic. This would simplify and standardize all of the names, but is not very descriptive. However, I feel the argument can be made that memchr just uses the one heuristic, so if this heuristic is namespaced (memchr::memmem::Heuristic) then in theory it shouldn't be ambiguous.
Furthermore, since it is such a niche feature, if someone intends to use it properly then they will likely need a much deeper understanding of what Heuristic does than can be gained from a more descriptive name. You mentioned that I'm the first person to ask for this in the many years this library has existed, and my intention (and I imagine the intention of others like me) is to put code that uses the Heuristic API in some library code that can be used without worrying about those specifics. However, I am not sure I am 100% convinced by this argument myself, but I felt it was definitely worth mentioning. Some more ideas I've had for the names:
For HeuristicFrequencyRank:

- HeuristicFrequency (there is a rank method)
- ByteFrequencies/ByteFrequency
- RareBytes (public counterpart to RareNeedleBytes?)
- RarityHeuristic/HeuristicRarity (shorter, and memchr only deals with bytes anyway)
- ByteHeuristic (more descriptive than Heuristic, similar reasoning)

build_heuristic is a bit harder I think:
- build_heuristic is ambiguous (and does not necessarily imply forward)
- build_forward_heuristic is still kind of ambiguous
- build_forward_with_heuristic is very long
- build is ambiguous and would be confusing since this is not a 'common' use case
- heuristic or with_heuristic are inconsistent with build_forward/build_reverse

I also think the issues above would still apply if heuristic were replaced with another word.
Can you throw up what you have? I'll take a look. If the default type parameter path is indeed a dead end, then yeah, I think something like build_heuristic will do fine.
As for the naming... I'm not quite sure. I think the issue is using "heuristic" absent any other context. There are lots of heuristics that go into memmem here, so it isn't a particularly discriminating name. The "rank" concept is really the key ingredient for this option, so the word "rank" or a synonym should be involved somewhere here.
A slightly longer name is also acceptable to me personally since this is going to be a very infrequently used feature.
So I'm thinking things like... build_with_ranker or build_with_byte_ranker or something along those lines.
Yes I will upload what I have.
There are lots of heuristics that go into memmem here
Yes, of course; I meant there is only one that the public can touch directly. But I agree completely, which is part of the reason I wasn't convinced either.
Just to say, build_with_ranker sounds good to me, especially if combined with a trait that has byte in the name, but I like both better than build_heuristic.
Note that control over the heuristic is now available in memchr 2.6: https://docs.rs/memchr/latest/memchr/memmem/struct.FinderBuilder.html#method.build_forward_with_ranker
memchr implements a generic SIMD accelerated search that is ideal for implementing something like CheatEngine, where you scan the memory of an executable process to aid reverse engineering. This process involves repeatedly scanning for possibly millions of small values (u16, i32, ...) in the memory of that process. The user might have information ahead of time about the frequency distribution of bytes in the memory being scanned, which may vary wildly between executables. The user might also be able to control the program to ensure certain rare bytes appear in the program's memory at certain times.

There is an issue that prevents memchr from performing optimally when scanning binary executables: the byte frequency table. The core algorithm is based on detecting rare bytes at specific positions in the haystack (the prefilter) and then testing these matches to check if the needle has been found. As mentioned in the incredibly detailed comments, the performance of this algorithm is highly dependent on the byte frequency table used to determine what is a rare byte. While the table that is included in memchr is optimal for the majority of cases, there are some specific data types that have very different byte frequency distributions, which causes memchr to perform worse on those inputs than it otherwise might with a different byte frequency table. Consider the frequency ranks of the bytes \x00, \xdd, \x8b and H (where ideal is the ideal frequency for an x86 binary):

[table comparing the memchr rank and the ideal rank for each of these bytes; the values did not survive extraction]
Now, consider scanning for the needle H\x00\xdd\x8b in an x86 binary. memchr would identify \x00 and \x8b as the rarest bytes, when they are in fact common bytes. Even if memchr considered \x00 to be a frequent byte via configuration, it would still choose H and \x8b as the rarest bytes, which are both much more common than \xdd, the only actually rare byte. This would result in a lot of unnecessary false positives, decreasing the throughput. This is a simple case, but it is easy to extend this idea to many other pathological input sequences that defeat the default frequency table, and might also reasonably appear in an executable or be scanned for by a user.

Now consider a haystack that contains HHH\x00\xdd\x8b. The user might know in advance that searching for HHH\x00 and searching for H\x00\xdd\x8b will both return a single unique match, the sub-slice that was mentioned earlier (the exact indices are not identical, but that is not the point). The user might also know that \xdd is a very rare byte in their dataset. The user should be able to choose to scan for \xdd instead of a more common byte to speed up their searches. I cannot imagine how to support something like this without providing the user a mechanism for customizing the byte frequency table.

The proposed solution is to allow the user to specify the byte frequency table at runtime by modifying the memchr::memmem::rarebytes::rank function. Currently, this function reads from the global byte frequency table.

My first idea was to create an enum that can be provided to a FinderBuilder and then forwarded to RareNeedleBytes to choose the table. This enum can be stored in the NeedleInfo struct and used at runtime to determine which byte frequency table to use. However, this introduces the lifetime 'a, which may or may not be the same as the needle ('n) and haystack ('h) lifetimes that are stored in related structs. Considering lifetime 'a to be separate and different requires the public API of Finder to be changed to add this lifetime.

I believe that the extra lifetime might make life more difficult for the compiler, which is why I observed a small but noticeable (around 10%) impact on the performance of constructing a Finder with the default frequency table on my local machine. Also, by introducing a new member on the NeedleInfo struct, the size/alignment properties of Finder, Searcher and NeedleInfo changed, which also might be the reason for the performance impact I observed (if this sounds crazy to anyone, I suggest you take a look at the wonderful performance talk by the legend Emery Berger titled 'Performance Matters' for more details: https://www.youtube.com/watch?v=r-TLSBdHe1A).

An idea to remove the generic lifetime from ByteFrequencies is to store a static table instead. However, I believe this static API is logically inconsistent with the FinderBuilder API. You can construct millions of unique Finders at runtime and then discard them later, but the same cannot be said for static arrays. Also, the user might want to perform analysis of their specific corpus at runtime to generate a specialized byte frequency table (like 'pre-training'). This is a very interesting use case in the context of the analysis of binary executables, as there is a lot of information that can only be obtained at runtime and can be useful in optimizing many kinds of searches. Forcing the user to use a static byte frequency table would necessarily prevent this use-case.
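The two enum shapes discussed above might have looked something like this. The original snippets are not shown here, so names and shapes are a hypothetical reconstruction that mirrors the prose:

```rust
// Hypothetical reconstruction of the two enum shapes discussed above.

// First idea: borrow the caller's table, which introduces the
// lifetime 'a that then infects NeedleInfo and Finder.
enum ByteFrequencies<'a> {
    Default,
    Custom(&'a [u8; 256]),
}

impl<'a> ByteFrequencies<'a> {
    fn rank(&self, byte: u8) -> u8 {
        match self {
            ByteFrequencies::Default => byte, // stand-in for the built-in table
            ByteFrequencies::Custom(table) => table[byte as usize],
        }
    }
}

// Static variant: no lifetime parameter, but runtime-generated tables
// are ruled out short of leaking memory.
#[allow(dead_code)]
enum StaticByteFrequencies {
    Default,
    Custom(&'static [u8; 256]),
}

fn main() {
    let mut table = [128u8; 256];
    table[0xdd] = 0; // mark 0xdd as rare
    let freq = ByteFrequencies::Custom(&table);
    assert_eq!(freq.rank(0xdd), 0);
    assert_eq!(ByteFrequencies::Default.rank(7), 7);
}
```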
Another idea to remove the generic lifetime and also allow runtime generation of the byte frequency table:
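The owning variant being described could be sketched as follows (hypothetical; Box removes the lifetime and permits runtime-generated tables, at the cost of requiring an allocator, and Rc/Arc would have the same requirement):

```rust
// Hypothetical sketch of the owning, runtime-capable variant.
enum ByteFrequencies {
    Default,
    Custom(Box<[u8; 256]>),
}

impl ByteFrequencies {
    fn rank(&self, byte: u8) -> u8 {
        match self {
            ByteFrequencies::Default => byte, // stand-in for the built-in table
            ByteFrequencies::Custom(table) => table[byte as usize],
        }
    }
}

fn main() {
    // A table generated at runtime can now be owned by the enum...
    let mut table = Box::new([128u8; 256]);
    table[0xdd] = 0;
    let freq = ByteFrequencies::Custom(table);
    // ...at the cost of one heap allocation (and a copy) per table.
    assert_eq!(freq.rank(0xdd), 0);
}
```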
However, an issue with this approach is that the ByteFrequencies enum has a size of 16 bytes, which is mostly wasted. Another issue is that it seems that conceptually we should be passing around some kind of reference to a byte table that can be reused, instead of copying the table for each construction, but that ultimately depends on benchmarks. Also, now the standard library and memory allocation are required for an operation that is unrelated to both of those things (Rc, Arc and others have similar issues).

I also tried storing the byte table inline, but this had disastrous results on performance. This is probably because this extra storage pushed important members of related structs into new cache lines, which affected subsequent operations on these members.
One thing I have not tried yet but might be interesting is re-organizing the members and memory layout of any struct that stores a ByteFrequencies object. This might allow using an inline byte frequency table, for example, but would likely result in breaking changes to the layout of public structs in memchr. Even just introducing the ByteFrequencies object already changes the memory layout of certain structs, and I am not sure whether that is something undesirable or not.

All of this culminated in the pull request I submitted, but I realize now it is better to just lay it all out here and figure out the best path forward together. I appreciate any feedback you may have on these suggestions.
P.S. I think memchr is an incredible library, and the code quality and detail of documentation definitely helped me greatly in understanding the internals and even being able to suggest this in the first place, so kudos.