Closed BurntSushi closed 1 year ago
I will try to port the Python package to this release in a branch, and report back.
Oh that would be great, thank you!
Is there some kind of changelog, or should I have a look at the Git history?
@vks I don't keep a changelog for this crate. I can summarize some of the differences here, but I'm definitely interested in a holistic look at the API that is provided on master. The git history probably won't help much, because a lot of the work happened in one commit. Anyway, here is a summary:
- A new feature, `std`, has been added and is enabled by default. This makes the standard library optional and permits using `aho-corasick` in core/alloc-only contexts. But you do need `alloc`.
- `AhoCorasick` will mostly behave the same as before. `AhoCorasick::new` is now fallible. The type is no longer generic. (In 0.7, you can choose the size of the state ID to use, but I did away with this and now just use `u32` everywhere.)
- Search routines like `AhoCorasick::find` now accept an `impl Into<Input>`, where an `Input` provides a way to configure a search. There is an `impl From<H: AsRef<[u8]>> for Input`, so you can still call `find` with types like `&str` and `&[u8]`. But `Input` lets you set the bounds of the search, run an anchored search and so on. In 0.7, for example, in order to run an anchored search you had to configure it as a build-time parameter of the `AhoCorasick` automaton.
- The `AhoCorasick` type can now be configured to support unanchored and anchored searches simultaneously using `StartKind`. It defaults to just unanchored searches.
- The `supports_overlapping` and `supports_stream` routines were removed in favor of `try_*` search APIs. I could add the predicates back if needed, but they were principally useful to be able to guarantee that panics didn't happen when you called a search API with an unsupported configuration. But now there are fallible APIs for that.
- There are new submodules: `automaton`, `dfa` and `nfa`. These provide low level access to different types of Aho-Corasick automata, and are not expected to be used for most use cases. The key thing they unlock, by implementing the `automaton::Automaton` trait, is the ability to walk the automaton one state at a time. For example, you can now write your own search routines. This might unlock more interesting streaming use cases.
- The `packed` submodule remains largely unchanged.

In terms of implementation, things largely remain the same with one very (IMO) awesome addition: there is a new type of Aho-Corasick automaton called a "contiguous NFA." The old NFA has been rebranded as a noncontiguous NFA. (And the old DFA is still called a DFA.) The contiguous NFA uses a lot less memory and is typically quite a bit faster than the noncontiguous NFA. It represents a very nice middle point between the slower old NFA and the piggish DFA. Indeed, most `AhoCorasick` configurations in the 1.0 release will use this new contiguous NFA.
You should consider adding an `alloc` feature so that you can add non-allocating APIs in the future without a breaking change. For now, you could compile out everything when the feature is disabled.
@Darksonn I thought about that, and I might still do it, but I don't see a core-only mode happening. The only way it happens is if I had zero-copy deserialization APIs, and the only way that happens is with API changes. So I'd need an `aho-corasick` 2.0 anyway for something like that.
Level of familiarity: I used this library for a toy program once. When I started reviewing I already knew its purpose but I had only a vague vision of both internals and externals. I can only guess what real-world users do and don't care about.
For this pass I only read docs and a little library code. I didn't yet try it out by writing code.
I wrote down everything I noticed, some of it's probably too subjective or missing the point.
> At an API level, enabling `std` generally just enables `std::error::Error` trait impls for the various error types.

Looks like it also enables the stream methods?
> This experiment very strongly argues that a contiguous NFA is often the best balance in terms of resource usage. It takes a little longer to build, but its memory usage is quite small. Its search speed (not listed) is also often faster than a noncontiguous NFA, but a little slower than a DFA. Indeed, when `AhoCorasickKind::Auto` is used (which is the default), a contiguous NFA is used in most cases.
What's the catch? Are there situations where a noncontiguous NFA has a much greater advantage than a 10% shorter build time? Would I realistically ever want to choose it explicitly or is it a legacy/unlikely niche thing? "Often the best" and "often faster" sound a little tentative.
OK, so after writing that I read the code and saw that building a contiguous NFA is not possible if there are too many states, and that's the one case in which `Auto` picks it. Does that deserve a warning or is it an extreme case? Would an `AhoCorasickKind` variant for "never give me a DFA, but do give me a noncontiguous NFA if you must" be worthwhile? (I guess there's no rush since it's `#[non_exhaustive]` and `build_from_noncontiguous` lets you implement it yourself.)
> 15MB for a `contiguous::NFA`

Looks like it constructs it from a noncontiguous NFA, so peak memory use during construction would be way higher (114MB?)—is this something to warn about?
> Construction can fail in generally one way: when the inputs provided are too big.
It might be helpful to know approximately at which scale that happens, to judge whether it's so outlandish that panicking on trusted data makes sense.
`Input::span()`, `Input::range()`

Why use this instead of slicing the haystack? I suppose it's for "the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the start bound", to get no matches even when searching for an empty string, but I'm not sure I got it right and it took some work putting the pieces together. I'm still fuzzy on the shape of the code that makes use of this. So maybe the docs should preempt this question.

(I also wondered why `Span` only implemented `From<Range<usize>>` and none of the other range types, but now I get that it has to be bounded. Still, it feels too intricate, with multiple types and methods for similar things. Not that I have a better proposal.)
(`AhoCorasickBuilder::byte_classes()`)

> Whether to attempt to shrink the size of the automaton’s alphabet or not. This option is enabled by default and should never be disabled unless one is debugging the underlying automaton.
The warning is probably more than enough, but I wonder if it would be better to already frame it as a debugging aid starting in the first sentence or even the method name (rather than describing it as a setting that happens to only be useful for debugging).
`Match::new()`, `Match::must()`

Why are these public? Why is `must` called that?
`PatternID::new()`, `PatternIDError`

Similar here, are users going to construct these? And are they going to do it so often that it deserves a public error type for the single failure condition?
`enum Anchored`

Is this just a fancy bool? Will it grow more variants? (If so, should it be `#[non_exhaustive]`?)
`find()`, `find_overlapping()`, `find_iter()`, `find_overlapping_iter()`, `stream_find_iter()`, ...

Two holes in this API would be `stream_find()` and `stream_find_overlapping_iter()`. I imagine `stream_find_overlapping_iter()` might not exist because it'd require backtracking (for NFAs only?)—is that worth explaining in the `stream_find_iter()` doc in case people go looking for it?
`AhoCorasick::max_pattern_len()`
I instinctively wondered whether this was expensive (it's an O(1) getter, albeit behind a vtable). Maybe worth documenting? I'm not sure.
`replace_all()`, `replace_all_with()`

I was slightly surprised that these have a different output strategy (return value vs destination buffer). I first saw the method names, then looked at `replace_all()`, and initially assumed that there wouldn't be a version with a `dst` argument, since `with` doesn't suggest that. But I don't have a concrete suggestion, probably fine as is.
> the start bound is exactly one greater than the start bound.

"greater than the end bound"?
> Return the internal value as a i32
> . This is guaranteed to never overflow an i32`.

Missing backtick. `isize` too, I think? A line break messed up the markdown.
@blyxxyz Excellent feedback, thank you. You bring up a lot of good points. I'll respond in more depth soon, hopefully next week. :-)
Thanks for tagging me on this considering I've only left one comment in this repo :) I think `trait Automaton` is a great addition for manual state handling. For the project I was working on, it would at least have allowed me to experiment without having to fork this repo. I'm currently working at a similar level with `regex-automata` and it's nice to see some consistency between the two APIs (e.g. the semantics/explanation of a dead state, getting matches by passing the `StateId` to the `Automaton`).
However, I think there's room for a slightly higher level API that works by pushing `&[u8]` chunks and abstracts some of the more error prone state transition code. Or, at least keeps it neatly separated from buffer handling. Something like:
```rust
fn scan(aut: &impl Automaton, partial_haystack: &[u8], state: StateId) -> ScanResult { /* ... */ }

enum ScanResult {
    // Got to a match state; this state ID is used to get match pattern IDs
    // and can be used in the next scan call for overlapping matches
    Match(usize, StateId),
    // Got to a dead state
    Dead(usize),
    // Reached the end of partial_haystack without a special state; this ID
    // should be used for the call with more input
    NeedsInput(StateId),
}
```
The caller would be in charge of checking the matched patterns and lengths, offsetting the haystack, and buffering on `NeedsInput`, etc. This function is generic enough to allow multiple types of matching (maybe not optimally): overlapping, leftmost, longest, anchored, negative match; all in multiple contexts: reading from `Read`, reading from `AsyncRead`, decorating a `Read` or `AsyncRead` (e.g. removing keywords from a `Read` and having it implement `Read`), reading and implementing `Iterator`s, etc.
Anyways, I don't have any delusions of this being in 1.0, and I think a big benefit of the current API is that these higher level abstractions can live in a completely separate crate.
@ear7h Yeah I do definitely want to hold off on those sorts of mid-level APIs for now (and possibly forever, I dunno). I actually just don't have a great feel for what is actually needed or not. If you'd like to discuss those sorts of API additions in more detail, happy to have that done in a new issue. I want to try to keep this issue focused on the 1.0 API. (That doesn't mean there can't be additions, but in my mind, the scope there is much smaller than adding entirely new search APIs.)
I'd also encourage you to look at the existing stream search implementation. There are some comments in there indicating some interesting challenges.
I read through it all. Looks good to me. 👍
The only comment I have is that the name of the crate option `perf-literal` is not very helpful. The description in the documentation doesn't shed much light on why it has the name it has.

> perf-literal - Enables support for prefilters that use vectorized routines from external crates. This feature is enabled by default. If you’re only using Aho-Corasick for large numbers of patterns or otherwise can abide lower throughput when searching with a small number of patterns, then it is reasonable to disable this feature.
It's not exactly a big deal. The Rust ecosystem in general does not do a good job naming crate options. Maybe people are used to consulting the docs to understand crate options—which they should be doing anyway.
I also want to say I really like that you expose the low-level stuff. I have lost count how many times I've wanted access to something a crate doesn't make public.
I already started using this crate in a project and so far I'm enjoying the experience. The API is clear, easy to use, and well documented. These are the only two areas of improvement I've identified:
In my project I wanted to build the Aho-Corasick automaton and use it simultaneously from multiple threads. The documentation doesn't state whether the automaton is thread-safe, so I had to experiment in order to find out that this is perfectly ok (as I expected) and the `AhoCorasick` type is both `Send` and `Sync`. Thread-safety considerations are worth including in the documentation, given that this is possibly a very common use case.
I find the method `AhoCorasickBuilder::prefilter` a bit confusing. The description is clear enough, but the name of the method itself doesn't reflect its purpose. I would find something like `AhoCorasickBuilder::use_heuristics` more appropriate for the public API. It looks like this method gets its name from the internal pre-filtering mechanism, but I see prefilters as a building block for search heuristics. In other words, using the pre-filtering mechanism you implement a set of specific pre-filters that together make up your search heuristics.

When I saw this option I thought that the public API had some support for pre-filters, as in the RE2 library, but this is not the case. In my opinion the internal term "prefilter" should not be exposed publicly.
@blyxxyz
> At an API level, enabling `std` generally just enables `std::error::Error` trait impls for the various error types.
>
> Looks like it also enables the stream methods?
Yup! Good catch. Fixed.
> What's the catch? Are there situations where a noncontiguous NFA has a much greater advantage than a 10% shorter build time? Would I realistically ever want to choose it explicitly or is it a legacy/unlikely niche thing? "Often the best" and "often faster" sound a little tentative.
>
> OK, so after writing that I read the code and saw that building a contiguous NFA is not possible if there are too many states, and that's the one case in which `Auto` picks it. Does that deserve a warning or is it an extreme case? Would an `AhoCorasickKind` variant for "never give me a DFA, but do give me a noncontiguous NFA if you must" be worthwhile? (I guess there's no rush since it's `#[non_exhaustive]` and `build_from_noncontiguous` lets you implement it yourself.)
I've added this to the docs in that section:
```rust
/// The only "catch" to using a contiguous NFA is that, because of its variety
/// of compression tricks, it may not be able to support automatons as large as
/// what the noncontiguous NFA supports. In which case, building a contiguous
/// NFA will fail and (by default) `AhoCorasick` will automatically fall
/// back to a noncontiguous NFA. (This typically only happens when building
/// automatons from millions of patterns.) Otherwise, the small additional time
/// for building a contiguous NFA is almost certainly worth it.
```
In other words, yes, it's a pretty extreme case. You probably need millions of patterns before failing to build a contiguous NFA.
As for "never give me a DFA," I think I'll pass on that. Generally, a DFA is only used when the number of patterns is very small, which seems like something folks can live with in most cases. And indeed, the lower level APIs permit any kind of customization that you need.
> Looks like it constructs it from a noncontiguous NFA, so peak memory use during construction would be way higher (114MB?)—is this something to warn about?
I added this note:
```rust
/// (Note that the memory usage above reflects the size of each automaton and
/// not peak memory usage. For example, building a contiguous NFA requires
/// first building a noncontiguous NFA. Once the contiguous NFA is built, the
/// noncontiguous NFA is freed.)
```
> It might be helpful to know approximately at which scale that happens, to judge whether it's so outlandish that panicking on trusted data makes sense.
I added this note:
```rust
/// A first approximation for the scale at which
/// construction can fail is somewhere around "millions of patterns."
```
I'm not sure this really impacts whether one should panic on trusted data or not. Or more specifically, patterns known at compile time. Basically, construction will either always fail or always succeed, so if your compile time patterns are so numerous and/or big that automaton construction fails, you'll know it immediately.
If it's data that is derived from runtime, then I'd just return an error.
Basically, the same decision procedure as for `Regex::new`.
> Why use this instead of slicing the haystack? I suppose it's for "the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the start bound", to get no matches even when searching for an empty string, but I'm not sure I got it right and it took some work putting the pieces together. I'm still fuzzy on the shape of the code that makes use of this. So maybe the docs should preempt this question.
>
> (I also wondered why `Span` only implemented `From<Range<usize>>` and none of the other range types but now I get that it has to be bounded. Still, it feels too intricate, with multiple types and methods for similar things. Not that I have a better proposal.)
This is a good question. The `Input` type is an abstraction I developed for `regex-automata`, and there, the use of `Input::span`/`Input::range` is more concretely motivated: a regex search might have look-around assertions, so for example, in order to correctly implement iteration, you need to tell the regex engine both the haystack and the offsets within that haystack to search.

Consider for example the regex `\bfoo\b` and a haystack `h = "barfoo"`. A search on `&h[3..]` will return a match but a search on `h` starting at position `3` will not return a match.
But obviously, `AhoCorasick` does not have look-around assertions. So technically speaking, yes, it is never necessary to use the `span`/`range` APIs unless you need to represent "search is done." Which does seem like kind of a bummer and maybe the API complexity is indeed too high.

One benefit of these APIs though is that if you use them instead of slicing the haystack, then the match positions reported will in turn be offset for you. Otherwise, executing a search on, say, `&h[n..]` usually means adding `n` to whatever match position is returned. Using `Input::span` absolves you from having to worry about that particular detail.
I've added this note to the docs:

```rust
/// Other than representing "search is complete," the `Input::span` and
/// `Input::range` APIs are never necessary. Instead, callers can slice the
/// haystack instead, e.g., with `&haystack[start..end]`. With that said, they
/// can be more convenient than slicing because the match positions reported
/// when using `Input::span` or `Input::range` are in terms of the original
/// haystack. If you instead use `&haystack[start..end]`, then you'll need to
/// add `start` to any match position returned in order for it to be a correct
/// index into `haystack`.
```
As for `Span`, the main reason it exists is because `Range<usize>` doesn't implement `Copy`. It's just too annoying to deal with otherwise. I've also added this note to the docs for `Span`:

```rust
/// Indeed, `Span` exists only because `Range<usize>` does
/// not implement `Copy`.
```
> The warning is probably more than enough, but I wonder if it would be better to already frame it as a debugging aid starting in the first sentence or even the method name (rather than describing it as a setting that happens to only be useful for debugging).
Good point. I'll keep the name, but I've changed the docs to start with "A debug setting ..."
> `Match::new()`, `Match::must()`
>
> Why are these public? Why is `must` called that?
`Match::new` is public because users of this crate should be able to construct their own `Match` objects. It's useful for tests. Also, there are lower level APIs and callers can technically provide their own implementation of the `Automaton` trait. (Although I have considered sealing it to start with, because otherwise adding a new non-default method will be a breaking change and I'm not sure I want to box myself in.) To do that, they need to be able to build `Match` values.
`Match::must` is effectively a convenience wrapper around `Match::new`, but absolves the caller from having to build a `PatternID`. That is, the pattern ID "must" be valid.
There are a fair number of different ways I could go about this I think. For example, another approach might be:
```rust
impl Match {
    // this panics if the TryInto conversion fails
    fn new<P: TryInto<PatternID>, S: Into<Span>>(pid: P, span: S) -> Match;

    // this returns an error if the TryInto conversion fails
    fn new_try<E, P: TryInto<PatternID, Error = E>, S: Into<Span>>(
        pid: P,
        span: S,
    ) -> Result<Match, E>;
}
```
But this is kind of a mouthful I think? There's probably a bigger design space here to be honest.
> Similar here, are users going to construct these? And are they going to do it so often that it deserves a public error type for the single failure condition?
I'll noodle on this. `PatternID::new` should definitely be public for the same reason `Match::new` should be public. And if `PatternID::new` is public, then I don't really see how to get around returning an error type there. I could return a `Box<dyn Error>` instead, but that seems a bit idiosyncratic.
> `enum Anchored`
>
> Is this just a fancy bool? Will it grow more variants? (If so, should it be `#[non_exhaustive]`?)
Yes, it's a fancy bool. I don't think there are any other variants it could get, but I'll mark it as `#[non_exhaustive]` to be conservative. We can revisit that if someone has a use case for making it exhaustive.
> Two holes in this API would be `stream_find()` and `stream_find_overlapping_iter()`. I imagine `stream_find_overlapping_iter()` might not exist because it'd require backtracking (for NFAs only?)—is that worth explaining in the `stream_find_iter()` doc in case people go looking for it?
I don't think I fully understand what is or isn't possible in the "stream searching" case. If folks need a specific kind of stream searching API, I'd expect them to file an issue describing their use case. (Which I think is true for any API that doesn't exist but that someone wants.)
> `AhoCorasick::max_pattern_len()`
>
> I instinctively wondered whether this was expensive (it's an O(1) getter, albeit behind a vtable). Maybe worth documenting? I'm not sure.
Hmmm. The problem with documenting that this is a cheap getter method is that there are a whole bunch of cheap getter methods. And if this one is documented to be cheap but the others aren't, then that could lead one to conclude that the others aren't cheap. So if I document this one, I kind of need to just slap that on all of the getter methods. And that seems like kind of a bummer to me.
I'll leave this one as-is for now.
> I was slightly surprised that these have a different output strategy (return value vs destination buffer). I first saw the method names, then looked at `replace_all()`, and initially assumed that there wouldn't be a version with a `dst` argument, since `with` doesn't suggest that. But I don't have a concrete suggestion, probably fine as is.
Yeah I think it's just more a case of not adding all possible permutations. That is, you can either use the higher level convenience method, or if you want more control over replacements then you also need to supply the destination buffer. Otherwise, I'd need to expand it to add one with just the destination buffer and one with just the closure.
Thanks again for the excellent feedback!
@rljacobson
> The only comment I have is that the name of the crate option `perf-literal` is not very helpful. The description in the documentation doesn't shed much light on why it has the name it has.
Yeah that's a good point. The feature name is derived from the name of the same feature in the `regex` crate. In that context, "literal" is something distinct from "regex," but here in `aho-corasick`, "literal" is not distinct. All patterns are just literals.

I used "literal prefilter" in the docs, but I'm going to keep the name to be consistent with `regex`.
> I already started using this crate in a project and so far I'm enjoying the experience. The API is clear, easy to use, and well documented. These are the only two areas of improvement I've identified:
Thanks! :-)
> * In my project I wanted to build the Aho-Corasick automaton and use it simultaneously from multiple threads. The documentation doesn't state whether the automaton is thread-safe, so I had to experiment in order to find out that this is perfectly ok (as I expected) and the `AhoCorasick` type is both `Send` and `Sync`. Thread-safety considerations are worth including in the documentation, given that this is possibly a very common use case.
I'm going to pass on this. The reason why is because it's already documented by virtue of being both `Send` and `Sync`. That's all there is to it. There aren't any footguns here. You literally cannot misuse an `AhoCorasick` value in a multi-threaded environment.
I'm not sure if you're new to Rust or not, but if you are, I can see how this might be too subtle and the docs might be a good place to state it explicitly. The problem is that this is just repeating something that is already indicated by the traits it implements. If we were to take your route, the docs for most types would say "this can be used from multiple threads simultaneously." In other words, it just becomes redundant noise.
> * I find the method `AhoCorasickBuilder::prefilter` a bit confusing. The description is clear enough, but the name of the method itself doesn't reflect its purpose. I would find something like `AhoCorasickBuilder::use_heuristics` more appropriate for the public API. It looks like this method gets its name from the internal pre-filtering mechanism, but I see prefilters as a building block for search heuristics. In other words, using the pre-filtering mechanism you implement a set of specific pre-filters that together make up your search heuristics. When I saw this option I thought that the public API had some support for pre-filters, as in the RE2 library, but this is not the case. In my opinion the internal term "prefilter" should not be exposed publicly.
I appreciate your perspective here, but `prefilter` is IMO the right term. I have multiple arguments here:

* It is more precise than `use_heuristics`, because this crate uses several heuristics and not all of them are prefilters. Prefilters are a specific kind of heuristic that callers might want to disable, but it doesn't make sense to disable all of them. (And indeed, they practically can't.)
* It is consistent with `memchr`, via `memchr::memmem::FinderBuilder::prefilter`, and, soon, `regex-automata::util::prefilter`. They all mean the same thing.
I would love to collect feedback on the API of this crate before I publish a 1.0 release. My plan is to publish a 1.0 release Very Soon, probably within the next month.
Here are the docs for the code that is on current master, which will become the 1.0 release: https://burntsushi.net/stuff/tmp-do-not-link-me/aho-corasick/aho_corasick/
For most uses of this crate, only very small changes will be needed to migrate from the current 0.7 release to 1.0. Probably the biggest change is that `AhoCorasick::new` is now fallible.

Anyway, I'd love to hear feedback on the API. I welcome feedback from all perspectives. To be doubly clear there, even if you have no earthly clue what this funny sounding "Aho-Corasick" thing is, I want feedback from you. I also want feedback from you if you've been using this library for years already.
NOTE: The 1.0 release will bump the MSRV to Rust 1.60, which was released about 1 year ago. But note that, as is consistent with current policy, I do not consider MSRV bumps to be semver incompatible changes. With that said, as this is a widely used library (albeit indirectly, primarily as a dependency of the `regex` crate), I will generally try to stay pretty conservative with respect to MSRV bumps.

CC'ing some folks that I would love to hear from if they have the time: @itamarst, @plusvic, @rljacobson, @Triften, @ear7h, @vks, @Marwes