BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

RFC: review API before 1.0 release #108

Closed BurntSushi closed 1 year ago

BurntSushi commented 1 year ago

I would love to collect feedback on the API of this crate before I publish a 1.0 release. My plan is to publish a 1.0 release Very Soon, probably within the next month.

Here are the docs for the code that is on current master, which will become the 1.0 release: https://burntsushi.net/stuff/tmp-do-not-link-me/aho-corasick/aho_corasick/

For most uses of this crate, only very small changes will be needed to migrate from the current 0.7 release to 1.0. Probably the biggest change is that AhoCorasick::new is now fallible.

Anyway, I'd love to hear feedback on the API. I welcome feedback from all perspectives. To be doubly clear there, even if you have no earthly clue what this funny sounding "Aho-Corasick" thing is, I want feedback from you. I also want feedback from you if you've been using this library for years already.

NOTE: The 1.0 release will bump the MSRV to Rust 1.60, which was released about 1 year ago. But note that, as is consistent with current policy, I do not consider MSRV bumps to be semver incompatible changes. With that said, as this is a widely used library (albeit indirectly, primarily as a dependency of the regex crate), I will generally try to stay pretty conservative with respect to MSRV bumps.

CC'ing some folks that I would love to hear from if they have the time: @itamarst, @plusvic, @rljacobson, @Triften, @ear7h, @vks, @Marwes

itamarst commented 1 year ago

I will try to port the Python package to this release in a branch, and report back.

BurntSushi commented 1 year ago

Oh that would be great, thank you!

vks commented 1 year ago

Is there some kind of changelog, or should I have a look at the Git history?

BurntSushi commented 1 year ago

@vks I don't keep a changelog for this crate. I can summarize some of the differences here, but I'm definitely interested in a holistic look at the API that is provided on master. The git history probably won't help much, because a lot of the work happened in one commit. Anyway, here is a summary:

In terms of implementation, things largely remain the same with one very (IMO) awesome addition: there is a new type of Aho-Corasick automaton called a "contiguous NFA." The old NFA has been rebraned as a non-contiguous NFA. (And the old DFA is still called a DFA.) The contiguous NFA uses a lot less memory and is typically quite a bit faster than the non-contiguous NFA. It represents a very nice middle point between the slower old NFA and the piggish DFA. Indeed, most AhoCorasick configurations in the 1.0 release will use this new contiguous NFA.

Darksonn commented 1 year ago

You should consider adding an alloc feature so that you can add non-allocating APIs in the future without a breaking change. For now, you could compile out everything when the feature is disabled.

BurntSushi commented 1 year ago

@Darksonn I thought about that, and I might still do it, but I don't see a core-only mode happening. The only way it happens is if I had zero-copy deserialization APIs, and the only way that happens is with API changes. So I'd need a aho-corasick 2.0 anyway for something like that.

blyxxyz commented 1 year ago

Level of familiarity: I used this library for a toy program once. When I started reviewing I already knew its purpose but I had only a vague vision of both internals and externals. I can only guess what real-world users do and don't care about.

For this pass I only read docs and a little library code. I didn't yet try it out by writing code.

I wrote down everything I noticed, some of it's probably too subjective or missing the point.

At an API level, enabling std generally just enables std::error::Error trait impls for the various error types.

Looks like it also enables the stream methods?

This experiment very strongly argues that a contiguous NFA is often the best balance in terms of resource usage. It takes a little longer to build, but its memory usage is quite small. Its search speed (not listed) is also often faster than a noncontiguous NFA, but a little slower than a DFA. Indeed, when AhoCorasickKind::Auto is used (which is the default), a contiguous NFA is used in most cases.

What's the catch? Are there situations where a noncontiguous NFA has a much greater advantage than a 10% shorter build time? Would I realistically ever want to choose it explicitly or is it a legacy/unlikely niche thing? "Often the best" and "often faster" sound a little tentative.

OK, so after writing that I read the code and saw that building a contiguous NFA is not possible if there are too many states, and that's the one case in which Auto picks it. Does that deserve a warning or is it an extreme case? Would an AhoCorasickKind variant for "never give me a DFA, but do give me a noncontiguous NFA if you must" be worthwhile? (I guess there's no rush since it's #[non_exhaustive] and build_from_noncontiguous lets you implement it yourself.)

15MB for a contiguous::NFA

Looks like it constructs it from a noncontiguous NFA, so peak memory use during construction would be way higher (114MB?)—is this something to warn about?

Construction can fail in generally one way: when the inputs provided are too big.

It might be helpful to know approximately at which scale that happens, to judge whether it's so outlandish that panicking on trusted data makes sense.

Input::span(), Input::range()

Why use this instead of slicing the haystack? I suppose it's for "the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the start bound", to get no matches even when searching for an empty string, but I'm not sure I got it right and it took some work putting the pieces together. I'm still fuzzy on the shape of the code that makes use of this.

So maybe the docs should preempt this question.

(I also wondered why Span only implemented From<Range<usize>> and none of the other range types but now I get that it has to be bounded. Still, it feels too intricate, with multiple types and methods for similar things. Not that I have a better proposal.)

(AhoCorasickBuilder::byte_classes()) Whether to attempt to shrink the size of the automaton’s alphabet or not.

This option is enabled by default and should never be disabled unless one is debugging the underlying automaton.

The warning is probably more than enough, but I wonder if it would be better to already frame it as a debugging aid starting in the first sentence or even the method name (rather than describing it as a setting that happens to only be useful for debugging).

Match::new(), Match::must()

Why are these public? Why is must called that?

PatternID::new(), PatternIDError

Similar here, are users going to construct these? And are they going to do it so often that it deserves a public error type for the single failure condition?

enum Anchored

Is this just a fancy bool? Will it grow more variants? (If so, should it be #[non_exhaustive]?)

find(), find_overlapping(), find_iter(), find_overlapping_iter(), stream_find_iter(), ...

Two holes in this API would be stream_find() and stream_find_overlapping_iter(). I imagine stream_find_overlapping_iter() might not exist because it'd require backtracking (for NFAs only?)—is that worth explaining in the stream_find_iter() doc in case people go looking for it?

AhoCorasick::max_pattern_len()

I instinctively wondered whether this was expensive (it's an O(1) getter, albeit behind a vtable). Maybe worth documenting? I'm not sure.

replace_all(), replace_all_with()

I was slightly surprised that these have a different output strategy (return value vs destination buffer). I first saw the method names, then looked at replace_all(), and initially assumed that there wouldn't be a version with a dst argument, since with doesn't suggest that. But I don't have a concrete suggestion, probably fine as is.

Typos

the start bound is exactly one greater than the start bound.

greater than the end bound

Return the internal value as a i32. This is guaranteed to never overflow ani32`.

Missing backtick

Since values represented by a “small index” have constraints on their maximum value, adding 1 to it will always fit in a usize, u32 and a i32.

isize too, I think?

AhoCorasickBuilder::byte_classes()

A line break messed up the markdown

BurntSushi commented 1 year ago

@blyxxyz Excellent feedback, thank you. You bring up a lot of good points. I'll respond in more depth soon, hopefully next week. :-)

ear7h commented 1 year ago

Thanks for tagging me on this considering I've only left one comment in this repo :) I think trait Automaton is a great addition for manual state handling. For the project I was working on, it would have at least have allowed me to experiment without having to fork this repo. I'm currently working at a similar level with regex-automata and it's nice to see some consistency between the two APIs (eg. The semantics/explanation of a dead state, getting matches by passing the StateId to the Automaton).

However, I think there's room for a slightly higher level API that works by pushing &[u8] chunks and abstracts some of the more error prone state transition code. Or, at least keeps it neatly separated from buffer handling. Something like:

fn scan(aut : &impl Automaton, partial_haystack : &'a [u8], state : StateId) -> ScanResult {...}

enun ScanResult {
    // Got to a match state, this state ID is used to get match pattern IDs and can be used in the next scan call for overlapping matches
    Match(usize, StateId),
    // Got to a dead state
    Dead(usize),
    // Reached end of partial_haystack without a special state, this id should be used for the call with more input
    NeedsInput(StateId),
}

The caller would be in charge of checking the matched patterns and lengths, offsetting the haystack, and buffering on NeedsInput, etc. This function is generic enough to allow multiple types of matching (maybe not optimally): overlapping, leftmost, longest, anchored, negative match; all in multiple contexts: reading from Read, reading fromAsyncRead, decorating a Read or AsyncRead (eg. remove keywords from Read and have it implement Read), reading and implementing Iterators, etc.

Anyways, I don't have any delusions of this being in 1.0, and I think a big benefit of the current API is that these higher level abstractions can live in a completely separate crate.

BurntSushi commented 1 year ago

@ear7h Yeah I do definitely want to hold off on those sorts of mid-level APIs for now (and possibly forever, I dunno). I actually just don't have a great feel for what is actually needed or not. If you'd like to discuss those sorts of API additions in more detail, happy to have that done in a new issue. I want to try to keep this issue focused on the 1.0 API. (That doesn't mean there can't be additions, but in my mind, the scope there is much smaller than adding entirely new search APIs.)

I'd also encourage you to look at the existing stream search implementation. There are some comments in there indicating some interesting challenges.

rljacobson commented 1 year ago

I read through it all. Looks good to me. 👍

The only comment I have is that the name of the crate option perf-literal is not very helpful. The description in the documentation doesn't shed much light on why it has the name it has.

perf-literal - Enables support for prefilters that use vectorized routines from external crates. This feature is enabled by default. If you’re only using Aho-Corasick for large numbers of patterns or otherwise can abide lower throughput when searching with a small number of patterns, then it is reasonable to disable this feature.

It's not exactly a big deal. The Rust ecosystem in general does not do a good job naming crate options. Maybe people are used to consulting the docs to understand crate options—which they should be doing anyway.

I also want to say I really like that you expose the low-level stuff. I have lost count how many times I've wanted access to something a crate doesn't make public.

plusvic commented 1 year ago

I already started using this crate in a project and so far I'm enjoying the experience. The API is clear, easy to use, and well documented. These are the only two areas of improvement I've identified:

BurntSushi commented 1 year ago

@blyxxyz

At an API level, enabling std generally just enables std::error::Error trait impls for the various error types.

Looks like it also enables the stream methods?

Yup! Good catch. Fixed.

What's the catch? Are there situations where a noncontiguous NFA has a much greater advantage than a 10% shorter build time? Would I realistically ever want to choose it explicitly or is it a legacy/unlikely niche thing? "Often the best" and "often faster" sound a little tentative.

OK, so after writing that I read the code and saw that building a contiguous NFA is not possible if there are too many states, and that's the one case in which Auto picks it. Does that deserve a warning or is it an extreme case? Would an AhoCorasickKind variant for "never give me a DFA, but do give me a noncontiguous NFA if you must" be worthwhile? (I guess there's no rush since it's #[non_exhaustive] and build_from_noncontiguous lets you implement it yourself.)

I've added this to the docs in that section:

/// The only "catch" to using a contiguous NFA is that, because of its variety
/// of compression tricks, it may not be able to support automatons as large as
/// what the noncontiguous NFA supports. In which case, building a contiguous
/// NFA will fail and (by default) `AhoCorasick` will automatically fall
/// back to a noncontiguous NFA. (This typically only happens when building
/// automatons from millions of patterns.) Otherwise, the small additional time
/// for building a contiguous NFA is almost certainly worth it.

In other words, yes, it's a pretty extreme case. You probably need millions of patterns before failing to build a contiguous NFA.

As for "never me a DFA," I think I'll pass on that. Generally, a DFA is only used when the number of patterns is very small, which seems like something folks can live with in most cases. And indeed, the lower level APIs permit any kind of customization that you need.

Looks like it constructs it from a noncontiguous NFA, so peak memory use during construction would be way higher (114MB?)—is this something to warn about?

I added this note:

/// (Note that the memory usage above reflects the size of each automaton and
/// not peak memory usage. For example, building a contiguous NFA requires
/// first building a noncontiguous NFA. Once the contiguous NFA is built, the
/// noncontiguous NFA is freed.)

It might be helpful to know approximately at which scale that happens, to judge whether it's so outlandish that panicking on trusted data makes sense.

I added this note:

/// A first approximation for the scale at which
/// construction can fail is somewhere around "millions of patterns."

I'm not sure this really impacts whether one should panic on trusted data or not. Or more specifically, patterns known at compile time. Basically, construction will either always fail or always succeed, so if your compile time patterns are so numerous and/or big that automaton construction fails, you'll know it immediately.

If it's data that is derived from runtime, then I'd just return an error.

Basically, the same decision procedure as for Regex::new.

Why use this instead of slicing the haystack? I suppose it's for "the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the start bound", to get no matches even when searching for an empty string, but I'm not sure I got it right and it took some work putting the pieces together. I'm still fuzzy on the shape of the code that makes use of this.

So maybe the docs should preempt this question.

(I also wondered why Span only implemented From<Range<usize>> and none of the other range types but now I get that it has to be bounded. Still, it feels too intricate, with multiple types and methods for similar things. Not that I have a better proposal.)

This is a good question. The Input type is an abstraction I developed for regex-automata, and there, the use of Input::span/Input::range is more concretely motivated: a regex search might have look-around assertions, so for example, in order to correctly implement iteration, you need to tell the regex engine both the haystack and the offsets within that haystack to search.

Consider for example the regex \bfoo\b and a haystack h = "barfoo". A search on &h[3..] will return a match but a search on h starting at position 3 will not return a match.

But obviously, AhoCorasick does not have look-around assertions. So technically speaking, yes, it is never necessary to use the span/range APIs unless you need to represent "search is done." Which does seem like kind of a bummer and maybe the API complexity is indeed too high.

One benefit of these APIs though is that if you use them instead of slicing the haystack, then the match positions reported will in turn be offset for you. Otherwise, executing a search on, say, &h[n..] usually means adding n to whatever match position is returned. Using Input::span absolves you from having to worry about that particular detail.

I've added this note to the docs:

/// Other than representing "search is complete," the `Input::span` and
/// `Input::range` APIs are never necessary. Instead, callers can slice the
/// haystack instead, e.g., with `&haystack[start..end]`. With that said, they
/// can be more convenient than slicing because the match positions reported
/// when using `Input::span` or `Input::range` are in terms of the original
/// haystack. If you instead use `&haystack[start..end]`, then you'll need to
/// add `start` to any match position returned in order for it to be a correct
/// index into `haystack`.

As for Span, the main reason it exists is because Range<usize> doesn't implement Copy. It's just too annoying to deal with otherwise. I've also added this note to the docs for Span:

/// Indeed, `Span` exists only because `Range<usize>` does
/// not implement `Copy`.

The warning is probably more than enough, but I wonder if it would be better to already frame it as a debugging aid starting in the first sentence or even the method name (rather than describing it as a setting that happens to only be useful for debugging).

Good point. I'll keep the name,but I've changed the docs to start with "A debug setting ..."

Match::new(), Match::must()

Why are these public? Why is must called that?

Match::new is public because users of this crate should be able to construct their own Match objects. It's useful for tests. Also, there are lower level APIs and callers can technically provide their own implementation of the Automaton trait. (Although I have considered sealing it to start with, because otherwise adding a new non-default method will be a breaking change and I'm not sure I want to box myself in.) To do that, they need to be able to build Match values.

Match::must is effectively a convenience wrapper around Match::new, but absolves the caller from having to build a PatternID. That is, the pattern ID "must" be valid.

There are a fair number of different ways I could go about this I think. For example, another approach might be:

impl Match {
    // this panics if the TryInto conversion fails
    fn new<P: TryInto<PatternID>, S: Into<Span>>(pid: P, span: S) -> Match;
    // this returns an error if the TryInto conversion fails
    fn new_try<E, P: TryInto<PatternID, Error=E>, S: Into<Span>>(pid: P, span: S) -> Result<Match, E>;
}

But this is kind of a mouthful I think? There's probably a bigger design space here to be honest.

Similar here, are users going to construct these? And are they going to do it so often that it deserves a public error type for the single failure condition?

I'll noodle on this. PatternID::new should definitely be public for the same reason Match::new should be public. And if PatternID::new is public, then I don't really see how to get around returning an error type there. I could return a Box<dyn Error> instead, but that seems a bit idiosyncratic.

enum Anchored

Is this just a fancy bool? Will it grow more variants? (If so, should it be #[non_exhaustive]?)

Yes, it's a fancy bool. I don't think there are any other variants it could get, but I'll mark it as non_exhaustive to be conservative. We can revisit that if someone has a use case for making it exhaustive.

Two holes in this API would be stream_find() and stream_find_overlapping_iter(). I imagine stream_find_overlapping_iter() might not exist because it'd require backtracking (for NFAs only?)—is that worth explaining in the stream_find_iter() doc in case people go looking for it?

I don't think I fully understand what is or isn't possible in the "stream searching" case. If folks need a specific kind of stream searching API, I'd expect them to file an issue describing their use case. (Which I think is true for any API that doesn't exist but that someone wants.)

AhoCorasick::max_pattern_len()

I instinctively wondered whether this was expensive (it's an O(1) getter, albeit behind a vtable). Maybe worth documenting? I'm not sure.

Hmmm. The problem with documenting that this is a cheap getter method is that there are a whole bunch of cheap getter methods. And if this one is documented to be cheap but the others aren't, then that could lead one to conclude that the others aren't cheap. So if I document this one, I kind of need to just slap that on all of the getter methods. And that seems like kind of a bummer to me.

I'll leave this one as-is for now.

I was slightly surprised that these have a different output strategy (return value vs destination buffer). I first saw the method names, then looked at replace_all(), and initially assumed that there wouldn't be a version with a dst argument, since with doesn't suggest that. But I don't have a concrete suggestion, probably fine as is.

Yeah I think it's just more a case of not adding all possible permutations. That is, you can either use the higher level convenience method, or if you want more control over replacements then you also need to supply the destination buffer. Otherwise, I'd need to expand it to add one with just the destination buffer and one with just the closure.

Thanks again for the excellent feedback!

BurntSushi commented 1 year ago

@rljacobson

The only comment I have is that the name of the crate option perf-literal is not very helpful. The description in the documentation doesn't shed much light on why it has the name it has.

Yeah that's a good point. The feature name is derived from the same name of the same feature from the regex crate. In that context, "literal" is something distinct from "regex," but here in aho-corasick, "literal" is not distinct. All patterns are just literals.

I used "literal prefilter" in the docs, but I'm going to keep the name to be consistent with regex.

BurntSushi commented 1 year ago

I already started using this crate in a project and so far I'm enjoying the experience. The API is clear, easy to use, and well documented. These are the only two areas of improvement I've identified:

Thanks! :-)

* In my project I wanted to build the Aho-Corasick automaton and use it simultaneously from multiple threads. The documentation doesn't state whether the automaton is thread-safe, so I had to experiment in order to find out that this is perfectly ok (as I expected) and the `AhoCorasick` type is both `Send` and `Sync`. Thread-safety considerations are worth to be included in the documentation, given that this is possibly a very common use case.

I'm going to pass on this. The reason why is because it's already documented by virtue of being both Send and Sync. That's all there is to it. There aren't any footguns here. You literally cannot misuse an AhoCorasick value in a multi-threaded environment.

I'm not sure if you're new to Rust or not, but if you are, I can see how this might be too subtle and the docs might be a good place to state it explicitly. The problem is that this is just repeating something that is already indicated by the traits it implements. If we were to take your route, the docs for most types would say "this can be used from multiple threads simultaneously." In other words, it just becomes redundant noise.

* I find the method `AhoCorasickBuilder::prefilter` a bit confusing, the description is clear enough, but the name of the method itself doesn't reflect its purpose. I would find something like `AhoCorasickBuilder::use_heuristics` more appropriate for the public API. It looks that this method gets its name from the internal pre-filtering mechanism, but I see prefilters as a building block for search heuristics. In other words, using the pre-filtering mechanism you implement a set of specific pre-filters that together conform your search heuristics.
  When I saw this option I though that the public API had some support for pre-filters, as in the RE2 library, but this is not
  the case. In my opinion the internal term "prefilter" should not be exposed publicly.

I appreciate your perspective here, but prefilter is IMO the right term. I have multiple arguments here: