BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

Default `anchored` value in a new `Input`. #114

Closed rljacobson closed 1 year ago

rljacobson commented 1 year ago

The default anchored state for a new Input is confusing, at least in the case that the Input is created automagically from, say, a &str. Consider:

use aho_corasick::{AhoCorasick, StartKind};

fn main() {
    let ac = AhoCorasick::builder()
        .start_kind(StartKind::Anchored)
        .build(&["a", "cab", "bc"])
        .unwrap();
    let haystack = "abcd";

    match ac.try_find(haystack) {
        Ok(Some(_)) => {
            // Expected
            println!("Success");
        }
        Ok(None) => {
            println!("Failed to find a match");
        }
        Err(e) => {
            eprintln!("ERROR: {}", e); // ERROR: unanchored searches are not supported or enabled
        }
    }
}

Despite having explicitly set the start_kind to StartKind::Anchored, the search uses the default of the implicitly created Input object, which is Anchored::No. As a user, I would expect a call to try_find(…) with a &str to default to the start_kind that was explicitly set on the matcher.

I'm reciting my understanding of what's going on here in case I am seriously misunderstanding it, so please forgive the unnecessary detail.

Different constructions are required for the different start_kind behaviors of AhoCorasick, and so the StartKind enum has three variants, Anchored, Unanchored, and Both, corresponding to which constructions the object uses. Given that an AhoCorasick might support either start behavior, it makes sense for an Input to specify which behavior to use.

The docs are fairly clear that automatic conversion from &str to Input uses the default Anchored variant and that the default Anchored variant is Anchored::No (e.g. here). So this is definitely a user error.

The downside is that client code is required to keep the start behavior of the Input in sync with the AhoCorasick even when it is implicitly determined. Would one of these alternatives make more sense?

  1. The Anchored enum has an Anchored::Default variant which defers to the AhoCorasick. Under this scheme, the default behavior is moved out of the Input object and into the AhoCorasick object, and the default is only used when client code does not explicitly specify the start behavior and the AhoCorasick supports both start behaviors.
  2. The Anchored has an Anchored::Deferred variant, but the default for Input remains Anchored::No (in Input::new(…)) except when a T: AsRef<[u8]> is converted to an Input, in which case Anchored::Deferred is used as the default for the Input. This scheme is a bit more complex, as there are now three "defaults":
    1. Anchored:No when a variant is not specified for a new explicitly created Input (as it is now);
    2. Anchored::Deferred when a T: AsRef<[u8]> is converted to an Input;
    3. StartKind::Unanchored for an AhoCorasick that is created without StartKind being set explicitly (as it is now).

The advantage of scheme (1) is that you retain the unanchored default start behavior but throw an error or panic in strictly fewer cases. The advantage of (2) is that fewer things change from the perspective of the API user, but at the cost of automatic conversion from a T: AsRef<[u8]> being treated as a "special case", which might feel inconsistent to some people. Both have the advantage that less needs to be known by the programmer about the details of the API, IMHO. Also, in both alternatives, a third variant is added to the Anchored enum, so there's the opportunity to unify the Anchored and StartKind enums into a single enum—but some care and creativity would be required in naming the third variant.

Even if you agree, I appreciate the fact that it's an API change to a very mature crate and so might not be worth doing on those grounds alone.

BurntSushi commented 1 year ago

I think I'm going to keep things how they are, but if there's some added documentation that would have helped make this clearer, I'm totally on board with that.

There are a few reasons why:

Basically, I acknowledge that you hit an unexpected case. But you got a loud failure about it. IMO, that's the API design working as intended. You didn't get subtly different match semantics or a silent failure. It hit you in the face. That's what I was hoping would happen, if that makes sense. :-)

rljacobson commented 1 year ago

The StartKind configuration is not part of the type of an AhoCorasick automaton.

I think this alone is reason enough to keep it as it is.

But you got a loud failure about it. IMO, that's the API design working as intended.

Yes, agreed. In the kinds of programs I write, anchored searches are the most common use cases. But obviously my expectations shouldn't drive API design if I'm not the typical case. And like I said, the docs are pretty clear.