BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

Default automaton #143

Closed xamgore closed 5 months ago

xamgore commented 5 months ago

tldr; I suggest defining an empty automaton struct implementing Automaton trait with semantics like return no matches.

Somewhere in the source code I've seen a line saying it's a nonsense to run automaton with an empty pattern set. While this may be true, the absence of the Default implementation for automaton affects wrapping structs too.

struct Patterns {
    automaton: Arc<dyn Automaton>,
}

impl Default for Patterns {
    fn default() -> Self {
        Self { automaton: ... } // ???
    }
}

impl Automaton for MyDefaultAuto { ... } // sealed!

Of course, the field could be of type Option<Arc<dyn Automaton>>. It means all callers now have to check the presence of the automaton each time they deal with it. The other pitfall is a build method, now you can't just proxy the iterator downwards. We need to check it for emptiness.

fn build<I, P>(patterns: I) -> Result<Arc<dyn Automaton>, BuildError>
where
    I: IntoIterator<Item = P>,
    P: AsRef<[u8]> { ... }

impl Patterns {
    pub fn from_iter<I, P>(patterns: I) -> Result<Self, BuildError> {
        // have to look inside :(
        let patterns = {
            let mut patterns = patterns.into_iter().peekable();
            if patterns.peek_mut().is_none() {
                return Ok(Self { automaton: None });
            }
            patterns
        };

        Ok(Self { automaton: build(patterns) })
    }
}

What I need is just an automaton that does nothing, maybe has a single state, and returns no matches. Can make a PR if everything aforementioned seems reasonable to you.


PS. Thanks for the great job! 😌

BurntSushi commented 5 months ago

Huh? I can't make heads or tails of what you're talking about. Are you just asking for a Default impl?

The automaton should work fine with zero patterns.

BurntSushi commented 5 months ago

Now that I have hands on a keyboard, here's an example:

use aho_corasick::AhoCorasick;

fn main() {
    let ac = AhoCorasick::new(None::<&str>).unwrap();
    dbg!(ac.find("foo"));
}

With output:

$ cargo r -q
[main.rs:5:5] ac.find("foo") = None

It sounds like all you're asking for here is that AhoCorasick implement Default, where the default impl would be zero patterns.

I basically agree that this is a reasonable default, but I am somewhat hesitant to commit to it. I would say that it is odd since it is effectively useless. And I suppose could be surprising in some cases. But I do sympathize with the idea behind making derive(Default) easier for types that own an AhoCorasick.

I suppose it is perhaps somewhat analogous to String::default(), which just gives you the empty string. But maybe the difference there is that you can then add to the String. You can't add to an existing AhoCorasick value.

xamgore commented 5 months ago

The automaton should work fine with zero patterns.

I've extracted StreamChunkIter implementation, updated it to provide a context (stream position, xml stack level, ...). According to your comment, a dyn Automaton with zero patterns returns "match event" on each character. Thus:


Maybe I don't see cons (would like to hear!), and my suggestions are:

  1. At the first step, implement a NoOpAutomaton (single state, no matches).
  2. The second step is to deprecate MatchKind::UnsupportedEmpty, as it's a hidden bomb 💣
    1. Replace it with a no-op automaton or/and
    2. Put aut.min_pattern_len() == 0 into helper functions like try_stream_replace_all, and redirect the rdr into wtr (no patterns means no replaces)
      1. People who care about time, already have put ifs, so they don't face downgrades at all.
BurntSushi commented 5 months ago

Please pop up a level. You're in the implementation weeds. What problem are you trying to solve as a user of this library?

xamgore commented 5 months ago

The automaton should work fine with zero patterns.

Yeah, seems like it does, no matches found. I used to think it wouldn't due to your comment, which is obsolete probably.

use aho_corasick::{Anchored, nfa, automaton::Automaton}; // 1.1.3

fn main() {
    let patterns = Vec::<&str>::new();

    let nfa = nfa::noncontiguous::Builder::default().build(patterns).unwrap();
    let mut sid = nfa.start_state(Anchored::No).unwrap();

    for &byte in "test".as_bytes() {
      sid = nfa.next_state(Anchored::No, sid, byte);
      assert!(!nfa.is_match(sid));
    }
}

Thanks for attention, sorry for taking your time. Default AhoCorasik is just AhoCorasick::new([]).

BurntSushi commented 5 months ago

That comment is still correct. It says empty matches, which is referring to zero-width matches, not "matches for the empty automaton."

xamgore commented 5 months ago

Indeed. 🤦🏻‍♂️