BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense
1.03k stars 93 forks source link

Enabling DFA breaks ASCII case insensitivity #51

Closed alexkornitzer closed 4 years ago

alexkornitzer commented 4 years ago

It appears that enabling dfa will stop ascii_case_insensitive from working, or are these two features intended to be mutally exclusive?

The below code will fail with dfa set to true, but will pass with it set to false.

extern crate aho_corasick;

use aho_corasick::AhoCorasickBuilder;

fn main() {
    let patterns = &["Samwise"];
    let haystack = "SAMWISE.abcd";

    let ac = AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .dfa(true)
        .build(patterns);
    let mat = ac.find(haystack).expect("should have a match");
    assert_eq!("SAMWISE", &haystack[mat.start()..mat.end()]);
}

Apologies in advanced if this is in the docs and I am just being unobservant.

BurntSushi commented 4 years ago

Oof. This was a nasty bug. Thank you for reporting it. There was an issue building the equivalence classes during NFA construction, and those are only used via DFA matching. This bug is fixed in aho-corasick 0.7.7. If you want to work around it without upgrading, then you could disable byte classes via byte_classes(false), but I'd recommend upgrading if you can. :-)

alexkornitzer commented 4 years ago

Ah amazing, thanks for the rapid turnaround, super happy this is a bug and not intended behaviour (would have made my life hell :P). Will be updating straight away as DFA shaves of a considerable amount of time in my tagging engine!

alexkornitzer commented 4 years ago

I think that has fixed case_insensitive in RegexBuilder from the regex crate too, which I assume uses dfa by default?

BurntSushi commented 4 years ago

@AlexKornitzer The regex crate is a bit more complicated. It doesn't actually use the ascii_case_insensitive option from this crate. It should, but more analysis work is required to take advantage of it. (regex does Unicode case insensitivity by default, so using ascii_case_insensitive would only work when Unicode is disabled or if ASCII case insensitivity and Unicode case insensitivity would lead to the same result in some specific circumstances.)