G-Research / ahocorasick_rs

Check for multiple patterns in a single string at the same time: a fast Aho-Corasick algorithm for Python
Apache License 2.0

find_matches_as_indexes returns bogus values #83

Closed nijel closed 1 year ago

nijel commented 1 year ago

At https://github.com/WeblateOrg/weblate/issues/10002 we observed find_matches_as_indexes returning bogus values (bigger than sys.maxsize). While I'm still trying to understand the issue, I looked at the ahocorasick_rs code and have one question about find_matches_as_indexes:

https://github.com/G-Research/ahocorasick_rs/blob/7bcfe13711ef81a43cbae1ecff87f7d3847a4744/src/lib.rs#L211C1-L231C6

It uses as_u64 to convert the pattern index, but byte_to_code_point to convert the start and end positions. Couldn't this be the place where something goes wrong? I would expect all of them to be converted as integers.

BurntSushi commented 1 year ago

Can you provide a small Python program that reproduces the problem?

nijel commented 1 year ago

Okay, it is caused by an empty pattern string (which looks like a bug on our side) combined with some Unicode characters in the haystack.

Minimal reproducer:

import ahocorasick_rs

automaton = ahocorasick_rs.AhoCorasick([""])

print(automaton.find_matches_as_indexes('”'))

It prints:

[(0, 0, 0), (0, 18446744073709551615, 18446744073709551615), (0, 18446744073709551615, 18446744073709551615), (0, 1, 1)]
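For reference, the huge value is 2**64 - 1, the maximum value of a 64-bit unsigned integer, which is indeed bigger than sys.maxsize on a 64-bit build:

import sys

print(2**64 - 1)                # 18446744073709551615
print(2**64 - 1 > sys.maxsize)  # True (sys.maxsize is 2**63 - 1 on 64-bit builds)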

BurntSushi commented 1 year ago

Yeah it looks like you might be right that it's an issue with this wrapper. I can at least confirm that the underlying Rust library is doing the right thing here:

fn main() -> anyhow::Result<()> {
    let ac = aho_corasick::AhoCorasick::new([""]).unwrap();
    let hay = "”";
    for m in ac.find_iter(hay) {
        dbg!(m);
    }
    Ok(())
}

And running it:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/i83`
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 0..0,
}
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 1..1,
}
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 2..2,
}
[main.rs:5] m = Match {
    pattern: PatternID(
        0,
    ),
    span: 3..3,
}

The empty string is somewhat of a pathological case that usually requires a special code path.

nijel commented 1 year ago

I think the problem is that matching is byte-wise, so it is trying to map the matched byte offsets [0, 1, 2, 3] into a string of length 1. This really cannot happen with non-empty patterns, thanks to the way UTF-8 is encoded.
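A quick sketch of the offsets involved (plain Python, independent of the wrapper):

# "”" is a single codepoint encoded as three UTF-8 bytes, so the only
# byte offsets that fall on a codepoint boundary are 0 and 3.
hay = "”"
print(len(hay))                  # 1 codepoint
print(len(hay.encode("utf-8")))  # 3 bytes
# The empty pattern matches at byte offsets 0, 1, 2 and 3, but offsets
# 1 and 2 land inside the character and have no codepoint index to map to.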

itamarst commented 1 year ago

Thanks for the bug report and reproducer, I will try to get this fixed soon.

itamarst commented 1 year ago

I think I will just filter out empty patterns; they're not a useful thing to match on. That will make the storing-patterns code path more annoying, though.

itamarst commented 1 year ago

But... filtering them out will distort the results, so I guess I won't.
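Presumably the distortion is that dropping patterns at construction time shifts the pattern indexes seen by the caller. A sketch of the problem (the pattern list is made up for illustration):

import ahocorasick_rs

patterns = ["", "foo", "bar"]
filtered = [p for p in patterns if p]  # silently drops index 0
ac = ahocorasick_rs.AhoCorasick(filtered)

# A match on "bar" is now reported with pattern index 1, but in the
# caller's original list patterns[1] is "foo": the indexes no longer
# line up unless the wrapper also keeps a mapping back to the original
# positions.
print(ac.find_matches_as_indexes("bar"))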

BurntSushi commented 1 year ago

One possible thing you could do: when searching a Unicode string, remove all zero-length matches that split a codepoint. You should probably leave them in when searching a byte string, but matches that split a codepoint when searching a Unicode string are a little weird.

(FWIW, this is what the Rust regex crate does. When searching a &str, empty matches that split a codepoint are never returned. But they are allowed when searching a &[u8].)
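A rough sketch of that filtering idea in Python terms (the real check would live on the Rust side of the wrapper; the helper below, and its assumption that matches carry raw byte offsets, are purely illustrative):

def drop_codepoint_splitting_matches(haystack, matches):
    # Byte offsets that fall on codepoint boundaries, including the
    # offset just past the last character.
    boundaries = {0}
    offset = 0
    for ch in haystack:
        offset += len(ch.encode("utf-8"))
        boundaries.add(offset)
    # Keep only matches whose (assumed byte) start/end offsets line up
    # with a boundary; zero-length matches inside a multi-byte
    # character are dropped.
    return [
        (pattern, start, end)
        for (pattern, start, end) in matches
        if start in boundaries and end in boundaries
    ]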

itamarst commented 1 year ago

Is there any use case for zero-length matches? It just feels like GIGO, a mistake on the user's part when they inserted this sort of pattern.

itamarst commented 1 year ago

(As a side note, this is plausibly what caused the panic in #51.)

BurntSushi commented 1 year ago

For regex? Definitely. For multi-substring matching, it's hard to articulate one off the top of my head. Historically speaking, all single and multi substring search algorithms handle the empty substring correctly, and the production implementations I'm aware of do as well. So by choosing not to handle it, you might wind up surprising users.

By choosing not to handle it, you might also in turn require callers to handle it themselves. For example, callers might feed user-generated patterns into an Aho-Corasick searcher. If your library doesn't handle the case of the empty pattern, then it follows that the caller likely needs to handle it somehow. It's a non-obvious and typically pathological case, so it's a fair bet that callers will not know to handle it. From that perspective, it is IMO better to just handle the empty pattern in the library.

Aside from the Unicode issue of how to deal with a byte offset splitting a codepoint, it has a straightforward semantic: it should match at every position. Finally, the underlying library you're wrapping explicitly handles the empty pattern, so you might as well do it too. :-)

It is indeed plausible that providing an empty pattern is a mistake. But that mistake can be easily observed when the search ends up matching at every position.
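For what it's worth, Python's own re module gives the empty pattern the same "match at every position" semantics, which makes a stray empty pattern fairly easy to notice:

import re

# An empty pattern matches at every position, including the end of the
# haystack.
print([m.span() for m in re.finditer("", "abc")])
# [(0, 0), (1, 1), (2, 2), (3, 3)]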

itamarst commented 1 year ago

Hm. My current prototype handles it by filtering out all 0-length matches. This doesn't seem like a bad way to handle it but it is inconsistent. I guess I can compare to the "is this invalid UTF-8 boundary" case and see how I feel about it.

nijel commented 1 year ago

I consider this a bug in our code and have fixed it there. There is nothing useful we could do with zero-length matches. Having the constructor fail with a ValueError on such a pattern would work well, as it would make it clear that this is wrong. If empty matches are expected to work, filtering out positions inside UTF-8 sequences might be needed.

itamarst commented 1 year ago

I think I'll go with the ValueError, yeah.
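A caller-side sketch of that guard (build_automaton is a hypothetical helper; whether the wrapper's constructor raises the ValueError itself is exactly what is being decided here):

import ahocorasick_rs

def build_automaton(patterns):
    # Hypothetical helper: reject empty patterns up front so the
    # mistake surfaces at construction time instead of as odd matches.
    if any(p == "" for p in patterns):
        raise ValueError("empty patterns are not supported")
    return ahocorasick_rs.AhoCorasick(patterns)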

itamarst commented 1 year ago

My reasoning, for posterity. Given kinda bogus inputs, the options are:

- Reject empty patterns up front by raising a ValueError from the constructor.
- Filter out all zero-length matches from the results.
- Filter out only the zero-length matches that would split a codepoint.
- Return all matches as-is and leave handling them to the caller.

The first option seemed to have the best tradeoffs in terms of where the performance hit is taken and long-term user success.

BurntSushi commented 1 year ago

I don't understand the performance impacts here. You should only ever need to do anything when a pattern and/or match is zero length. Checking for zero length should be nearly free relative to the work being done to produce the match in the first place. If it's zero length, then the overhead of filtering them out should be acceptable IMO because there is already a lot of overhead if you're producing a match at every position.

I think it's bad juju for any multi or single substring implementation to just reject the empty pattern altogether. In my experience this becomes an annoying limitation that needs to be worked around. I'm thinking specifically about the case where one accepts user supplied patterns.

I also disagree that allowing an empty pattern masks bugs in user code. If you supply an empty pattern you should get a match at every position. It's not clear to me how that's masking a bug.

itamarst commented 1 year ago

Filtering out all of them is cheap, yeah. Filtering out the ones on bad unicode boundaries requires coming up with that mapping in some cases where currently we don't need to calculate it. Again, probably pretty cheap, but a bit more work.

Mostly this is about user intentions: why would you ever want to match an empty string?

itamarst commented 1 year ago

If you, the author of a tool, have a user who gave you an empty string, yes, you have to handle it, but did your user intend to give an empty string? It's probably not meaningful to them either. So you, the tool author, have to handle it, but the upstream user then gets more meaningful results (or perhaps they had a bug).

BurntSushi commented 1 year ago

Filtering out the ones on bad unicode boundaries requires coming up with that mapping in some cases where currently we don't need to calculate it. Again, probably pretty cheap, but a bit more work.

The key here is that you should only ever have to do this when there's a pattern of zero length. Otherwise a zero length match is impossible.

If you, the author of a tool, have a user who gave you an empty string, yes, you have to handle it, but did your user intend to give an empty string? It's probably not meaningful to them either.

They might have, yes. It's a not-uncommon idiom to do something like grep -e foo -e bar -e '' to show the entire contents of a file with matches of foo and bar highlighted. The empty string at the end ensures that every line matches, but not necessarily every line shows a highlighted match.

Basically, an empty pattern in the context of, say, a Python or Rust program is pretty weird and it is difficult for me to articulate a use case. (Although I would argue that consistency with other multi- and single-substring search implementations should move the needle here in favor of making it work.) But once you move to asking an end user for patterns, then use cases tend to pop up because the end user has a limited interface to work with. Things like an empty pattern can provoke useful behavior in some cases.

So you, the tool author, have to handle it, but the upstream user then gets more meaningful results (or perhaps they had a bug).

The problem is that they may have a bug. But your change is proposing that it is definitely a bug.

Tool authors are unlikely to handle this case up-front. Pretty much everyone forgets about the empty pattern. I did too. I went through exactly the same dance you're going through. It was just a long time ago.

itamarst commented 1 year ago

It's a not-uncommon idiom to do something like grep -e foo -e bar -e '' to show the entire contents of a file with matches of foo and bar highlighted. The empty string at the end ensures that every line matches, but not necessarily every line shows a highlighted match.

Doesn't this suggest grep should have a --show-all-lines option?

BurntSushi commented 1 year ago

grep is just an example. I don't want to argue about what grep itself should or shouldn't support. The point is that this is what happens when patterns become the user interface: users are limited in what they can express, and the empty pattern is a useful means of expressing certain idioms without expanding the user interface. Expanding the user interface may be desirable in some cases and not in others. It really just depends. We can't legislate everything here.

I think I'm about spent on this issue. I think I've made my case as well as I can.

itamarst commented 1 year ago

I'm going to stick with the current plan for now; I can always remove the restriction if someone has a sufficiently compelling use case. I do appreciate the feedback, @BurntSushi!