Closed nijel closed 1 year ago
Can you provide a small Python program that reproduces the problem?
Okay, it is caused by a blank string (looks like a bug on our side) and some unicode characters.
Minimal reproducer:
import ahocorasick_rs
automaton = ahocorasick_rs.AhoCorasick([""])
print(automaton.find_matches_as_indexes('”'))
It prints:
[(0, 0, 0), (0, 18446744073709551615, 18446744073709551615), (0, 18446744073709551615, 18446744073709551615), (0, 1, 1)]
Yeah it looks like you might be right that it's an issue with this wrapper. I can at least confirm that the underlying Rust library is doing the right thing here:
fn main() -> anyhow::Result<()> {
let ac = aho_corasick::AhoCorasick::new([""]).unwrap();
let hay = "”";
for m in ac.find_iter(hay) {
dbg!(m);
}
Ok(())
}
And running it:
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.00s
Running `target/debug/i83`
[main.rs:5] m = Match {
pattern: PatternID(
0,
),
span: 0..0,
}
[main.rs:5] m = Match {
pattern: PatternID(
0,
),
span: 1..1,
}
[main.rs:5] m = Match {
pattern: PatternID(
0,
),
span: 2..2,
}
[main.rs:5] m = Match {
pattern: PatternID(
0,
),
span: 3..3,
}
The empty string is somewhat of a pathological case that usually requires a special code path.
I think the problem is that matching is byte wise, so it is trying to map matched [0, 1, 2, 3] into a string of length 1. This really cannot happen with a non-blank strings, thanks to the way UTF-8 is encoded.
Thanks for the bug report and reproducer, I will try to get this fixed soon.
I think I will just filter out empty patterns, it's not a useful thing to match on. That will make the storing-patterns code path more annoying, though.
But... filtering them out will distort the results, so guess I won't.
One possible thing you could do: when searching a Unicode string, remove all zero-length matches that split a codepoint. You should probably leave them in when searching a byte string, but matches that split a codepoint when searching a Unicode string are a little weird.
(FWIW, this is what the Rust regex crate does. When searching a &str
, empty matches that split a codepoint are never returned. But they are allowed when searching a &[u8]
.)
Is there any use case for zero-length matches? It just feels like GIGO, a mistake on the user's part when they inserted this sort of pattern.
(As side note, this is plausibly what caused the panic in #51).
For regex? Definitely. For multi-substring matching, it's hard to articulate one off the top of my head. Historically speaking, all single and multi substring search algorithms handle the empty substring correctly. And production implementations (that I'm aware of) do as well. So by choosing not to handle it, you might wind up surprising users. And by choosing not to handle it, you might in turn require callers to handle it themselves. For example, callers might feed user generated patterns into an Aho-Corasick searcher. If your library doesn't handle the case of the empty pattern, then it follows the caller likely needs to handle it somehow. It's a non-obvious and typically pathological case, so it's a fair bet that callers will not know to handle it. From that perspective, it is IMO better to just handle the empty pattern in the library. Aside from the Unicode issue of how to deal with byte offset splitting a codepoint, it has a straight-forward semantic: it should match at every position. Finally, the underlying library you're wrapping explicitly handles the empty pattern. So you might as well do it too. :-)
It is indeed plausible that providing an empty pattern is a mistake. But that mistake can be easily observed when the search ends up matching at every position.
Hm. My current prototype handles it by filtering out all 0-length matches. This doesn't seem like a bad way to handle it but it is inconsistent. I guess I can compare to the "is this invalid UTF-8 boundary" case and see how I feel about it.
I consider this bug in our code and fixed that. There is nothing useful we could do with zero length matches. Constructor failing with a ValueError on such string would work well as it would make it clear that this is wrong. If empty matches are expected to work, filtering out positions inside utf8 might be needed.
I think I'll go with the ValueError, yeah.
My reasoning for posterity—given kinda bogus inputs, the options are:
The first option seemed to have the best tradeoffs in terms of where performance hit is taken and long-term user success.
I don't understand the performance impacts here. You should only ever need to do anything when a pattern and/or match is zero length. Checking for zero length should be nearly free relative to the work being done to produce the match in the first place. If it's zero length, then the overhead of filtering them out should be acceptable IMO because there is already a lot of overhead if you're producing a match at every position.
I think it's bad juju for any multi or single substring implementation to just reject the empty pattern altogether. In my experience this becomes an annoying limitation that needs to be worked around. I'm thinking specifically about the case where one accepts user supplied patterns.
I also disagree that allowing an empty pattern masks bugs in user code. If you supply an empty pattern you should get a match at every position. It's not clear to me how that's masking a bug.
Filtering out all of them is cheap, yeah. Filtering out the ones on bad unicode boundaries requires coming up with that mapping in some cases where currently we don't need to calculate it. Again, probably pretty cheap, but a bit more work.
Mostly this is about user intentions: why would you ever want to match an empty string?
If you the author of a tool have user gave you an empty string, yes you have to handle it, but did your user intended to give an empty string? Probably not meaningful to them either. So you the tool author have to handle it but the upstream user is now getting more meaningful results (or perhaps they had a bug).
Filtering out the ones on bad unicode boundaries requires coming up with that mapping in some cases where currently we don't need to calculate it. Again, probably pretty cheap, but a bit more work.
The key here is that you should only ever have to do this when there's a pattern of zero length. Otherwise a zero length match is impossible.
If you the author of a tool have user gave you an empty string, yes you have to handle it, but did your user intended to give an empty string? Probably not meaningful to them either.
They might have, yes. It's a not-uncommon idiom to do something like grep -e foo -e bar -e ''
to show the entire contents of a file with matches of foo
and bar
highlighted. The empty string at the end ensures that every line matches, but not necessarily every line shows a highlighted match.
Basically, an empty pattern in the context of, say, a Python or Rust program is pretty weird and it is difficult for me to articulate a use case. (Although I would argue that consistency with other multi and substring search implementations should move the needle here in favor of making it work.) But once you move to asking an end user for patterns, then use cases tend to pop up because the end user has a limited interface to work with. Things like an empty pattern can provoke useful behavior in some cases.
So you the tool author have to handle it but the upstream user is now getting more meaningful results (or perhaps they had a bug).
The problem is that they may have a bug. But your change is proposing that it is definitely a bug.
Tool authors are unlikely to handle this case up-front. Pretty much everyone forgets about the empty pattern. I did too. I went through exactly the same dance you're going through. It was just a long time ago.
It's a not-uncommon idiom to do something like
grep -e foo -e bar -e ''
to show the entire contents of a file with matches offoo
andbar
highlighted. The empty string at the end ensures that every line matches, but not necessarily every line shows a highlighted match.
Doesn't this suggest grep should have a --show-all-lines
option?
grep is example. I don't want to argue about what grep itself should or shouldn't support. The point is that this is what happens when patterns become the user interface: users are limited in what they can express and the empty pattern is a useful means of expressing certain idioms without expanding the user interface. Expanding the user interface may be desirable in some cases and not in others. It really just depends. We can't legislate everything here.
I think I'm about spent on this issue. I think I've made my case as well as I can.
I'm going to stick to current plan for now; can always remove the restriction if someone has a sufficiently compelling use case. I do appreciate the feedback, @BurntSushi!
At https://github.com/WeblateOrg/weblate/issues/10002 we observe find_matches_as_indexes to return bogus values (bigger than
sys.maxsize
). While I'm still trying to understand the issue, I looked at the ahocorasick_rs code and I have one question aboutfind_matches_as_indexes
:https://github.com/G-Research/ahocorasick_rs/blob/7bcfe13711ef81a43cbae1ecff87f7d3847a4744/src/lib.rs#L211C1-L231C6
It uses
as_u64
to convert pattern index, whilebyte_to_code_point
to convert start and end positions. Couldn't this be the place where something could go wrong? I would expect all to be converted as integers.