elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
70.93k stars 24.9k forks source link

Grok lib fails to match pattern when using offset with partial match #95002

Closed luigidellaquila closed 1 year ago

luigidellaquila commented 1 year ago

Grok.match(bytes, offset, length, extracter) API seems to have problems when using a positive offset together with a pattern that matches the input as a suffix (ie. not the beginning, but discarding a prefix).

Eg. with a pattern like %{WORD:a} %{WORD:b} %{NUMBER:c:int} and an input string like x1 a1 b1 12, the pattern is supposed to discard x1 and match the rest of the input.

Everything works fine as long as the string is stored in a byte[] alone, and the offset used is 0. If the input of match() is a bigger array, where the string is in the middle and an offset > 0 is used, then the match fails.

Here is a code snippet that reproduces the problem.

Grok grok = new Grok(GrokBuiltinPatterns.get(true), "%{WORD:a} %{WORD:b} %{NUMBER:c:int}", logger::warn);
byte[] utf8 = "x1 a1 b1 12 x2 a2 b2 13 ".getBytes(StandardCharsets.UTF_8);
// this succeeds
assertThat(captureBytes(grok, utf8, 0, 12), equalTo(Map.of("a", "a1", "b", "b1", "c", 12)));
// this fails, see offset 12
assertThat(captureBytes(grok, utf8, 12, 12), equalTo(Map.of("a", "a2", "b", "b2", "c", 13)));
private Map<String, Object> captureBytes(Grok grok, byte[] utf8, int offset, int length) {
    GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(grok.captureConfig());
    if (grok.match(utf8, offset, length, extracter)) {
        return extracter.result();
    }
    return null;
}

[edit] added the same test in https://github.com/elastic/elasticsearch/pull/95003

Checking the library code, the problem could depend on the usage of Matcher.search(), where we are passing matcher.search(offset, length, Option.DEFAULT) Since for Regex.matcher() we are passing offset and offset + length (instead of offset, length), it's possible that also matcher.search() should accept offset + length.

Joni API documentation is very basic and it's absolutely not clear about this specific API and the problem never happened at higher level because we are always using that API with offset = 0.

The fix proposed above solves the problem and passes all the tests, but it would be good to have a confirmation from someone with more expertise on this lib.

PR following with the proposed fix

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)