multiline parsing does not work when parser is used

pmeier commented 3 months ago

Bug Report

The documentation of the filter_multiline plugin strongly recommends using the built-in multiline support of the input_tail plugin whenever that input is used. This is done with the multiline.parser option. Here is how this looks for a reduced version of the official example:

configuration files

#### `fluent-bit.conf`

```
[SERVICE]
    flush        1
    log_level    info
    parsers_file parsers_multiline.conf

[INPUT]
    name             tail
    path             test.log
    read_from_head   true
    multiline.parser multiline-regex-test

[OUTPUT]
    name  stdout
    match *
```

#### `parsers_multiline.conf`

```
[MULTILINE_PARSER]
    name          multiline-regex-test
    type          regex
    flush_timeout 1000
    # Regex rules for multiline parsing
    # ---------------------------------
    #
    # configuration hints:
    #
    #  - first state always has the name: start_state
    #  - every field in the rule must be inside double quotes
    #
    # rules   | state name     | regex pattern                  | next state name
    # --------|----------------|--------------------------------------------------
    rule      "start_state"   "/(Dec \d+ \d+\:\d+\:\d+)(.*)/"  "cont"
    rule      "cont"          "/^\s+at.*/"                     "cont"
```

`test.log`

single line...
Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!
    at com.myproject.module.MyProject.badMethod(MyProject.java:22)
    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
    at com.myproject.module.MyProject.someMethod(MyProject.java:10)
    at com.myproject.module.MyProject.main(MyProject.java:6)
another line...

output

[0] tail.0: [[1721287650.926202485, {}], {"log"=>"single line...
"}]
[0] tail.0: [[1721287650.926237461, {}], {"log"=>"Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!
    at com.myproject.module.MyProject.badMethod(MyProject.java:22)
    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
    at com.myproject.module.MyProject.someMethod(MyProject.java:10)
    at com.myproject.module.MyProject.main(MyProject.java:6)
"}]

The multiline parsing works fine here, although the last log line (another line...) is swallowed (see #8623). I'm also wondering why both records carry the same index, i.e. [0].
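For reference, the grouping that the two rules are meant to produce on test.log can be modelled roughly as follows. This is a simplified sketch of a two-state rule set, not Fluent Bit's implementation: the function name `group_multiline` is made up for illustration, Python's `re` stands in for the regex engine Fluent Bit uses, and flush timeouts are ignored. Under this model the input yields three records, including the trailing "another line...".

```python
import re

# The two rules from parsers_multiline.conf, transcribed as Python regexes.
# Assumption: unanchored search semantics, as in the config above.
START = re.compile(r"(Dec \d+ \d+\:\d+\:\d+)(.*)")  # rule "start_state"
CONT = re.compile(r"^\s+at.*")                      # rule "cont"


def group_multiline(lines):
    """Group raw lines into records using a simplified two-state model."""
    groups = []
    current = None  # lines of the multiline record being built, if any
    for line in lines:
        # In the "cont" state, matching lines are appended to the open record.
        if current is not None and CONT.search(line):
            current.append(line)
            continue
        # Any other line closes the open record.
        if current is not None:
            groups.append("".join(current))
            current = None
        # A line matching start_state opens a new record;
        # everything else passes through as a single-line record.
        if START.search(line):
            current = [line]
        else:
            groups.append(line)
    # Flush a record that is still open at end of input.
    if current is not None:
        groups.append("".join(current))
    return groups


if __name__ == "__main__":
    sample = [
        "single line...\n",
        'Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!\n',
        "    at com.myproject.module.MyProject.badMethod(MyProject.java:22)\n",
        "    at com.myproject.module.MyProject.main(MyProject.java:6)\n",
        "another line...\n",
    ]
    for record in group_multiline(sample):
        print(repr(record))
```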

It breaks down if a parser needs to run before the multiline parser is applied. Per the documentation, this should be configured with the parser and key_content options on the multiline parser itself.

configuration files

#### `fluent-bit.conf`

```
[SERVICE]
    flush        1
    log_level    info
    parsers_file parsers_multiline.conf

[INPUT]
    name             tail
    path             test_docker.log
    read_from_head   true
    multiline.parser multiline-regex-test

[OUTPUT]
    name  stdout
    match *
```

#### `parsers_multiline.conf`

```
[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%LZ

[MULTILINE_PARSER]
    name          multiline-regex-test
    type          regex
    flush_timeout 1000
    parser        docker
    key_content   log
    # Regex rules for multiline parsing
    # ---------------------------------
    #
    # configuration hints:
    #
    #  - first state always has the name: start_state
    #  - every field in the rule must be inside double quotes
    #
    # rules   | state name     | regex pattern                  | next state name
    # --------|----------------|--------------------------------------------------
    rule      "start_state"   "/(Dec \d+ \d+\:\d+\:\d+)(.*)/"  "cont"
    rule      "cont"          "/^\s+at.*/"                     "cont"
```

I'm using the documented example for parsing Docker logs and just wrapped the individual lines of test.log in the Docker log format:

`test_docker.log`

{"log": "single line...\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962740Z"}
{"log": "Dec 14 06:41:08 Exception in thread \"main\" java.lang.RuntimeException: Something has gone wrong, aborting!\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962777Z"}
{"log": "    at com.myproject.module.MyProject.badMethod(MyProject.java:22)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962788Z"}
{"log": "    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962795Z"}
{"log": "    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962802Z"}
{"log": "    at com.myproject.module.MyProject.someMethod(MyProject.java:10)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962808Z"}
{"log": "    at com.myproject.module.MyProject.main(MyProject.java:6)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962814Z"}
{"log": "another line...", "stream": "stdout", "time": "2024-07-17T14:24:00.962825Z"}

output

[0] tail.0: [[1721288427.742631829, {}], {"log"=>"{"log": "single line...\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962740Z"}"}]
[1] tail.0: [[1721226240.962777000, {}], {"log"=>"Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!
    at com.myproject.module.MyProject.badMethod(MyProject.java:22)
    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
    at com.myproject.module.MyProject.someMethod(MyProject.java:10)
    at com.myproject.module.MyProject.main(MyProject.java:6)
", "stream"=>"stdout"}]
[2] tail.0: [[1721226240.962777000, {}], {"log"=>"{"log": "another line...", "stream": "stdout", "time": "2024-07-17T14:24:00.962825Z"}"}]

So the multiline parsing still works, but for some reason the single lines have the whole input record nested under the "log" key.

Curiously, if I just put the parser in the input_tail plugin and insert a filter_multiline plugin, everything works fine:

configuration files

#### `fluent-bit.conf`

```
[SERVICE]
    flush        1
    log_level    info
    parsers_file parsers_multiline.conf

[INPUT]
    name           tail
    path           test_docker.log
    read_from_head true
    parser         docker

[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      multiline-regex-test

[OUTPUT]
    name  stdout
    match *
```

#### `parsers_multiline.conf`

```
[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%LZ

[MULTILINE_PARSER]
    name multiline-regex-test
    type regex
    # Regex rules for multiline parsing
    # ---------------------------------
    #
    # configuration hints:
    #
    #  - first state always has the name: start_state
    #  - every field in the rule must be inside double quotes
    #
    # rules   | state name     | regex pattern                  | next state name
    # --------|----------------|--------------------------------------------------
    rule      "start_state"   "/(Dec \d+ \d+\:\d+\:\d+)(.*)/"  "cont"
    rule      "cont"          "/^\s+at.*/"                     "cont"
```

`test_docker.log`

{"log": "single line...\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962740Z"}
{"log": "Dec 14 06:41:08 Exception in thread \"main\" java.lang.RuntimeException: Something has gone wrong, aborting!\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962777Z"}
{"log": "    at com.myproject.module.MyProject.badMethod(MyProject.java:22)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962788Z"}
{"log": "    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962795Z"}
{"log": "    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962802Z"}
{"log": "    at com.myproject.module.MyProject.someMethod(MyProject.java:10)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962808Z"}
{"log": "    at com.myproject.module.MyProject.main(MyProject.java:6)\n", "stream": "stdout", "time": "2024-07-17T14:24:00.962814Z"}
{"log": "another line...", "stream": "stdout", "time": "2024-07-17T14:24:00.962825Z"}

output

[0] tail.0: [[1721226240.962740000, {}], {"log"=>"single line...
", "stream"=>"stdout"}]
[1] tail.0: [[1721226240.962777000, {}], {"log"=>"Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!
    at com.myproject.module.MyProject.badMethod(MyProject.java:22)
    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
    at com.myproject.module.MyProject.someMethod(MyProject.java:10)
    at com.myproject.module.MyProject.main(MyProject.java:6)
", "stream"=>"stdout"}]
[2] tail.0: [[1721226240.962825000, {}], {"log"=>"another line...", "stream"=>"stdout"}]

Expected behavior

Setting a multiline parser that has the parser option on the input_tail plugin should behave exactly like setting only a parser on the input and applying a filter_multiline plugin afterwards.

Your Environment

Version used: 3.0.6 / 3.1.0
Configuration: see above
Environment name and version (e.g. Kubernetes? What version?): Kubernetes
Operating System and version: Arch / Ubuntu
Filters and plugins: see above

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

pmeier commented 1 week ago

Still relevant.
