fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Long Regex Fails To Parse #1104

Closed davidmccormick closed 5 years ago

davidmccormick commented 5 years ago

Bug Report

Describe the bug I'm getting an error message when loading a parser with a long regular expression:

Fluent Bit v1.0.4
Copyright (C) Treasure Data

[2019/02/12 17:04:50] [error] [parser:tomcat_access] Invalid regex pattern ^(domain=(?<domain>\S+))?\s*\[(?<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s+\+\d+)\]\s+(domain="?(?<domain>\S+)"?\s*|SiteSpectEngine="?(?<SiteSpectEngine>[^\s"]+)"?|X-Forwarded-Host="?(?<X-Forwarded-Host>[^\s"]+)"?|X-Forwarded-Server="?(?<X-Forwarded-Server>[^\s"]+)"?|X-NS-Forwarded-Server="?(?<X-NS-Forwarded-Server>[^\s"]+)"?|ajax="?(?<ajax>[^\s"]+)"?|akamai_reqID="?(?<akamai_reqID>[^\s"]+)"?|bytes_sent="?(?<bytes_sent>[^\s"]+)"?|client_id="?(?<client_id>[^\s"]+)"?|duration_ms="?(?<duration_ms>[^\s"]+)"?|duration_to_commit="?(?<duration_to_commit>[^\s"]+)"?|edgescape="?(?<edgescape>[^\s"]+)"?|guid="?(?<guid>[^\s"]+)"?|http_method="?(?<http_method>[^\s"]+)"?|native_app="?(?<native_app>[^\s"]+)"?|redirect="?(?<redirect>[^\s"]+)"?|remote_host="?(?<remote_host>[^\s"]+)"?|req_guid="?(?<req_guid>[^\s"]+)"?|rid="?(?<rid>[^\s"]+)"?|sessid="?(?<sessid>[^\s"]+)"?|statuscode="?(?<statuscode>[^\s"]+)"?|thread="?(?<thread>[^\s"]+)"?|true_clie
[2019/02/12 17:04:50] [ info] [storage] initializing...
[2019/02/12 17:04:50] [ info] [storage] in-memory
[2019/02/12 17:04:50] [ info] [storage] normal synchronization mode, checksum disabled
[2019/02/12 17:04:50] [ info] [engine] started (pid=1)
[2019/02/12 17:04:50] [error] [in_tail] parser 'tomcat_access' is not registered
[2019/02/12 17:04:50] [error] [in_tail] parser 'hcom_common_log' is not registered
[2019/02/12 17:04:50] [error] [filter_parser] requested parser 'locale' not found
[2019/02/12 17:04:50] [error] [filter_parser] Invalid "parser"

[2019/02/12 17:04:50] [error] Failed initialize filter parser.0
[2019/02/12 17:04:50] [error] [filter_parser] requested parser 'hcom_pos' not found
[2019/02/12 17:04:50] [error] [filter_parser] Invalid "parser"

[2019/02/12 17:04:50] [error] Failed initialize filter parser.1
[2019/02/12 17:04:50] [error] [filter_parser] requested parser 'message' not found
[2019/02/12 17:04:50] [error] [filter_parser] Invalid "parser"

[2019/02/12 17:04:50] [error] Failed initialize filter parser.2
[2019/02/12 17:04:50] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2019/02/12 17:04:50] [ info] [filter_kube] local POD info OK

It looks as though the original expression has been truncated, causing the syntax error. When I shorten the regex, the compile error goes away (unless I have simply removed the part it does not like).

To Reproduce

domain=localhost:8080 [05/Feb/2019:13:20:12 +0000] remote_host=- ajax=- http_method=GET url=/version.txt redirect=- statuscode=200 duration_ms=278 bytes_sent=373 referer=- user_agent=Go-http-client/1.1 sessid=- edgescape=- guid=- req_guid=ShoppingApp-SA.2019.1.feature_SHP_46422_Populate_requests_to_PROS.8;fc8f511b-f71a-4c95-9bb2-b417c5d9bde4;32 nativeApp=- X-Forwarded-Host=- X-Forwarded-Server=- X-NS-Forwarded-Server=- SiteSpectEngine=- akamai_reqID=-

Using a config: -

    [INPUT]
        Name              tail
        Tag               tomcat_access.*
        Path              /logs/tomcat_access.log
        Parser            tomcat_access

With the parser: -

    [PARSER]
        Name         tomcat_access
        Format       regex
        Regex        ^(domain=(?<domain1>\S+))?\s*\[(?<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s+\+\d+)\]\s+(domain="?(?<domain2>\S+)"?\s*|SiteSpectEngine="?(?<SiteSpectEngine>[^\s"]+)"?|X-Forwarded-Host="?(?<X-Forwarded-Host>[^\s"]+)"?|X-Forwarded-Server="?(?<X-Forwarded-Server>[^\s"]+)"?|X-NS-Forwarded-Server="?(?<X-NS-Forwarded-Server>[^\s"]+)"?|ajax="?(?<ajax>[^\s"]+)"?|akamai_reqID="?(?<akamai_reqID>[^\s"]+)"?|bytes_sent="?(?<bytes_sent>[^\s"]+)"?|client_id="?(?<client_id>[^\s"]+)"?|duration_ms="?(?<duration_ms>[^\s"]+)"?|duration_to_commit="?(?<duration_to_commit>[^\s"]+)"?|edgescape="?(?<edgescape>[^\s"]+)"?|guid="?(?<guid>[^\s"]+)"?|http_method="?(?<http_method>[^\s"]+)"?|native_app="?(?<native_app>[^\s"]+)"?|redirect="?(?<redirect>[^\s"]+)"?|remote_host="?(?<remote_host>[^\s"]+)"?|req_guid="?(?<req_guid>[^\s"]+)"?|rid="?(?<rid>[^\s"]+)"?|sessid="?(?<sessid>[^\s"]+)"?|statuscode="?(?<statuscode>[^\s"]+)"?|thread="?(?<thread>[^\s"]+)"?|true_client_ip="?(?<true_client_ip>[^\s"]+)"?|url="?(?<url>[^\s"]+)"?|user_agent="?(?<user_agent>[^\s"]+)"?|\s+|\S+="?\S+"?)+
        Time_Key     datetime
        Time_Format  %d/%b/%Y:%H:%M:%S

With example data in /logs/tomcat_access.log

Expected behavior Regex will compile and log elements get correctly parsed into their constituent fields.

Additional context We are using fluentbit for collecting kubernetes logs, systemd, container logs and as a side-car for applications to capture file based logs. We forward all of the logs to a fluent-bit instance running as a forwarder with the splunk output plugin.

This issue affects log collection for our tomcat applications. We have a lot of them, each logging in a slightly different format, so we want a single flexible regex that can account for these differences.

davidmccormick commented 5 years ago

Hmm, looking at the code and adding some more logging, it looks like the logging function is truncating the line, not the regex pattern itself. So the issue appears to be that rubular.com is happy with the regex but fluent-bit is not.

davidmccormick commented 5 years ago

More testing has shown that it is the dashes in some of the capture group names (for example X-Forwarded-Host) that fluent-bit does not like!
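For anyone hitting the same error, here is a minimal sketch of a pre-flight check that lists capture group names containing dashes or other unusual characters. It is a standalone Python helper, not part of fluent-bit; the function name `suspicious_group_names` and the "safe" character set (ASCII letters, digits, underscore) are assumptions for illustration. A later comment in this thread reports trouble with _ and @ as well, so tighten the set if that matches what you see.

```python
import re
from typing import List

# Matches the opening of a named group, (?<name>, while skipping the
# lookbehind forms (?<= and (?<!.
NAMED_GROUP = re.compile(r"\(\?<([^=!>][^>]*)>")

# Assumption: only ASCII letters, digits and underscores count as safe.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def suspicious_group_names(pattern: str) -> List[str]:
    """Return the group names that contain characters outside the safe set."""
    names = NAMED_GROUP.findall(pattern)
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


# A shortened piece of the parser regex from this report.
pattern = r'ajax="?(?<ajax>[^\s"]+)"?|X-Forwarded-Host="?(?<X-Forwarded-Host>[^\s"]+)"?'
print(suspicious_group_names(pattern))
```

The scan is purely textual and does not compile the regex, so it only tells you which names to look at, not whether the pattern is otherwise valid.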

fyankee commented 2 years ago

Is there a solution for this bug? I found that I cannot collect nginx logs with the pattern below: ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$

Afsalmc commented 1 year ago

@davidmccormick Just wow! Removing the dashes ( - ) from the field names in my regex solved my problems.

For example, change ^(?<remote>[^ ]*) (?<host-name>[^ ]*) to ^(?<remote>[^ ]*) (?<hostName>[^ ]*)

Just remove the dashes ( - ). That's it. People like @davidmccormick make this community awesome. Thanks a lot!
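If the regex has many groups, the renaming can be scripted. The snippet below is a rough sketch in Python, assuming a camelCase naming scheme like the host-name to hostName example; the helper names `camel_case` and `strip_dashes_from_group_names` are made up for illustration. It rewrites only the group names, so literal text in the pattern such as X-Forwarded-Host= is left alone.

```python
import re

# Matches the opening of a named group, (?<name>, while skipping the
# lookbehind forms (?<= and (?<!.
NAMED_GROUP = re.compile(r"\(\?<([^=!>][^>]*)>")


def camel_case(name: str) -> str:
    """Turn 'host-name' into 'hostName' by dropping the dashes."""
    head, *rest = name.split("-")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def strip_dashes_from_group_names(pattern: str) -> str:
    """Rewrite every (?<some-name> opener so the name has no dashes."""
    return NAMED_GROUP.sub(
        lambda m: "(?<" + camel_case(m.group(1)) + ">", pattern
    )


before = r"^(?<remote>[^ ]*) (?<host-name>[^ ]*)"
print(strip_dashes_from_group_names(before))
```

Anything else in the config that refers to a renamed field (a Time_Key, for instance) would have to be updated by hand.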

DesireWithin commented 1 year ago

I found that _ and @ also cause the same issue.

Michael-S commented 9 months ago

I’m glad I found this! It saved me a lot of trouble. (edit: though I see a warning is listed at the bottom of https://docs.fluentbit.io/manual/pipeline/parsers/regular-expression )