fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

Fluentd stuck/hangs because of infinite regexp (99.9%); please improve detection/validation #2464

Closed: kvborodin closed this issue 5 years ago

kvborodin commented 5 years ago

First of all, I'd like to thank all the developers who made this piece of software work for at least 23 hours in production without any issues. My issue happened after 24 hours.

Describe the bug

I'm using the 'multi_format' plugin inside a 'tail' source, which was written by @repeatedly.

My "parse".."/parse" code inside 'tail' plugin looks like this structure, I have 10-15 various expressions to catch my multi-format logs:

    # PATTERN: pq_rules_log
    <pattern>
      format regexp
      expression /(?<sql1>\s*(INSERT|REPLACE)\s+(DELAYED\s*|IGNORE\s*)?INTO)\s+(?<db>[^\s]+?)\.(?<table>.+?)[(\s]((.*)(?<decider>filter_from)(.*))VALUES\s*\(FROM_UNIXTIME\((?<logged>\d+)\),\s\'(?<api_key>[^,]+)\',\s?(?<api_key_id>[^,]+),\s\'(?<page_id>[^,]+)\',\s\'(?<action>[^,]+)\',\s\'(?<query>.*)\',\s\'(?<tag>[^,]*)\',\s\'(?<query_id>[^']*)\',\s\'(?<api_section>[^,]+)\',\s\'(?<filter_from>[^,]+)\',\s\'(?<rule_type>[^,]*)\'/im
    </pattern>

Every day in production I get a stuck/hung fluentd process which consumes 100% CPU. But it's not just about the high CPU usage: fluentd also stops consuming/tailing my logs at the same time, which is horrible for my 200 production bare-metal servers. I have only 2 log files at a time, just 2 files which I want this software to consume:

    <source>
      @type tail
      @log_level trace
      read_lines_limit 50
      enable_stat_watcher false
      refresh_interval 600
      path /boardreader/log/performance/log_*
      pos_file /boardreader/log/fluentd_all.log.pos
    </source>

When Fluentd hangs/gets stuck, I am not able to get a report by sending the SIGCONT signal to the PID. strace shows me the same data whether I trace a well-working fluentd or a stuck one:

futex(0x7f6e116274b4, FUTEX_WAIT_BITSET_PRIVATE, 346209, {28652557, 138109776}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f6e11627528, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f6e116274b4, FUTEX_WAIT_BITSET_PRIVATE, 346211, {28652557, 438309940}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f6e11627528, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f6e116274b4, FUTEX_WAIT_BITSET_PRIVATE, 346213, {28652557, 738578707}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f6e11627528, FUTEX_WAKE_PRIVATE, 1) = 0

But pstack shows me interesting information every time, even after I stop/start fluentd. Look at frames #0 and #1:

Thread 3 (Thread 0x7f45513ff700 (LWP 20109)):
#0  0x00007f4567024a63 in match_at (reg=reg@entry=0x7f455a3a6300, str=str@entry=0x7f454e2f7b80 "INSERT DELAYED INTO logs.pq_rules_log_190618 (`logged`, `api_key`, `api_key_id`, `page_id`, `action`, `query`, `tag`, `query_id`, `api_section`, `filter_from`, `rule_type`)", ' ' <repeats 17 times>, "VALUES (FRO"..., end=end@entry=0x7f454e325bcb "", sstart=sstart@entry=0x7f454e2f7b80 "INSERT DELAYED INTO logs.pq_rules_log_190618 (`logged`, `api_key`, `api_key_id`, `page_id`, `action`, `query`, `tag`, `query_id`, `api_section`, `filter_from`, `rule_type`)", ' ' <repeats 17 times>, "VALUES (FRO"..., sprev=<optimized out>, msa=msa@entry=0x7f45513fa8d0) at regexec.c:3139
#1  0x00007f4567027b7a in onig_search_gpos (reg=reg@entry=0x7f455a3a6300, str=0x7f454e2f7b80 "INSERT DELAYED INTO logs.pq_rules_log_190618 (`logged`, `api_key`, `api_key_id`, `page_id`, `action`, `query`, `tag`, `query_id`, `api_section`, `filter_from`, `rule_type`)", ' ' <repeats 17 times>, "VALUES (FRO"..., end=0x7f454e325bcb "", global_pos=<optimized out>, start=<optimized out>, range=range@entry=0x7f454e325bcb "", region=region@entry=0x7f45513fa9a0, option=option@entry=0) at regexec.c:4416
#2  0x00007f456702812b in onig_search (reg=reg@entry=0x7f455a3a6300, str=<optimized out>, end=<optimized out>, start=<optimized out>, range=range@entry=0x7f454e325bcb "", region=region@entry=0x7f45513fa9a0, option=option@entry=0) at regexec.c:4145
#3  0x00007f456700e7b2 in rb_reg_search0 (set_backref_str=1, reverse=<optimized out>, pos=0, str=139935792957960, re=139935871612520) at re.c:1531
#4  rb_reg_search (re=re@entry=139935871612520, str=str@entry=139935792957960, pos=pos@entry=0, reverse=reverse@entry=0) at re.c:1588
#5  0x00007f456700ec3f in reg_match_pos (pos=0, strp=0x7f45513faa70, re=139935871612520) at re.c:3022
#6  rb_reg_match_m (argc=<optimized out>, argv=<optimized out>, re=139935871612520) at re.c:3197
#7  0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f455a42f580, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150d9b0, th=0x7f45593ac000) at vm_insnhelper.c:1769
#8  vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150d9b0, calling=<optimized out>, ci=0x7f455a42f580, cc=<optimized out>) at vm_insnhelper.c:1864
#9  0x00007f45670b6229 in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:1066
#10 0x00007f45670ba647 in vm_exec (th=th@entry=0x7f45593ac000) at vm.c:1775
#11 0x00007f45670bb201 in invoke_block (captured=<optimized out>, opt_pc=<optimized out>, type=572653569, cref=0x0, self=139935865021480, iseq=0x7f455a44d238, th=0x7f45593ac000) at vm.c:973
#12 invoke_iseq_block_from_c (th=0x7f45593ac000, captured=<optimized out>, self=139935865021480, argc=<optimized out>, argv=<optimized out>, passed_block_handler=<optimized out>, cref=0x0, splattable=0, is_lambda=0) at vm.c:1019
#13 0x00007f45670bb8bd in invoke_block_from_c_splattable (argc=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>, splattable=<optimized out>, argv=<optimized out>, block_handler=<optimized out>, th=<optimized out>) at vm.c:1037
#14 vm_yield (argc=1, argv=0x7f45513fb168, th=0x7f45593ac000) at vm.c:1079
#15 rb_yield_0 (argv=0x7f45513fb168, argc=1) at vm_eval.c:1010
#16 rb_yield_1 (val=139935871666800) at vm_eval.c:1016
#17 rb_yield (val=<optimized out>) at vm_eval.c:1026
#18 0x00007f4566ed6b5d in rb_ary_each (ary=139935865013440) at array.c:1824
#19 0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f4559316320, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150da40, th=0x7f45593ac000) at vm_insnhelper.c:1769
#20 vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150da40, calling=<optimized out>, ci=0x7f4559316320, cc=<optimized out>) at vm_insnhelper.c:1864
#21 0x00007f45670b6ece in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:967
#22 0x00007f45670ba647 in vm_exec (th=th@entry=0x7f45593ac000) at vm.c:1775
#23 0x00007f45670bb201 in invoke_block (captured=<optimized out>, opt_pc=<optimized out>, type=572653569, cref=0x0, self=139935844059400, iseq=0x7f455a4df7f0, th=0x7f45593ac000) at vm.c:973
#24 invoke_iseq_block_from_c (th=0x7f45593ac000, captured=<optimized out>, self=139935844059400, argc=<optimized out>, argv=<optimized out>, passed_block_handler=<optimized out>, cref=0x0, splattable=0, is_lambda=0) at vm.c:1019
#25 0x00007f45670bb8bd in invoke_block_from_c_splattable (argc=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>, splattable=<optimized out>, argv=<optimized out>, block_handler=<optimized out>, th=<optimized out>) at vm.c:1037
#26 vm_yield (argc=1, argv=0x7f45513fb868, th=0x7f45593ac000) at vm.c:1079
#27 rb_yield_0 (argv=0x7f45513fb868, argc=1) at vm_eval.c:1010
#28 rb_yield_1 (val=139935792957960) at vm_eval.c:1016
#29 rb_yield (val=<optimized out>) at vm_eval.c:1026
#30 0x00007f4566ed6b5d in rb_ary_each (ary=139935845464800) at array.c:1824
#31 0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f4559315590, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150db00, th=0x7f45593ac000) at vm_insnhelper.c:1769
#32 vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150db00, calling=<optimized out>, ci=0x7f4559315590, cc=<optimized out>) at vm_insnhelper.c:1864
#33 0x00007f45670b6ece in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:967
#34 0x00007f45670ba647 in vm_exec (th=0x7f45593ac000) at vm.c:1775
#35 0x00007f45670bea4a in vm_call0_body (th=<optimized out>, calling=calling@entry=0x7f45513fbec0, ci=ci@entry=0x7f45513fbeb0, cc=cc@entry=0x7f45513fbee0, argv=<optimized out>) at vm_eval.c:177
#36 0x00007f45670bee70 in vm_call0 (me=<optimized out>, argv=<optimized out>, argc=<optimized out>, id=<optimized out>, recv=<optimized out>, th=<optimized out>) at vm_eval.c:62
#37 rb_vm_call (th=<optimized out>, recv=<optimized out>, id=<optimized out>, argc=<optimized out>, argv=<optimized out>, me=<optimized out>) at vm_eval.c:257
#38 0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f4559910d50, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150db60, th=0x7f45593ac000) at vm_insnhelper.c:1769
#39 vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150db60, calling=<optimized out>, ci=0x7f4559910d50, cc=<optimized out>) at vm_insnhelper.c:1864
#40 0x00007f45670b6229 in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:1066
#41 0x00007f45670ba647 in vm_exec (th=0x7f45593ac000) at vm.c:1775
#42 0x00007f45670bea4a in vm_call0_body (th=<optimized out>, calling=calling@entry=0x7f45513fc530, ci=ci@entry=0x7f45513fc520, cc=cc@entry=0x7f45513fc550, argv=<optimized out>) at vm_eval.c:177
#43 0x00007f45670bee70 in vm_call0 (me=<optimized out>, argv=<optimized out>, argc=<optimized out>, id=<optimized out>, recv=<optimized out>, th=<optimized out>) at vm_eval.c:62

This makes me think Fluentd has no ability to detect runaway (effectively infinite) regexp backtracking, which is handled by https://rubular.com/, for example.

To Reproduce

I can privately send you my config and a log with the unknown log string which makes fluentd hang.

Expected behavior

Warn to the log file and continue processing logs. The warning must include the entire log string, like you do with unmatched strings. There should also be an ability to match these logs with another regexp (i.e. retry with a simpler regexp).

Your Environment

Tested with Fluentd 1.4 and 1.5 (and other versions back to 1.0): same issue. I don't know why the old Ruby 2.4.6 is used in production when it's EOL:

# /opt/td-agent/embedded/bin/ruby -v
ruby 2.4.6p354 (2019-04-01 revision 67394) [x86_64-linux]

Your Configuration

Part of the config was already mentioned above.
Because of an NDA I can't paste my full config here; I can share my config and an example of the logs which cause fluentd to hang with a developer.

Your Error Log

In debug mode: no new logs at all, no new chunks created, etc.
In trace mode: the same.

kvborodin commented 5 years ago

At the same time, Logstash has no issues consuming my 2 log files with the same regex. This makes me think the current v1.x of Fluentd is not production ready at all. I'm going to switch to Logstash for all my projects, guys. You should remove "production-ready" from your official site. Thanks.

repeatedly commented 5 years ago

Hmm... to debug this, we need your actual log for the investigation. Logstash uses a Java-based regexp engine and Ruby (CRuby) uses a different regexp engine, so this seems to cause the different behaviour.

kvborodin commented 5 years ago

Cool! I'll send it to you privately (repeatedly@gmail.com) once I get approval from my PM/TL.

kvborodin commented 5 years ago

What about the SIGCONT signal during the hang? Any thoughts on why it always works but not in my situation?

repeatedly commented 5 years ago

Any thoughts on why it always works but not in my situation?

This is a log content / regexp combination issue, or your regexp hits a bug in Ruby's regexp engine. Each regexp engine has its own algorithm, and specific patterns sometimes cause this kind of issue. For example, Java's regexp has a similar problem: https://stackoverflow.com/questions/37484078/regexp-causes-to-hang-infinitely
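
To illustrate the kind of blow-up being described, here is a toy Ruby snippet; it is an assumed minimal example, not the reporter's actual pattern or data. A nested quantifier forces the engine to retry exponentially many ways of splitting the input before a failing match finally gives up.

    # Toy illustration of catastrophic backtracking in CRuby's regexp engine.
    # The nested quantifier (a+)+ lets the engine split the run of "a"s in
    # exponentially many ways before giving up on the missing trailing "c".
    pattern = /\A(a+)+c\z/

    pattern.match("a" * 10 + "b")   # fails quickly: only ~2^10 splits to try
    # pattern.match("a" * 40 + "b") # left commented out: this single call can
    #                               # spin at 100% CPU for a very long time,
    #                               # much like the match_at frame in the pstack output above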

kvborodin commented 5 years ago

@repeatedly look, Logstash has grok. Grok has timeouts on each regex matching operation, which can be tuned. Logstash also logs these unmatched/timed-out lines to a separate log file for later investigation and for updating the regex that fails during processing. Without this ability you can't develop log-parsing patterns, right? I mean, how can I understand which string from my multi-format log made Fluentd stuck?

I just suggest making improvements and adding timeouts & logging for those users who use regex patterns!

Also, since the problem is Fluentd hanging on runaway regex loops, the improvement I suggest can solve all these problems with hanging/stuck processing of input records!
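
As a rough sketch of the behaviour being requested here (warn and continue instead of hanging), the following assumes Regexp.timeout, which only exists in Ruby 3.2+ and is therefore not available on the Ruby 2.4.6 bundled with td-agent above; the patterns and the parse_line helper are made up for the example and are not Fluentd's implementation.

    # Minimal sketch of a grok-style per-match timeout: budget every match
    # attempt, warn about the offending line, and fall through to the next
    # pattern. Requires Ruby 3.2+ for Regexp.timeout; names are illustrative only.
    Regexp.timeout = 1.0                # seconds allowed for any single match

    PATTERNS = [
      /\A(?<sql>INSERT .*?VALUES)/,     # stand-in for the expensive production pattern
      /\A(?<raw>.+)/                    # trivial fallback so the line is never lost
    ]

    def parse_line(line)
      PATTERNS.each do |re|
        begin
          m = re.match(line)
          return m.named_captures if m
        rescue Regexp::TimeoutError
          warn "regexp timed out, trying next pattern; line: #{line[0, 120]}"
        end
      end
      nil
    end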

kvborodin commented 5 years ago

I was talking about this feature: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#plugins-filters-grok-timeout_millis

Also, please take a look at these features:

They are so helpful!

Thanks!

repeatedly commented 5 years ago

Grok is a collection of regexps, so I assume Logstash also has a similar regexp problem. Ruby's thread model is different from Java's, but I will check whether we can use similar handling for regexps.

repeatedly commented 5 years ago

I wrote a patch for this issue: https://github.com/fluent/fluentd/pull/2513