I have never seen this error, and I don't think fluent-plugin-bigquery itself causes it. If BigQuery itself is unstable, the flush threads of fluent-plugin-bigquery can end up waiting a long time, which may be what causes this error. Could you give me your configuration and plugin version?
fluent-plugin-bigquery v1.0.0 with this commit manually patched in: https://github.com/kaizenplatform/fluent-plugin-bigquery/pull/137/commits/eaec2d13972312297aabdd317b02850b8d1a37cc
With Fluentd v0.14 (td-agent 3).
The failure happens whether using multiple workers or just one. If I remove the customer_id buffer parameter, it runs fine.
I have 5 eventType handlers like these; one streams, the others load. If the flush interval is set higher, the system fails faster. It eventually locks up, and the forwarders see the aggregator (config below) as down.
<match prod.push.eventType>
  @type copy
  <store>
    @type bigquery
    @id eventType
    auth_method application_default
    project project
    dataset dataset
    table eventType_${customer_id}$%Y%m%d
    ignore_unknown_values true
    schema_path /etc/td-agent/eventType.json
    <buffer customer_id, time>
      @type file
      timekey 1d
      flush_interval 10s
      flush_thread_count 4
      total_limit_size 10GB
      timekey_use_utc
    </buffer>
    <secondary bqguid>
      @type file # or forward
      path /var/log/td-agent/buffer/failed/bigquery.buffer.eventType.${customer_id}.*.log
    </secondary>
  </store>
  <store>
    @type flowcounter
    tag prod.metric.eventType
    unit minute
    aggregate all
  </store>
</match>
The 4 load types are identical:
<match prod.push.eventType>
  @type copy
  <store>
    @type bigquery
    @id eventTypeBQ
    auth_method application_default
    method load
    project project
    dataset dataset
    table table_${customer_id}$%Y%m%d
    #fetch_schema true
    schema_path /etc/td-agent/table.json
    max_bad_records 5
    ignore_unknown_values true
    slow_flush_log_threshold 600s
    <buffer customer_id, time>
      @type file
      timekey 1d
      flush_interval 1200s
      flush_at_shutdown true
      flush_thread_count 2
      timekey_use_utc
    </buffer>
    <secondary bqguid>
      @type file # or forward
      path /var/log/td-agent/buffer/failed/bigquery.buffer.table.${customer_id}.*.log
    </secondary>
  </store>
  <store>
    @type flowcounter
    tag prod.metric.eventType
    unit minute
    aggregate all
  </store>
</match>
The forwarder looks like this. The aggregator seizes up and fails regardless of the flush_interval and thread settings.
# Log Forwarding
<match prod.**>
  type forward
  <server>
    host x.x.x.x
    port 24225
    weight 60
  </server>
  <server>
    host x.x.x.y
    port 24225
    weight 60
  </server>
  <server>
    host x.x.x.z
    port 24225
    weight 60
  </server>
  # use udp for heartbeat
  heartbeat_type udp
  flush_interval .1s
  buffer_chunk_limit 999k
  buffer_queue_limit 60240
  # use multi-threading to send buffered data in parallel
  num_threads 32
  expire_dns_cache 600
  # use file buffer to buffer events on disks.
  buffer_type file
  buffer_path /var/log/td-agent/buffer/forward
  # in case buffer becomes full, have local backup
  <secondary>
    type file
    path /var/log/td-agent/buffer/secondary
    compress gzip
  </secondary>
</match>
To provide some more info: I changed messages to carry the customer_id in the tag and to route based on that, so the tag, rather than the message data alone, is used for processing, as sketched below. The issue has since gone away. It seems the deadlock occurs when messages are forwarded undifferentiated and have to be split at the aggregator level.
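For illustration, a minimal sketch of that kind of tag-based routing. The commenter does not say how the tag rewriting was done; using fluent-plugin-rewrite-tag-filter and the exact rule below are my assumptions, as is picking the customer id back out with ${tag[3]}:

# Assumption: lift customer_id out of the record into the tag,
# e.g. prod.push.eventType -> prod.push.eventType.1234
<match prod.push.eventType>
  @type rewrite_tag_filter
  <rule>
    key customer_id
    pattern /^(.+)$/
    tag ${tag}.$1
  </rule>
</match>

# Route and chunk by tag instead of by a record field
<match prod.push.eventType.*>
  @type bigquery
  auth_method application_default
  project project
  dataset dataset
  # tag parts are 0-indexed, so ${tag[3]} is the customer id
  table eventType_${tag[3]}$%Y%m%d
  schema_path /etc/td-agent/eventType.json
  <buffer tag, time>
    @type file
    timekey 1d
  </buffer>
</match>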
I believe I've found the underlying source of the thread issues; I'm just not sure whether it is a fluentd issue or a plugin issue. What I've found is that if there is a lot of reorganization of forwarded chunks, fluentd becomes unstable and has many issues. When all my chunks are aligned to 999k, everything runs very well. I tried reducing the forwarded chunks to 100k on a few test servers and the system became unstable; changing back to 999k all around restored stability. It seems something in the re-chunking logic is causing deadlocks or other issues.
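As a concrete sketch of the alignment described above: buffer_chunk_limit (v0.12-style, forwarder) and chunk_limit_size (v0.14-style buffer section, aggregator) are real parameters, but pairing them like this is my reading of the report, not something it states explicitly:

# Forwarder: size of chunks produced before forwarding
<match prod.**>
  type forward
  buffer_chunk_limit 999k
  <server>
    host x.x.x.x
    port 24225
  </server>
</match>

# Aggregator: keep the chunk limit at the same size so forwarded
# chunks are not split or merged on arrival
# (this <buffer> section goes inside the bigquery <store> blocks above)
<buffer customer_id, time>
  @type file
  chunk_limit_size 999k
</buffer>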
Hmm.. Thanks for your report!
I believe this may have been the source of this issue: https://github.com/fluent/fluentd/issues/1549
We had a similar issue, but fluentd-0.14.22 resolved it. Thanks!
I'm seeing errors in fluent-plugin-bigquery that seem similar to deadlock issues that have been reported in general with Fluentd v0.14. However, since the errors specifically reference fluent-plugin-bigquery, I wanted to drop them in here. Things to know: restart takes a long time and finally dies with these errors.