fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0
12.93k stars 1.34k forks source link

CI: Fails with "Fluent::Plugin::ForwardOutput::NoNodesAvailable: no nodes are available" #4684

Closed Watson1978 closed 3 weeks ago

Watson1978 commented 1 month ago

Describe the bug

Sometimes, CI fails with "Fluent::Plugin::ForwardOutput::NoNodesAvailable: no nodes are available" on macOS platform.

1. Error: test: a node supporting responses(ForwardOutputTest): Fluent::Plugin::ForwardOutput::NoNodesAvailable: no nodes are available
/Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:55:in `select_healthy_node' /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in` select_service'
/Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service' /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in` try_write'
/Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write' /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in` try_flush'
/Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush' /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in` flush_thread_run'
/Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start' /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in` block in thread_create'

To Reproduce

Run CI

Expected behavior

Succeeded

Your Environment

- Fluentd version:
- Package version:
- Operating system:
- Kernel version:

Your Configuration

none

Your Error Log

See above

Additional context

No response

Watson1978 commented 1 month ago

Looks like the nodes which LoadBalancer has are unavailable when invoke after_shutdown at https://github.com/fluent/fluentd/blob/f493f25d0d33da8e16bab59d7856105788cdc7a5/test/plugin/test_out_forward.rb#L20

#<Thread:0x00000001378ecf50@flush_thread_0 /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:70 run> terminated with exception (report_on_exception is true):
/Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:57:in `select_healthy_node': no nodes are available (Fluent::Plugin::ForwardOutput::NoNodesAvailable)
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in `select_service'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in `try_write'
    from /Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in `try_flush'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
    from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
EE, [2024-10-27T04:01:58.832977 #6300] ERROR -- : unexpected error while after_shutdown, Fluent::Plugin::ForwardOutput::NoNodesAvailable:no nodes are available
E, [2024-10-27T04:01:58.833250 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:57:in `select_healthy_node'
E, [2024-10-27T04:01:58.833282 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in `select_service'
E, [2024-10-27T04:01:58.833304 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service'
E, [2024-10-27T04:01:58.833326 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in `try_write'
E, [2024-10-27T04:01:58.833346 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write'
E, [2024-10-27T04:01:58.833365 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in `try_flush'
E, [2024-10-27T04:01:58.833385 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush'
E, [2024-10-27T04:01:58.833403 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
E, [2024-10-27T04:01:58.833421 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
E, [2024-10-27T04:01:58.833439 #6300] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
Watson1978 commented 4 weeks ago

If it takes more than 3 seconds to run the test, it always seems to fail.

  test: a node supporting responses:            EE, [2024-10-27T21:22:23.236247 #6961] ERROR -- : unexpected error while after_shutdown, Fluent::Plugin::ForwardOutput::NoNodesAvailable:no nodes are available
E, [2024-10-27T21:22:23.236578 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:55:in `select_healthy_node'
E, [2024-10-27T21:22:23.236602 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in `select_service'
E, [2024-10-27T21:22:23.236616 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service'
E, [2024-10-27T21:22:23.236629 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in `try_write'
E, [2024-10-27T21:22:23.236642 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write'
E, [2024-10-27T21:22:23.236655 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in `try_flush'
E, [2024-10-27T21:22:23.236667 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush'
E, [2024-10-27T21:22:23.236679 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
E, [2024-10-27T21:22:23.236692 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
E, [2024-10-27T21:22:23.236705 #6961] ERROR -- :    /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
: (3.628359)
Watson1978 commented 4 weeks ago

Whe I added a following debug print, it shows the log that a timeout occurred in the ack_handler.

diff --git a/lib/fluent/plugin/out_forward/ack_handler.rb b/lib/fluent/plugin/out_forward/ack_handler.rb
index e1d19ae7..d4600fe5 100644
--- a/lib/fluent/plugin/out_forward/ack_handler.rb
+++ b/lib/fluent/plugin/out_forward/ack_handler.rb
@@ -45,6 +45,7 @@ def collect_response(select_interval)
           new_list = []
           @mutex.synchronize do
             @ack_waitings.each do |info|
+              puts "\n[debug] ACKWaitingSockInfo expired: #{[info.expired?(now), now, info.expired_time]}"
               if info.expired?(now)
                 # There are 2 types of cases when no response has been received from socket:
                 # (1) the node does not support sending responses
  test: a node supporting responses:            
[debug] ACKWaitingSockInfo expired: [true, 1705.749811958, 1705.727845291]

Then, It seems that ForwardOutput::Node is disabled by timeout.

daipom commented 4 weeks ago

Thanks for investigating this failure.

However, there are still some unclear points for me...

The error happens in the flush process. The ack info is enqueued after it. (after send_data). So, it seems impossible that the first flush fails due to the ack expiring. If it's possible, it's a retry.

Time-series:

  1. Flush
  2. Enqueue the ack
  3. Ack timeout (if ack response is slow)
  4. Retry flush
  5. NoNodesAvailable due to 3.

However, the test set retry_wait to 30s, so it should be timeout error (target_input_driver.run(expect_records: 2, timeout: 5))...

Even if this scenario were true, there must be something wrong with the flash thread dying by NoNodesAvailable if the ACK times out...

Watson1978 commented 4 weeks ago

The failure has following behavior as far as I have been able to find out,

  1. Result::FAILED will be returned if expired at https://github.com/fluent/fluentd/blob/f493f25d0d33da8e16bab59d7856105788cdc7a5/lib/fluent/plugin/out_forward/ack_handler.rb#L48-L53
  2. The node is disabled by Result::FAILED https://github.com/fluent/fluentd/blob/f493f25d0d33da8e16bab59d7856105788cdc7a5/lib/fluent/plugin/out_forward.rb#L523-L525
  3. After then, flush will be failed because no avaiable node are exists.
Watson1978 commented 4 weeks ago
diff --git a/lib/fluent/plugin/out_forward.rb b/lib/fluent/plugin/out_forward.rb
index 9a07acbe..549bdd7a 100644
--- a/lib/fluent/plugin/out_forward.rb
+++ b/lib/fluent/plugin/out_forward.rb
@@ -266,7 +266,7 @@ def configure(conf)
         log.warn('The value of keepalive_timeout is ignored. if you want to use keepalive, please add `keepalive true` to your conf.')
       end

-      raise Fluent::ConfigError, "ack_response_timeout must be a positive integer" if @ack_response_timeout < 1
+      # raise Fluent::ConfigError, "ack_response_timeout must be a positive integer" if @ack_response_timeout < 1
       @healthy_nodes_count_metrics = metrics_create(namespace: "fluentd", subsystem: "output", name: "healthy_nodes_count", help_text: "Number of count healthy nodes", prefer_gauge: true)
       @registered_nodes_count_metrics = metrics_create(namespace: "fluentd", subsystem: "output", name: "registered_nodes_count", help_text: "Number of count registered nodes", prefer_gauge: true)

diff --git a/test/plugin/test_out_forward.rb b/test/plugin/test_out_forward.rb
index fa4f05f3..a3d83a95 100644
--- a/test/plugin/test_out_forward.rb
+++ b/test/plugin/test_out_forward.rb
@@ -611,7 +611,7 @@ def try_write(chunk)

     @d = d = create_driver(config + %[
       require_ack_response true
-      ack_response_timeout 1s
+      ack_response_timeout 0s
       <buffer tag>
         flush_mode immediate
         retry_type periodic

If you change the code like above, you can reproduce easily with following command.

bundle exec ruby -w -I"lib:test" -Eascii-8bit:ascii-8bit test/plugin/test_out_forward.rb --name='test: a node supporting responses'
daipom commented 4 weeks ago

Thanks! I will try it!

daipom commented 4 weeks ago

Thanks! I can reproduce this by https://github.com/fluent/fluentd/issues/4684#issuecomment-2440936541.

The second flush becomes an error. This scenario is likely occurring.

https://github.com/fluent/fluentd/issues/4684#issuecomment-2440692192

1. Flush
2. Enqueue the ack
3. Ack timeout (if ack response is slow)
4. Retry flush
5. NoNodesAvailable due to 3.

What is very strange is that the retry interval (30s) is not applied to 4.. Somehow, flush_interval (1s) is applied. Something should be wrong, but I can't figure it out yet.

daipom commented 4 weeks ago

What is very strange is that the retry interval (30s) is not applied to 4.. Somehow, flush_interval (1s) is applied.

I have found the cause. out_forward does not update RetryState in rollback_write because using update_retry: false.

https://github.com/fluent/fluentd/blob/4f80067c2329dad36fd921af965e7c606f57a10e/lib/fluent/plugin/out_forward.rb#L516-L536

I don't know why, but it makes the retry run at 1s intervals. On the other hand, there is another feature to rollback chunks automatically:

https://github.com/fluent/fluentd/blob/4f80067c2329dad36fd921af965e7c606f57a10e/lib/fluent/plugin/output.rb#L1550-L1559

If this is done first, a retry interval is applied. Normally, the original logic of out_forward should be done first because of the timeout difference:

https://github.com/fluent/fluentd/blob/4f80067c2329dad36fd921af965e7c606f57a10e/lib/fluent/plugin/out_forward.rb#L283-L290

https://github.com/fluent/fluentd/blob/4f80067c2329dad36fd921af965e7c606f57a10e/lib/fluent/plugin/out_forward.rb#L500-L506

https://github.com/fluent/fluentd/blob/4f80067c2329dad36fd921af965e7c606f57a10e/lib/fluent/plugin/output.rb#L1213

daipom commented 4 weeks ago

The remaining concern is

Even if this scenario were true, there must be something wrong with the flash thread dying by NoNodesAvailable if the ACK times out...

daipom commented 4 weeks ago

The remaining concern is

Even if this scenario were true, there must be something wrong with the flash thread dying by NoNodesAvailable if the ACK times out...

I have confirmed this is the test-only behavior.

https://github.com/fluent/fluentd/blob/4f80067c2329dad36fd921af965e7c606f57a10e/lib/fluent/plugin/output.rb#L1283

OK. Now, this issue is clear to me. Thanks!

daipom commented 3 weeks ago

Thanks!