Closed Watson1978 closed 3 weeks ago
Looks like the nodes which LoadBalancer
has are unavailable when invoke after_shutdown
at https://github.com/fluent/fluentd/blob/f493f25d0d33da8e16bab59d7856105788cdc7a5/test/plugin/test_out_forward.rb#L20
#<Thread:0x00000001378ecf50@flush_thread_0 /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:70 run> terminated with exception (report_on_exception is true):
/Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:57:in `select_healthy_node': no nodes are available (Fluent::Plugin::ForwardOutput::NoNodesAvailable)
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in `select_service'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in `try_write'
from /Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in `try_flush'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
from /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
EE, [2024-10-27T04:01:58.832977 #6300] ERROR -- : unexpected error while after_shutdown, Fluent::Plugin::ForwardOutput::NoNodesAvailable:no nodes are available
E, [2024-10-27T04:01:58.833250 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:57:in `select_healthy_node'
E, [2024-10-27T04:01:58.833282 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in `select_service'
E, [2024-10-27T04:01:58.833304 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service'
E, [2024-10-27T04:01:58.833326 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in `try_write'
E, [2024-10-27T04:01:58.833346 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write'
E, [2024-10-27T04:01:58.833365 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in `try_flush'
E, [2024-10-27T04:01:58.833385 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush'
E, [2024-10-27T04:01:58.833403 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
E, [2024-10-27T04:01:58.833421 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
E, [2024-10-27T04:01:58.833439 #6300] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
If it takes more than 3 seconds to run the test, it always seems to fail.
test: a node supporting responses: EE, [2024-10-27T21:22:23.236247 #6961] ERROR -- : unexpected error while after_shutdown, Fluent::Plugin::ForwardOutput::NoNodesAvailable:no nodes are available
E, [2024-10-27T21:22:23.236578 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward/load_balancer.rb:55:in `select_healthy_node'
E, [2024-10-27T21:22:23.236602 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery/manager.rb:108:in `select_service'
E, [2024-10-27T21:22:23.236616 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/service_discovery.rb:82:in `service_discovery_select_service'
E, [2024-10-27T21:22:23.236629 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:375:in `try_write'
E, [2024-10-27T21:22:23.236642 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/test/plugin/test_out_forward.rb:57:in `try_write'
E, [2024-10-27T21:22:23.236655 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1216:in `try_flush'
E, [2024-10-27T21:22:23.236667 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/out_forward.rb:353:in `try_flush'
E, [2024-10-27T21:22:23.236679 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
E, [2024-10-27T21:22:23.236692 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
E, [2024-10-27T21:22:23.236705 #6961] ERROR -- : /Users/runner/work/fluentd/fluentd/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
: (3.628359)
Whe I added a following debug print, it shows the log that a timeout occurred in the ack_handler.
diff --git a/lib/fluent/plugin/out_forward/ack_handler.rb b/lib/fluent/plugin/out_forward/ack_handler.rb
index e1d19ae7..d4600fe5 100644
--- a/lib/fluent/plugin/out_forward/ack_handler.rb
+++ b/lib/fluent/plugin/out_forward/ack_handler.rb
@@ -45,6 +45,7 @@ def collect_response(select_interval)
new_list = []
@mutex.synchronize do
@ack_waitings.each do |info|
+ puts "\n[debug] ACKWaitingSockInfo expired: #{[info.expired?(now), now, info.expired_time]}"
if info.expired?(now)
# There are 2 types of cases when no response has been received from socket:
# (1) the node does not support sending responses
test: a node supporting responses:
[debug] ACKWaitingSockInfo expired: [true, 1705.749811958, 1705.727845291]
Then, It seems that ForwardOutput::Node is disabled by timeout.
Thanks for investigating this failure.
However, there are still some unclear points for me...
The error happens in the flush
process.
The ack info is enqueued after it. (after send_data
).
So, it seems impossible that the first flush
fails due to the ack expiring.
If it's possible, it's a retry.
Time-series:
However, the test set retry_wait
to 30s
, so it should be timeout error (target_input_driver.run(expect_records: 2, timeout: 5)
)...
Even if this scenario were true, there must be something wrong with the flash thread dying by NoNodesAvailable if the ACK times out...
The failure has following behavior as far as I have been able to find out,
Result::FAILED
will be returned if expired at
https://github.com/fluent/fluentd/blob/f493f25d0d33da8e16bab59d7856105788cdc7a5/lib/fluent/plugin/out_forward/ack_handler.rb#L48-L53Result::FAILED
https://github.com/fluent/fluentd/blob/f493f25d0d33da8e16bab59d7856105788cdc7a5/lib/fluent/plugin/out_forward.rb#L523-L525diff --git a/lib/fluent/plugin/out_forward.rb b/lib/fluent/plugin/out_forward.rb
index 9a07acbe..549bdd7a 100644
--- a/lib/fluent/plugin/out_forward.rb
+++ b/lib/fluent/plugin/out_forward.rb
@@ -266,7 +266,7 @@ def configure(conf)
log.warn('The value of keepalive_timeout is ignored. if you want to use keepalive, please add `keepalive true` to your conf.')
end
- raise Fluent::ConfigError, "ack_response_timeout must be a positive integer" if @ack_response_timeout < 1
+ # raise Fluent::ConfigError, "ack_response_timeout must be a positive integer" if @ack_response_timeout < 1
@healthy_nodes_count_metrics = metrics_create(namespace: "fluentd", subsystem: "output", name: "healthy_nodes_count", help_text: "Number of count healthy nodes", prefer_gauge: true)
@registered_nodes_count_metrics = metrics_create(namespace: "fluentd", subsystem: "output", name: "registered_nodes_count", help_text: "Number of count registered nodes", prefer_gauge: true)
diff --git a/test/plugin/test_out_forward.rb b/test/plugin/test_out_forward.rb
index fa4f05f3..a3d83a95 100644
--- a/test/plugin/test_out_forward.rb
+++ b/test/plugin/test_out_forward.rb
@@ -611,7 +611,7 @@ def try_write(chunk)
@d = d = create_driver(config + %[
require_ack_response true
- ack_response_timeout 1s
+ ack_response_timeout 0s
<buffer tag>
flush_mode immediate
retry_type periodic
If you change the code like above, you can reproduce easily with following command.
bundle exec ruby -w -I"lib:test" -Eascii-8bit:ascii-8bit test/plugin/test_out_forward.rb --name='test: a node supporting responses'
Thanks! I will try it!
Thanks! I can reproduce this by https://github.com/fluent/fluentd/issues/4684#issuecomment-2440936541.
The second flush becomes an error. This scenario is likely occurring.
https://github.com/fluent/fluentd/issues/4684#issuecomment-2440692192
1. Flush 2. Enqueue the ack 3. Ack timeout (if ack response is slow) 4. Retry flush 5. NoNodesAvailable due to 3.
What is very strange is that the retry interval (30s
) is not applied to 4.
.
Somehow, flush_interval
(1s
) is applied.
Something should be wrong, but I can't figure it out yet.
What is very strange is that the retry interval (
30s
) is not applied to4.
. Somehow,flush_interval
(1s
) is applied.
I have found the cause.
out_forward
does not update RetryState
in rollback_write
because using update_retry: false
.
I don't know why, but it makes the retry run at 1s intervals. On the other hand, there is another feature to rollback chunks automatically:
If this is done first, a retry interval is applied.
Normally, the original logic of out_forward
should be done first because of the timeout difference:
The remaining concern is
Even if this scenario were true, there must be something wrong with the flash thread dying by NoNodesAvailable if the ACK times out...
The remaining concern is
Even if this scenario were true, there must be something wrong with the flash thread dying by NoNodesAvailable if the ACK times out...
I have confirmed this is the test-only behavior.
OK. Now, this issue is clear to me. Thanks!
Thanks!
Describe the bug
Sometimes, CI fails with "Fluent::Plugin::ForwardOutput::NoNodesAvailable: no nodes are available" on macOS platform.
To Reproduce
Run CI
Expected behavior
Succeeded
Your Environment
Your Configuration
Your Error Log
Additional context
No response