fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.8k stars 1.58k forks source link

aws imds: coroutine not resumed on connection timeout, locking output #5553

Closed zhenyami closed 1 year ago

zhenyami commented 2 years ago

Bug Report

Describe the bug We are using BigQuery output plugin with Workload Identity Federation auth method, which uses AWS EC2 instance temporary credentials to auth requests to GCP BigQuery API. BigQuery output instance sometimes gets completely blocked and all requests fail. Upon closer look, I saw that AWS credentials code has a locking mechanism and occasionally there is a situation where it stays locked forever, blocking all future auth attempts until Fluent Bit is manually restarted.

To investigate, I added extra logging and reproduced the issue by dropping connections to instance metadata endpoint via iptables. I found that locking happens when a coroutine starts a connection to metadata endpoint and yields, the connection times out, and that coroutine is never resumed, so the AWS credentials provider is never unlocked. This doesn't happen immediately but after some number of failed attempts (over a hundred in my tests) to retrieve IMDS credentials. I haven't tested any other configurations.

Connection timeout, coroutine resumed (custom logs)
# AWS creds provider is locked
...
[2022/06/09 05:01:28] [ info] [net] created IMDS client socket #59
[2022/06/09 05:01:28] [ info] [net] connection #59 in process to 169.254.169.254:80
[2022/06/09 05:01:28] [ info] [net] yielding coroutine=0x7f5f0dde66a8
...
[2022/06/09 05:01:39] [error] [upstream] connection #59 to 169.254.169.254:80 timed out after 10 seconds
[2022/06/09 05:01:39] [ info] [upstream] injecting connection #59 event back into event loop with type: 1024
[2022/06/09 05:01:39] [ info] [upstream] destroy connection #59 to 169.254.169.254:80
[2022/06/09 05:01:39] [ info] [upstream] deleted event #59:#59 from event loop, status: 0
...
[2022/06/09 05:01:39] [ info] [engine] resuming coroutine=0x7f5f0dde66a8, fd=-1
..
[2022/06/09 05:01:39] [error] [aws_client] connection initialization error
[2022/06/09 05:01:39] [ warn] [imds] failed to retrieve metadata
...
# AWS creds provider is unlocked

[2022/06/09 05:01:39] [ info] [upstream] really destroying upstream 0x7f5f0e7ea300, socket #-1, coro (nil)
Connection timeout, coroutine not resumed (custom logs)
# AWS creds provider is locked
...
[2022/06/09 05:01:43] [ info] [net] created IMDS client socket #59
[2022/06/09 05:01:43] [ info] [net] connection #59 in process to 169.254.169.254:80
[2022/06/09 05:01:43] [ info] [net] yielding coroutine=0x7f5f0dde66a8
...
[2022/06/09 05:01:54] [error] [upstream] connection #59 to 169.254.169.254:80 timed out after 10 seconds
[2022/06/09 05:01:54] [ info] [upstream] injecting connection #59 event back into event loop with type: 1024
[2022/06/09 05:01:54] [ info] [upstream] destroy connection #59 to 169.254.169.254:80
[2022/06/09 05:01:54] [ info] [upstream] deleted event #59:#59 from event loop, status: 0
...
[2022/06/09 05:01:54] [ info] [upstream] really destroying upstream 0x7f5f0e7ea300, socket #-1, coro 0x7f5f0dde66a8
[2022/06/09 05:01:54] [ info] [upstream] connection is still busy, not destroying yet
...
# Engine resume coroutine never called
# AWS creds provider never unlocked

I'm still working on this. Fluent Bit engine and event code is new to me. Decided to post here, in case anybody has any suggestions how to debug this issue.

To Reproduce

Expected behavior

Network errors shouldn't block all output.

Screenshots

N/A

Your Environment

Additional context

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.

lecaros commented 2 years ago

https://github.com/fluent/fluent-bit/pull/5649 should solve this, but it is still waiting for confirmation and merge.

zhenyami commented 2 years ago

5649 should solve this, but it is still waiting for confirmation and merge.

Thanks, @lecaros. Please let me know if anything is required from my side.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.