Closed zhenyami closed 1 year ago
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
https://github.com/fluent/fluent-bit/pull/5649 should solve this, but it is still waiting for confirmation and merge.
Thanks, @lecaros. Please let me know if anything is required from my side.
Bug Report
Describe the bug
We are using the BigQuery output plugin with the Workload Identity Federation auth method, which uses AWS EC2 instance temporary credentials to authenticate requests to the GCP BigQuery API. The BigQuery output instance sometimes gets completely blocked and all requests fail. On closer inspection, I saw that the AWS credentials code has a locking mechanism, and occasionally it stays locked forever, blocking all future auth attempts until Fluent Bit is manually restarted.
To investigate, I added extra logging and reproduced the issue by dropping connections to the instance metadata endpoint via iptables. I found that the lock gets stuck when a coroutine starts a connection to the metadata endpoint and yields, the connection times out, and the coroutine is never resumed, so the AWS credentials provider is never unlocked. This doesn't happen immediately, but after some number of failed attempts (over a hundred in my tests) to retrieve IMDS credentials. I haven't tested any other configurations.
Connection timeout, coroutine resumed (custom logs)
Connection timeout, coroutine not resumed (custom logs)
I'm still working on this. The Fluent Bit engine and event code are new to me, so I decided to post here in case anybody has suggestions on how to debug this issue.
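To make the suspected failure mode easier to discuss, here is a toy model of it in Python. It is not Fluent Bit code: the names Provider, refresh, and run are hypothetical, a Python generator stands in for the coroutine, and a boolean flag stands in for the provider lock. It only illustrates how a coroutine that takes a lock, yields, and is never resumed leaves the lock held for every later attempt.

```python
# Toy model of the reported failure mode; NOT Fluent Bit code.
# All names here (Provider, refresh, run) are hypothetical.

class Provider:
    """Stands in for a credentials provider guarded by a try-lock."""

    def __init__(self):
        self.locked = False
        self.credentials = None

    def try_lock(self):
        if self.locked:
            return False
        self.locked = True
        return True

    def unlock(self):
        self.locked = False


def refresh(provider):
    """Generator used as a coroutine: yields while 'waiting' on the network."""
    if not provider.try_lock():
        return "busy"
    # Yield to the event loop while the metadata request is in flight.
    response = yield "waiting"
    # Everything below runs only if the event loop resumes the coroutine.
    if response is not None:
        provider.credentials = response
    provider.unlock()
    return "ok" if response is not None else "error"


def run(provider, resume, response=None):
    """Drive one refresh attempt. 'resume' models whether the event loop
    wakes the coroutine up after the connection attempt finishes or times out."""
    co = refresh(provider)
    try:
        next(co)  # run until the coroutine yields (or returns "busy")
    except StopIteration as stop:
        return stop.value
    if not resume:
        return "abandoned"  # never resumed, so unlock() is never reached
    try:
        co.send(response)
    except StopIteration as stop:
        return stop.value


p = Provider()
first = run(p, resume=True)                     # timeout, coroutine resumed
second = run(p, resume=False)                   # timeout, coroutine not resumed
third = run(p, resume=True, response="creds")   # any later attempt
```

In this model the first attempt fails cleanly and releases the lock, the second leaves it held, and the third is rejected before it even tries the network, which matches the "blocked until restart" symptom. The sketch says nothing about why the real coroutine is not resumed.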
To Reproduce
Expected behavior
Network errors shouldn't block all output.
Screenshots
N/A
Your Environment
Additional context