CPU stuck at 100% upon network drops to Azure

bradley-carrion commented 3 months ago

### Describe the question/issue

Once we enabled the aws-for-fluent-bit image with our own Fluent Bit configuration and new Azure Blob outputs at scale, we occasionally see these errors:

```
[error] [http_client] broken connection to {our_storage_account}.blob.core.windows.net:443
```

After enough of these, the container reaches a point of no return where CPU spikes to 100% and stays there until the ALB finally marks the task as unhealthy. We had to move off the aws-for-fluent-bit image and onto the latest v3.1.4 of Fluent Bit.

### Configuration

### Fluent Bit Log Output

We have enabled debug logs, and nothing in the logs indicates that the CPU should be having issues.
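Since the debug logs are silent, one way to confirm the stuck state from outside the process is to sample the container's CPU and check for a sustained run near 100%. The following is a minimal sketch for the local repro only (not Fargate). It assumes the Docker CLI is available; the container name, threshold, and sampling window are hypothetical choices, not values from this report.

```python
import subprocess
import time
from typing import Iterable, List


def parse_cpu_percent(text: str) -> float:
    """Convert a `docker stats` CPUPerc value such as '99.87%' to a float."""
    return float(text.strip().rstrip("%"))


def is_pinned(samples: Iterable[float], threshold: float = 95.0, min_run: int = 6) -> bool:
    """Return True if at least `min_run` consecutive samples are >= `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_run:
            return True
    return False


def sample_container_cpu(container: str) -> float:
    """Take one CPU reading for `container` via `docker stats --no-stream`."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.CPUPerc}}", container],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_cpu_percent(out)


def watch(container: str, interval_s: float = 5.0, count: int = 24) -> bool:
    """Collect `count` samples and report whether CPU looks pinned.

    ASSUMPTION: `container` is whatever name you gave the local Fluent Bit
    container; the defaults here are illustrative, not tuned.
    """
    samples: List[float] = []
    for _ in range(count):
        samples.append(sample_container_cpu(container))
        time.sleep(interval_s)
    return is_pinned(samples)
```

For example, `watch("fluent-bit")` (hypothetical container name) returns `True` only if six consecutive readings are at or above 95%. Note that Docker reports CPU relative to a single core, so readings can exceed 100% on multi-core hosts.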

### Fluent Bit Version Info

amazon/aws-for-fluent-bit:2.32.2 which uses v1.9.10 of fluent bit under the hood.

### Cluster Details

We're running ECS Fargate w/ sidecar deployment of aws-for-fluent-bit.

(This repros locally btw)

### Application Details

I was able to repro this locally with the following throughput:

~80 logs / sec
~1kb / log

### Steps to reproduce issue

1. Start the fluent bit container locally with it pointed to azure blob output
2. Start sending as many logs as you can locally (see above throughput details)
3. Turn off your network connection so that the requests to Azure start failing, however your requests to fluent bit should continue to succeed
4. Wait about 30-60s (longer if you want to really pressure test it)
5. Turn your network connection back on
6. Repeat steps 2 - 5 or watch the fluent bit container explode
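The load in the steps above (~80 logs/sec at ~1 KB each) could be generated with something like the sketch below. It is not the reporter's tooling. The issue does not say which input plugin receives the logs, so the endpoint in `main()` is an assumption: a Fluent Bit `tcp` input accepting newline-delimited JSON on `127.0.0.1:5170`. All function names are hypothetical.

```python
import json
import socket
import time
from typing import Callable, Iterator


def make_record(seq: int, size_bytes: int = 1024) -> bytes:
    """Build one newline-terminated JSON record padded to `size_bytes` bytes."""
    base = {"seq": seq, "msg": ""}
    overhead = len(json.dumps(base).encode()) + 1  # +1 for the trailing newline
    base["msg"] = "x" * max(0, size_bytes - overhead)
    return json.dumps(base).encode() + b"\n"


def paced(rate_per_sec: float, total: int,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[int]:
    """Yield 0..total-1 on a fixed schedule of `rate_per_sec` items per second."""
    start = clock()
    for i in range(total):
        delay = start + i / rate_per_sec - clock()
        if delay > 0:
            sleep(delay)
        yield i


def run(send: Callable[[bytes], None], rate_per_sec: float = 80, seconds: float = 60) -> int:
    """Send records through `send` at the given rate; return how many were sent."""
    sent = 0
    for i in paced(rate_per_sec, int(rate_per_sec * seconds)):
        send(make_record(i))
        sent += 1
    return sent


def main() -> None:
    # ASSUMPTION: Fluent Bit is listening with a `tcp` input (JSON format)
    # on 127.0.0.1:5170. Adjust to whatever input your configuration uses.
    with socket.create_connection(("127.0.0.1", 5170)) as sock:
        run(sock.sendall)
```

Calling `main()` while toggling the network connection covers steps 2 through 5. Because the generator talks only to the local Fluent Bit input, it keeps succeeding while the Azure requests fail.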

### Related Issues

No related issues, but a suspected fix is in https://github.com/fluent/fluent-bit/pull/5918

My suggestion would be to consider upgrading to the latest fluent bit version.

swapneils commented 3 months ago

Two questions here to clarify the specific code-segments that are involved:

So upgrading to build aws-for-fluent-bit with 3.1.4 prevented this issue from occurring?
Which output plugin are you using here?

guidoiaquinti commented 1 month ago

Since ~2 hours, this is broken on latest too.

bradley-carrion commented 1 month ago

@swapneils Apologies for the delayed response.

> Two questions here to clarify the specific code-segments that are involved:
>
> So upgrading to build aws-for-fluent-bit with 3.1.4 prevented this issue from occurring?

No, we completely dropped the aws-for-fluent-bit image and are purely using the standard fluent-bit 3.1.4 image.

> Which output plugin are you using here?

We are using the Azure Blob plugin

swapneils commented 1 month ago

> Since ~2 hours, this is broken on latest too.

@guidoiaquinti Are you saying you tested this case ~2 hours ago, or that this case was previously working for you and is now failing with the latest tag?

In the latter case, is the public.ecr.aws/aws-observability/aws-for-fluent-bit:init-debug-2.32.2.20240820 image working without issues? The latest release shouldn't be exhibiting different behavior from stable since we didn't change any fluent-bit code.

guidoiaquinti commented 1 month ago

Maybe this is completely unrelated, and to be honest, I'm not sure what has changed (I'm currently on mobile with limited connectivity), but all our deployments started failing approximately two hours ago with the following errors:

[2024/10/07 20:17:15] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
[2024/10/07 20:17:15] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory

The timeframe aligns with the update of the latest tag. Reverting to stable fixes it. While not strictly related to this GitHub issue, I arrived here because the bug above seems to be occurring in the same Fluent Bit version of the report.

bradley-carrion commented 1 month ago

> Maybe this is completely unrelated, and to be honest, I'm not sure what has changed (I'm currently on mobile with limited connectivity), but all our deployments started failing approximately two hours ago with the following errors:
>
> [2024/10/07 20:17:15] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
> [2024/10/07 20:17:15] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
>
> The timeframe aligns with the update of the latest tag. Reverting to stable fixes it. While not strictly related to this GitHub issue, I arrived here because the bug above seems to be occurring in the same Fluent Bit version of the report.

This seems unrelated: my issue is not exclusive to the new latest tag, I did not see the error messages you're referring to, and the image still hasn't upgraded the underlying Fluent Bit version from 1.9.10, which is the compatibility issue I'm calling out here. I'd recommend always using the stable version and creating a new issue for what you're seeing, @guidoiaquinti.

swapneils commented 1 month ago

Thanks Bradley (and sorry for this additional ping :) )

@guidoiaquinti After making the new Issue, could you pin to 2.32.2.20240820 for the moment and email me an AWS Account ID at swapneis@amazon.com?

The first point is because we plan to update our stable image later this week unless we see issues in stability testing (which I don't expect). Delaying the update further without a clear availability risk would harm other customers' workflows (e.g. security scanning), but I also don't want to break yours.

The account ID is so I can share test aws-for-fluent-bit images with you to facilitate investigation.
