aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

[http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 #354

Open · LucasHantz opened this issue 2 years ago

LucasHantz commented 2 years ago

Describe the question/issue

I'm getting broken connection errors to Firehose and CloudWatch on containers with low traffic, as they are in a staging environment. Once a container logs this error, its RAM usage keeps growing until it reaches the memory limit and the task is killed.

Configuration

[SERVICE]
    Parsers_File /parser.conf
    Streams_File /stream_processing.conf
    Flush 1
    Grace 30

    ## FB Metrics
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name tcp
    Alias tcp.atom
    Listen 127.0.0.1
    Port 5170
    Chunk_Size 32
    Buffer_Size 64
    Format json
    Tag application

[FILTER]
    Name parser
    Match platform*
    Key_Name log
    Parser json
    Reserve_Data True

[FILTER]
    Name modify
    Match application*
    Rename ecs_task_arn task_id

[OUTPUT]
    Name kinesis_firehose
    Alias kinesis.atom-logs
    Match application.logs*
    region ${AWS_REGION}
    delivery_stream atom-logs
    workers 1

[OUTPUT]
    Name kinesis_firehose
    Alias kinesis.atom-metrics
    Match application.metrics*
    region ${AWS_REGION}
    delivery_stream atom-metrics
    workers 1

### METRICS ###

# Configure FB to scrape its own prom metrics
[INPUT]
    Name exec
    Alias exec.metric
    Command curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
    Interval_Sec 30
    Tag fb_metrics

# Filter out everything except input and output metrics
# Customize this to change which metrics are sent
[FILTER]
    Name grep
    Match fb_metrics
    Regex exec (input|output)

# Filter out the HELP and TYPE fields which aren't parseable by the cw metric filter
[FILTER]
    Name grep
    Match fb_metrics
    Exclude exec HELP

[FILTER]
    Name grep
    Match fb_metrics
    Exclude exec TYPE

# Parse the metrics to json for easy parsing in CW Log Group Metrics filter
[FILTER]
    Name parser
    Match fb_metrics
    Key_Name exec
    Parser fluentbit_prom_metrics_to_json
    Reserve_Data True

# Send the metrics as CW Logs
# The CW Metrics filter on the log group will turn them into metrics
# Use hostname in logs to differentiate log streams per task in Fargate
# Alternative is to use: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
[OUTPUT]
    Name cloudwatch_logs
    Alias cloudwatch.fb_metrics
    Match fb_metrics
    region ${AWS_REGION}
    log_group_name ${FLUENT_BIT_METRICS_LOG_GROUP}
    log_stream_name metrics
    retry_limit 2

Fluent Bit Log Output

Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/05/23 19:34:30] [ info] [fluent bit] version=1.9.3, commit=a313296229, pid=1
[2022/05/23 19:34:30] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/23 19:34:30] [ info] [cmetrics] version=0.3.1
[2022/05/23 19:34:30] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2022/05/23 19:34:30] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2022/05/23 19:34:30] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
[2022/05/23 19:34:30] [ info] [input:tcp:tcp.atom] listening on 127.0.0.1:5170
[2022/05/23 19:34:30] [ info] [output:kinesis_firehose:kinesis.atom-logs] worker #0 started
[2022/05/23 19:34:30] [ info] [output:kinesis_firehose:kinesis.atom-metrics] worker #0 started
[2022/05/23 19:34:30] [ info] [output:null:null.3] worker #0 started
[2022/05/23 19:34:31] [ info] [output:cloudwatch_logs:cloudwatch.atom] worker #0 started
[2022/05/23 19:34:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/23 19:34:31] [ info] [sp] stream processor started
[2022/05/23 19:34:31] [ info] [sp] registered task: stream.logs
[2022/05/23 19:34:31] [ info] [sp] registered task: stream.metrics
[2022/05/23 19:34:32] [ info] [output:cloudwatch_logs:cloudwatch.atom] Creating log stream web/platform-firelens-0a4943272e2341abbb02344e7ee3b47d in log group /ecs/atom-platform
[2022/05/23 19:34:32] [ info] [output:cloudwatch_logs:cloudwatch.atom] Created log stream web/platform-firelens-0a4943272e2341abbb02344e7ee3b47d
[2022/05/23 19:35:01] [ info] [output:cloudwatch_logs:cloudwatch.fb_metrics] Creating log stream metrics in log group /firelens/atom-platform
[2022/05/23 19:35:01] [ info] [output:cloudwatch_logs:cloudwatch.fb_metrics] Log Stream metrics already exists
[2022/05/23 19:35:41] [error] [net] connection #43 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send log records to atom-logs
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send log records
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records to atom-metrics
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send records
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send records
[2022/05/23 20:40:56] [ warn] [engine] failed to flush chunk '1-1653338455.964299341.flb', retry in 8 seconds: task_id=1, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/23 20:40:56] [ warn] [engine] failed to flush chunk '1-1653338455.964432859.flb', retry in 10 seconds: task_id=0, input=stream.logs > output=kinesis.atom-logs (out_id=0)
[2022/05/23 20:41:04] [ info] [engine] flush chunk '1-1653338455.964299341.flb' succeeded at retry 1: task_id=1, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/23 20:41:06] [ info] [engine] flush chunk '1-1653338455.964432859.flb' succeeded at retry 1: task_id=0, input=stream.logs > output=kinesis.atom-logs (out_id=0)
[2022/05/23 21:20:41] [error] [net] connection #164 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 22:00:11] [error] [net] connection #174 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 22:38:41] [error] [net] connection #189 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/24 04:50:11] [error] [net] connection #211 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/24 05:21:30] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/24 05:21:30] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records to atom-metrics
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send records
[2022/05/24 05:21:30] [ warn] [engine] failed to flush chunk '1-1653369689.464347392.flb', retry in 8 seconds: task_id=0, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/24 05:21:38] [ info] [engine] flush chunk '1-1653369689.464347392.flb' succeeded at retry 1: task_id=0, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)

Fluent Bit Version Info

I can reproduce the same error with both the stable and latest versions of the image.

Cluster Details

ECS Fargate with awsvpc networking. The Firehose and CloudWatch VPC endpoints are enabled.

Application Details

NTR

Steps to reproduce issue

We have run load tests against the container with the same configuration without seeing this error, so it seems the error only occurs when throughput is low.
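
One thing I'm considering trying, since the errors show up after idle periods: tuning connection keepalive on the outputs so idle connections are recycled before something in the path silently drops them. A minimal sketch, assuming the standard Fluent Bit net.* options apply to these outputs (values are illustrative, not tested here):

[OUTPUT]
    Name kinesis_firehose
    Match application.logs*
    region ${AWS_REGION}
    delivery_stream atom-logs
    # Recycle keepalive connections after a short idle period so a stale
    # socket isn't reused after sitting unused for minutes (illustrative value)
    net.keepalive On
    net.keepalive_idle_timeout 10
    # Or disable connection reuse entirely to rule keepalive out
    # net.keepalive Off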

Related Issues

This is the new configuration I've come up with based on the recommendation given here: https://github.com/aws/aws-for-fluent-bit/issues/351

Let me know if I did something wrong.

DrewZhang13 commented 2 years ago

This is the current recommendation for the CloudWatch plugin config. Could you try these settings?

Also, I wonder how your load testing is set up. That these errors show up only at low throughput but not at high throughput doesn't make sense to me.

LucasHantz commented 2 years ago

[image: load-test graphs — application throughput and FireLens metrics]

The graphs above are from a load test we ran on our application, together with the metrics generated by FireLens during that time. We had no "[http_client] broken connection" errors during the test, but we saw new errors later that day when the cluster was idle.

From what I see in the guidance, this config helps in high-throughput cases, which is not the problem here. Should I try it anyway?

LucasHantz commented 2 years ago

Tried with the new config, and I'm still seeing: [error] [net] connection #51 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443

LucasHantz commented 2 years ago

Any news on this, please?

zhonghui12 commented 2 years ago

@LucasHantz, may I confirm that the issue only occurs at low throughput? That is, you run Fluent Bit the same way with the same config, and you see problems only at a lower ingestion rate, right? May I know what the throughput is?

LucasHantz commented 2 years ago

The issue in fact happens at both low and high throughput. The following graph shows the number of records per minute over the last 8 hours. [image: records per minute over the last 8 hours]

As you can see, twice in the last 8 hours Fluent Bit stalled and stopped reporting any new logs. At that point this is the error log we get:
[2022/06/21 14:15:31] [error] [upstream] connection #609 to firehose.eu-west-1.amazonaws.com:443 timed out after 10 seconds
[2022/06/21 14:15:31] [error] [aws_client] connection initialization error

Eventually the Fluent Bit container's memory usage exploded and forced the whole task to shut down.
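
For now, to keep Fluent Bit from taking down the whole task while this is unresolved, I'm thinking of capping in-memory buffering on the inputs. A sketch (Mem_Buf_Limit is a standard Fluent Bit input option; the value is illustrative, not something we've validated):

[INPUT]
    Name tcp
    Alias tcp.atom
    Listen 127.0.0.1
    Port 5170
    Format json
    Tag application
    # Pause ingestion once buffered-but-unflushed data reaches this size,
    # instead of letting memory grow without bound while the output retries
    Mem_Buf_Limit 50MB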

LucasHantz commented 2 years ago

Any thoughts on this? What more can I provide to help figure out this problem?

LucasHantz commented 2 years ago

I just saw the issue raised at https://github.com/fluent/fluent-bit/issues/5705; I'm getting that error as well in our traces.

LucasHantz commented 2 years ago

@PettitWesley maybe? Is there any way to get this prioritized? It's impacting our production and I don't see how to revert to a stable solution.

PettitWesley commented 2 years ago

@LucasHantz Unfortunately right now I don't have any good ideas beyond using the settings here: https://github.com/aws/aws-for-fluent-bit/issues/340

And checking this: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
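
For reference, the kind of per-output network settings those docs cover looks roughly like this (a sketch with illustrative values, not a verified fix for this case):

[OUTPUT]
    Name kinesis_firehose
    Match application.logs*
    region ${AWS_REGION}
    delivery_stream atom-logs
    # Allow more time to establish a connection than the 10s default seen in the logs above
    net.connect_timeout 30
    # Retry a request once right away if it fails on a stale/broken connection
    auto_retry_requests true
    # Disable connection reuse if stale keepalive sockets are suspected
    # net.keepalive Off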