aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Troubleshooting fluentd logging driver if fluent-bit container is down #3620

Closed Galactic21 closed 1 year ago

Galactic21 commented 1 year ago

Summary

Troubleshooting fluentd logging driver if fluent-bit container is down

Description

My services run in ECS, and they all send logs to the fluent-bit container through the fluentd logging driver. What I want to find out is what happens to the driver if the fluent-bit container is down or the connection between the services breaks. I would also like to know whether errors related to the fluentd logging driver appear only on the ecs-agent side or whether they are stored somewhere other than:

Expected Behavior

After the fluent-bit service goes down, there should be some way to warn the user that the connection was broken, or this error should appear without having to restart docker.

Observed Behavior

After restarting docker or deploying the task again, the service container that is trying to connect to the fluent-bit service starts logging the error, as you can see in the image below.

Environment Details

Amazon Linux 2 EC2 instance.

Supporting Log Snippets

[Image: logging_driver_erro]

YashdalfTheGray commented 1 year ago

Hello! I'm going to pull in @PettitWesley here and have him help us out.

PettitWesley commented 1 year ago

@Galactic21 Give me more details on your use case. Are you using the fluentd log driver directly or via FireLens? https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

FireLens uses the fluentd-async option which means you don't fail container start if you can't connect. Messages will be dropped after the fluentd-buffer-limit fills up IIRC.

IIRC, I tested this and there's no nice error message for it that's emitted when it can't connect in async... but I might be wrong... you should test it.
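As a rough illustration of the options mentioned above, here is a minimal sketch of how a task definition's log configuration for the fluentd driver could be built with async mode enabled. The helper name, the address, and the buffer limit value are assumptions made for illustration; only the option names `fluentd-address`, `fluentd-async`, and `fluentd-buffer-limit` come from the Docker fluentd log driver documentation.

```python
import json


def fluentd_log_configuration(address, buffer_limit="1048576"):
    """Build an ECS logConfiguration dict for Docker's fluentd log driver.

    Hypothetical helper for illustration. Option values are strings
    because ECS passes log driver options as string key/value pairs.
    """
    return {
        "logDriver": "fluentd",
        "options": {
            # Where the Fluent Bit forward input listens (assumed address).
            "fluentd-address": address,
            # Do not block container start if the connection fails.
            "fluentd-async": "true",
            # Illustrative value; once the buffer is full, messages are dropped.
            "fluentd-buffer-limit": buffer_limit,
        },
    }


log_config = fluentd_log_configuration("fluent-bit.example.internal:24224")
print(json.dumps(log_config, indent=2))
```

The resulting dictionary would go under `logConfiguration` in a container definition of a service that ships logs to the shared Fluent Bit container.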

> fluentd logging driver error appear only on the ecs-agent side or if they are stored somewhere other than:

The error messages should only go to the Docker daemon logs, which on ECS AMIs go to journald, and you can read them with sudo journalctl -fu docker.service
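If you would rather scan those daemon logs from a script than follow them interactively, something along these lines might work. This is only a sketch: both function names are hypothetical, and the filter is a plain substring match because the exact wording of the driver's error messages is not given in this thread.

```python
import subprocess


def filter_fluentd_lines(lines):
    """Keep only log lines that mention fluentd or fluent-bit (case-insensitive)."""
    return [line for line in lines if "fluent" in line.lower()]


def read_docker_journal(since="1 hour ago"):
    """Return recent Docker daemon log lines from journald.

    Assumes a systemd host where the unit is named docker.service and
    that the caller has permission to read the journal.
    """
    result = subprocess.run(
        ["journalctl", "-u", "docker.service", "--since", since, "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

On the container instance, `filter_fluentd_lines(read_docker_journal())` would return whatever lines from the last hour mention the driver.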

https://docs.docker.com/config/containers/logging/fluentd/#fluentd-async

Galactic21 commented 1 year ago

@PettitWesley Good afternoon, I apologize for the delay in responding. I'm directly using a centralized fluent-bit container and not FireLens. I enabled that option and it seems to work for me; I had misunderstood this setting, so thanks. My only question now is whether it is possible to receive an alert when the fluent-bit container is completely down.

SreeeS commented 1 year ago

Hi @PettitWesley, can you please clarify whether it is possible to receive an alert?

PettitWesley commented 1 year ago

@SreeeS @Galactic21 I think if you want to track Fluent Bit uptime, the best method would be to track a metric emitted by it, rather than relying on error messages in the journald docker daemon logs.

For FireLens, we have these metrics and health check tutorials which could be used for non-FireLens Fluent Bit running in ECS as well:

If Fluent Bit fails health checks, then you know it must be down/non-responsive. If you make it an essential container, the task it is a part of will fail/die if it fails health checks or stops for any reason. You could then monitor for stopped tasks.
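To make the essential container plus health check idea concrete, here is a minimal sketch of what the Fluent Bit container definition and an EventBridge pattern for stopped tasks could look like. It is not taken from the tutorials mentioned above. It assumes Fluent Bit's built-in HTTP server is enabled with its health check on port 2020 and that `curl` exists in the image; the helper name and the image name are placeholders.

```python
import json


def fluent_bit_container(image, health_port=2020):
    """Build an ECS container definition for an essential Fluent Bit container.

    Hypothetical helper. The health command assumes the Fluent Bit HTTP
    server and its health check endpoint are enabled in the Fluent Bit
    configuration, and that curl is installed in the image.
    """
    return {
        "name": "fluent-bit",
        "image": image,
        # If this container stops or becomes unhealthy, the whole task stops.
        "essential": True,
        "healthCheck": {
            "command": [
                "CMD-SHELL",
                f"curl -f http://127.0.0.1:{health_port}/api/v1/health || exit 1",
            ],
            "interval": 30,
            "timeout": 5,
            "retries": 3,
            "startPeriod": 10,
        },
    }


# EventBridge event pattern matching ECS tasks that reach the STOPPED state.
STOPPED_TASK_EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}

container = fluent_bit_container("example/fluent-bit:latest")
print(json.dumps(container, indent=2))
print(json.dumps(STOPPED_TASK_EVENT_PATTERN, indent=2))
```

An EventBridge rule using this pattern could target an SNS topic to deliver the alert; in practice you would probably narrow the pattern to the cluster or task group that runs Fluent Bit.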

If you want an "Is Fluent Bit running" per-task metric and the options above do not look appealing, you might be able to just log the task ID to a log group every so often and create a metric filter on that. We have our init tag which can give you the task ID as an env var: https://github.com/aws/aws-for-fluent-bit/blob/mainline/use_cases/init-process-for-fluent-bit/README.md

You could then use an exec input to have Fluent Bit send you the task ID every so often if it's up: https://docs.fluentbit.io/manual/pipeline/inputs/exec
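A minimal sketch of that heartbeat idea, written as a small Python helper that renders a classic-format Fluent Bit configuration. It is untested and rests on several assumptions: that the init image exposes the task ID in an environment variable named `ECS_TASK_ID` (check the README linked above), and that the log group name, region, and `heartbeat` tag are placeholders you would replace. The helper name is hypothetical.

```python
def heartbeat_config(
    interval_sec=60,
    task_id_var="ECS_TASK_ID",  # assumed env var name; verify against the init README
    log_group="/example/fluent-bit-heartbeat",  # placeholder log group
    region="us-east-1",  # placeholder region
):
    """Render a Fluent Bit config that logs the task ID at a fixed interval."""
    return "\n".join(
        [
            "[INPUT]",
            "    Name          exec",
            "    Tag           heartbeat",
            # Fluent Bit substitutes ${VAR} from its environment at load time.
            f"    Command       echo heartbeat task_id=${{{task_id_var}}}",
            f"    Interval_Sec  {interval_sec}",
            "",
            "[OUTPUT]",
            "    Name              cloudwatch_logs",
            "    Match             heartbeat",
            f"    region            {region}",
            f"    log_group_name    {log_group}",
            "    log_stream_prefix heartbeat-",
            "    auto_create_group true",
            "",
        ]
    )


config_text = heartbeat_config()
print(config_text)
```

A CloudWatch metric filter matching the term `heartbeat` on that log group would then count heartbeats, and an alarm that treats missing data as breaching could fire when they stop arriving.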

If this isn't clear I can give more details if the idea is interesting.

Does this help? What do you think?

SreeeS commented 1 year ago

@PettitWesley Thank you for your detailed explanation and pointing to useful resources. This is really helpful. @Galactic21 does this clarify your question?

PettitWesley commented 1 year ago

The idea I propose here (which I haven't tested or prototyped yet) might be interesting: https://github.com/aws-samples/amazon-ecs-firelens-examples/issues/111

Galactic21 commented 1 year ago

I think I understand everything explained and I will apply it to my infrastructure, thank you very much. Cheers