[ECS] [request]: Support Mem_Buf_Limit in FireLens

PettitWesley commented 4 years ago

Tell us about your request

With FireLens, the input definitions for Fluent Bit are generated by ECS. This prevents customers from setting any custom options on the input configuration. Mem_Buf_Limit is an input configuration option which sets the total memory available for buffering logs.

This field should probably be optionally configurable by customers. We should determine if there are any other input options as well which might need to be configured.

UPDATE: We will likely work with Fluent Bit upstream to contribute this instead: https://github.com/fluent/fluent-bit/discussions/5711

Which service(s) is this request for?

ECS EC2 and ECS Fargate

Are you currently working around this issue?

There's a way to "hack" in input configuration. It is not ideal, but I could publish a tutorial on it if that is desired...

EDIT: Here is the detailed tutorial: https://aws.amazon.com/blogs/containers/how-to-set-fluentd-and-fluent-bit-input-parameters-in-firelens/

chikinchoi commented 4 years ago

Hi @PettitWesley ,

I would like to add Mem_Buf_Limit to the input plugin for FireLens to avoid the container OOM issue. However, as you mentioned, FireLens only lets users modify the output plugin in the task definition. As a workaround, I downloaded aws-for-fluent-bit from GitHub [1] and added 'Mem_Buf_Limit' to the input plugin in the Fluent Bit config file (fluent-bit.conf). However, after I ran the service and entered the container, I found that 'Mem_Buf_Limit' was missing from fluent-bit/etc/fluent-bit.conf. I think either FireLens overwrote the config or I wrote 'Mem_Buf_Limit' in the wrong config file. Do you have any idea what is causing this?

[1] https://github.com/aws/aws-for-fluent-bit

Thanks!!
Gary

PettitWesley commented 4 years ago

Hey @chikinchoi, there is a workaround that allows you to edit Mem_Buf_Limit right now, though it is slightly inconvenient.

I am working on writing and publishing a short tutorial on that; I can post a shortened version here before the full post is published. Stay tuned.

chikinchoi commented 4 years ago

Hi @PettitWesley, looking forward to your short tutorial!! Thank you very much!

chikinchoi commented 4 years ago

Hey @PettitWesley, could you share the general idea of how to implement the workaround? :)

PettitWesley commented 4 years ago

@chikinchoi Sorry for the delay, here is the short tutorial (which will be improved and cleaned up and published elsewhere in some time).

Background: How FireLens configures Fluentd and Fluent Bit

Before we learn how to set input parameters, we need to understand how FireLens works in detail.

As explained in Under the Hood: FireLens for ECS Tasks: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

Fluentd and Fluent Bit are powerful, but large feature sets are always accompanied by complexity. When we designed FireLens, we envisioned two major segments of users:

Those who want a simple way to send logs anywhere, powered by Fluentd and Fluent Bit.
Those who want the full power of Fluentd and Fluent Bit, with AWS managing the undifferentiated labor that’s needed to pipe a Task’s logs to these log routers.

Thus, while fundamentally FireLens just aimed to enable Fluentd and Fluent Bit in ECS and ECS Fargate, we built configuration management features to make that easy. This involved two things:

The Input plugin definitions to accept/collect logs from the runtime are generated by the ECS Agent.
A config translation mechanism was built to translate options in a container’s log configuration to Output plugin definitions.

Consequently, the configuration file for Fluentd or Fluent Bit is “fully managed” by ECS. With the config-file-type option, you can import your own configuration. However, the input definitions are always generated by ECS, and your additional config is then imported using the Fluentd/Fluent Bit include statement. Internally, Fluentd and Fluent Bit concatenate the two config files together, so your config is appended to the generated config.
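To make the include mechanism concrete, here is a rough sketch of the part of a generated Fluent Bit file that matters for this issue. It is modeled on the input sections posted elsewhere in this thread, not copied from the ECS Agent. The include path is a placeholder I made up, and the generated filter and output sections are left out.

```
# Sketch of the generated /fluent-bit/etc/fluent-bit.conf (approximate, incomplete)

# Generated by ECS; users cannot add options such as Mem_Buf_Limit here
[INPUT]
    Name      forward
    unix_path /var/run/fluent.sock

# Your own config is pulled in with an include statement.
# Placeholder path: the real one depends on your firelensConfiguration options.
@INCLUDE /path/to/your/custom.conf
```

Because your file is only included, anything you put in it is added to the generated sections and cannot change them.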

The generated config is always mounted into your log routing container at set locations:

Fluentd: /fluentd/etc/fluent.conf
Fluent Bit: /fluent-bit/etc/fluent-bit.conf

Most Fluentd and Fluent Bit images (including the Fluent OSS distributions and the AWS for Fluent Bit distribution) use these default configuration paths. These config paths are specified in the entrypoint definitions for the containers; see for example:

https://github.com/fluent/fluent-bit/blob/master/Dockerfile#L103
https://github.com/aws/aws-for-fluent-bit/blob/master/entrypoint.sh#L3

However, you can override the default entrypoint by building your own Fluentd or Fluent Bit image and specifying a different config path. That is the method we will use to set Mem_Buf_Limit.

Tutorial: Setting input parameters (WIP)

The configuration for Fluent Bit is generated by the ECS Agent, and mounted into the FireLens container at /fluent-bit/etc/fluent-bit.conf. The AWS for Fluent Bit container image and the official open source container distribution of Fluent Bit use this as the default configuration path.

The input configuration for FireLens can be seen here; the input definitions are always the same and do not change based on user input: https://github.com/aws-samples/amazon-ecs-firelens-under-the-hood/blob/master/generated-configs/fluent-bit/generated_by_firelens.conf#L3

Basically, logs are always read from a Unix Socket mounted into the container at /var/run/fluent.sock.

As a FireLens user, you can set your own input configuration by overriding the default entry point command for the Fluent Bit container. See the following:

https://github.com/fluent/fluent-bit/blob/master/Dockerfile#L103
https://github.com/aws/aws-for-fluent-bit/blob/master/entrypoint.sh#L3

The exact command depends on the distribution of Fluent Bit, but both set the config path as /fluent-bit/etc/fluent-bit.conf. To “hack” in your own input configuration, simply use a different config path.

If you use AWS for Fluent Bit, override the entry point command to be something like:

/fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/alt/fluent-bit.conf

Build a custom Fluent Bit image with your own configuration file at that location. Remember to set the input definition with the same unix path:

[INPUT]
   Name forward
   unix_path /var/run/fluent.sock

You can then add additional options in this input section.
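For the option this issue is about, the input section could look like the sketch below. The 25MB value is an arbitrary placeholder, not a recommendation; choose a value that fits the memory you give the FireLens container.

```
[INPUT]
    Name          forward
    unix_path     /var/run/fluent.sock
    # Cap on the memory this input may use for buffered records (placeholder value)
    Mem_Buf_Limit 25MB
```

When the limit is reached, Fluent Bit pauses the input until buffered data has been flushed, so the trade-off is possible log loss or backpressure instead of unbounded memory growth.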

To make your config dynamic at runtime, remember that you can use environment variables in Fluent Bit config:

[OUTPUT]
   Name cloudwatch
   Match   *
   region ${LOG_REGION}
   log_group_name ${LOG_GROUP}
   log_stream_prefix ${STREAM_PREFIX}
   auto_create_group true

You can then set the values of those environment variables in the FireLens container.

PettitWesley commented 4 years ago

Let me know if any of it is confusing

chikinchoi commented 4 years ago

Hi @PettitWesley ,

Thank you for your update! However, I am a little confused about the steps. Below are the steps that I have implemented:

Download the aws-for-fluent-bit source code from GitHub (https://github.com/aws/aws-for-fluent-bit/)
Update Dockerfile: from COPY fluent-bit.conf /fluent-bit/etc/ to COPY fluent-bit.conf /fluent-bit/alt/
Update entrypoint.sh: from exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf to exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/alt/fluent-bit.conf
Update fluent-bit.conf:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    Mem_Buf_Limit 6MB

[INPUT]
    Name forward
    Listen 127.0.0.1
    Port 24224

[INPUT]
    Name tcp
    Tag firelens-healthcheck
    Listen 127.0.0.1
    Port 8877

[OUTPUT]
    Name null
    Match firelens-healthcheck

[OUTPUT]
    Name forward
    Match container-samplems-firelens*
    Host myexternalfluentdendpoint.ap-east-1.amazonaws.com
    Port 24224
    Retry_Limit false

Build and push the image to ECR

However, does this mean the fluent-bit cannot get the log configuration key & value from FireLens for updating the output plugin? For example, in order to connect to the external Fluentd, I add 'Host' and 'Port' to the FireLens log configuration in the sidecar application task definition [1]. In conclusion, I would like to append the Mem_Buf_Limit to the input plugin instead of overwriting the whole fluent-bit.conf. Is it still ok to do it with your solution? Thank you very much!

[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_firelens.html

PettitWesley commented 4 years ago

@chikinchoi

I would like to append the Mem_Buf_Limit to the input plugin instead of overwriting the whole fluent-bit.conf. Is it still ok to do it with your solution?

At the moment, this is not possible. My workaround is the only solution.

However, does this mean the fluent-bit cannot get the log configuration key & value from FireLens for updating the output plugin?

Yes, all configuration must be in the custom Fluent Bit configuration file that you add. You cannot use the logConfiguration options; specify awsfirelens as your log driver without any options.

PettitWesley commented 4 years ago

I think a possibly good alternative to fixing this in FireLens would be a Service section configuration for Fluent Bit that governs the max memory used for buffering by all inputs.

chikinchoi commented 4 years ago

Hi @PettitWesley ,

Is there any update on this? I am still using a customized Fluent Bit image and a customized Fluent Bit config file in order to control the max buffer limit.

PettitWesley commented 4 years ago

The detailed blog on the workaround has been published: https://aws.amazon.com/blogs/containers/how-to-set-fluentd-and-fluent-bit-input-parameters-in-firelens/

Other than that we don't have an updated ETA on this feature at this time.

psyhomb commented 3 years ago

I think a possibly good alternative to fixing this in FireLens would be a Service section configuration for Fluent Bit that governs the max memory used for buffering by all inputs.

Is this option (Mem_Buf_Limit) already supported on a Service level, or is this just a proposal?

psyhomb commented 3 years ago

Is this option (Mem_Buf_Limit) already supported on a Service level, or is this just a proposal?

It looks like the answer is NO, it is not supported on a Service level. 😞 https://github.com/fluent/fluent-bit/blob/master/src/flb_config.c#L50-L124

psyhomb commented 3 years ago

One issue with this approach of building a custom Docker image is the inability to generate the dynamic records (enable-ecs-log-metadata) that are otherwise generated at runtime by the ECS agent.

Example:

[FILTER]
    Name record_modifier
    Match *
    Record ec2_instance_id i-032ebfbaab58b3ddd
    Record ecs_cluster cluster-1
    Record ecs_task_arn arn:aws:ecs:region:xxxxxxxxxxxxx:task/449b5079-1602-4489-9051-99fb5daeffff
    Record ecs_task_definition service-name:1

UPDATE:

I've devised a way to pass these dynamic records to FluentBit. With this solution, you can also pass FluentBit configuration parameters via environment variables.

entrypoint.sh

#!/bin/bash

### Fluent Bit configuration parameters (defaults)
## Service section
export FLB_SERVICE_FLUSH=${FLB_SERVICE_FLUSH:-"1"}
export FLB_SERVICE_GRACE=${FLB_SERVICE_GRACE:-"30"}
export FLB_SERVICE_LOG_LEVEL=${FLB_SERVICE_LOG_LEVEL:-"info"}
## Input section
export FLB_INPUT_MEM_BUF_LIMIT=${FLB_INPUT_MEM_BUF_LIMIT:-"100MB"}

### Collect EC2 and ECS metadata
export EC2_INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Quote the expansions so the metadata URL and JSON are not word-split or glob-expanded
export ECS_METADATA=$(curl -s "${ECS_CONTAINER_METADATA_URI_V4}")
export ECS_CLUSTER=$(echo "${ECS_METADATA}" | python -c "import json, sys; print(json.load(sys.stdin)['Labels']['com.amazonaws.ecs.cluster'])")
export ECS_TASK_ARN=$(echo "${ECS_METADATA}" | python -c "import json, sys; print(json.load(sys.stdin)['Labels']['com.amazonaws.ecs.task-arn'])")
export ECS_TASK_DEFINITION_FAMILY=$(echo "${ECS_METADATA}" | python -c "import json, sys; print(json.load(sys.stdin)['Labels']['com.amazonaws.ecs.task-definition-family'])")
export ECS_TASK_DEFINITION_VERSION=$(echo "${ECS_METADATA}" | python -c "import json, sys; print(json.load(sys.stdin)['Labels']['com.amazonaws.ecs.task-definition-version'])")
export ECS_IMAGE_VERSION=$(echo "${ECS_METADATA}" | python -c "import json, sys; print(json.load(sys.stdin)['Image'].split(':')[-1])")
export ECS_TASK_DEFINITION="${ECS_TASK_DEFINITION_FAMILY}:${ECS_TASK_DEFINITION_VERSION}"

echo "AWS for Fluent Bit Container Image Version ${ECS_IMAGE_VERSION}"
exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/alt/fluent-bit.conf
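The python -c calls in the script above all read the same JSON document. As a minimal sketch of that extraction done in one pass (Python 3), the helper below pulls the same keys the script reads. The function name extract_ecs_fields and the sample document are made up for illustration; the sample contains only the keys used above and is not a full metadata response.

```python
import json


def extract_ecs_fields(metadata_json: str) -> dict:
    """Return the values the entrypoint exports, parsed from container metadata JSON."""
    metadata = json.loads(metadata_json)
    labels = metadata["Labels"]
    family = labels["com.amazonaws.ecs.task-definition-family"]
    version = labels["com.amazonaws.ecs.task-definition-version"]
    return {
        "ECS_CLUSTER": labels["com.amazonaws.ecs.cluster"],
        "ECS_TASK_ARN": labels["com.amazonaws.ecs.task-arn"],
        "ECS_TASK_DEFINITION": f"{family}:{version}",
        # The image tag is whatever follows the last ':' in the image name
        "ECS_IMAGE_VERSION": metadata["Image"].split(":")[-1],
    }


# Hypothetical sample holding only the keys the script reads
sample = json.dumps({
    "Image": "amazon/aws-for-fluent-bit:2.16.1",
    "Labels": {
        "com.amazonaws.ecs.cluster": "cluster-1",
        "com.amazonaws.ecs.task-arn": "arn:aws:ecs:us-east-1:111122223333:task/example",
        "com.amazonaws.ecs.task-definition-family": "service-name",
        "com.amazonaws.ecs.task-definition-version": "1",
    },
})

fields = extract_ecs_fields(sample)
for name, value in fields.items():
    print(f"{name}={value}")
```

A helper like this only computes the values; the entrypoint would still have to export them as environment variables before starting Fluent Bit.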

Full example of FluentBit configuration file.

fluent-bit.conf

[SERVICE]
    Flush               ${FLB_SERVICE_FLUSH}
    Grace               ${FLB_SERVICE_GRACE}
    log_Level           ${FLB_SERVICE_LOG_LEVEL}

[INPUT]
    Name                forward
    unix_path           /var/run/fluent.sock
    Mem_Buf_Limit       ${FLB_INPUT_MEM_BUF_LIMIT}

[INPUT]
    Name                forward
    Listen              0.0.0.0
    Port                24224
    Mem_Buf_Limit       ${FLB_INPUT_MEM_BUF_LIMIT}

[INPUT]
    Name                tcp
    Tag                 firelens-healthcheck
    Listen              127.0.0.1
    Port                8877
    Mem_Buf_Limit       ${FLB_INPUT_MEM_BUF_LIMIT}

[FILTER]
    Name                record_modifier
    Match               *
    Record              ec2_instance_id ${EC2_INSTANCE_ID}
    Record              ecs_cluster ${ECS_CLUSTER}
    Record              ecs_task_arn ${ECS_TASK_ARN}
    Record              ecs_task_definition ${ECS_TASK_DEFINITION}

[OUTPUT]
    Name                cloudwatch_logs
    Match               *
    region              us-east-1
    log_group_name      /aws/ecs/${ENV_NAME}/${SERVICE_NAME}
    log_stream_prefix   ${LOG_STREAM_PREFIX}-
    auto_create_group   true
    #log_key             log

[OUTPUT]
    Name                 null
    Match                firelens-healthcheck

Dockerfile

ARG DOCKER_BASE_IMAGE
FROM ${DOCKER_BASE_IMAGE}

COPY entrypoint.sh /
COPY fluent-bit.conf /fluent-bit/alt/fluent-bit.conf

CMD ["/bin/bash", "-c", "/entrypoint.sh"]

Build the custom Docker image by executing:

docker build --no-cache --build-arg DOCKER_BASE_IMAGE=amazon/aws-for-fluent-bit:2.16.1 -t xxxxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/custom-fluent-bit:2.16.1-1.0.0 .

ECS task definition file:

{
  "containerDefinitions": [
    {
      "cpu": 128,
      "environment": [
        {
          "name": "SERVICE_NAME",
          "value": "my-service"
        },
        {
          "name": "ENV_NAME",
          "value": "test"
        }
      ],
      "essential": true,
      "image": "xxxxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/my-service:latest",
      "logConfiguration": {
        "logDriver": "awsfirelens"
      },
      "linuxParameters": {
        "initProcessEnabled": true
      },
      "memory": 256,
      "name": "my-service",
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 0,
          "protocol": "tcp"
        }
      ],
      "volumesFrom": []
    },
    {
      "environment": [
        {
          "name": "SERVICE_NAME",
          "value": "my-service"
        },
        {
          "name": "ENV_NAME",
          "value": "test"
        },
        {
          "name": "LOG_STREAM_PREFIX",
          "value": "test"
        },
        {
          "name": "FLB_SERVICE_LOG_LEVEL",
          "value": "info"
        },
        {
          "name": "FLB_INPUT_MEM_BUF_LIMIT",
          "value": "100MB"
        }
      ],
      "essential": true,
      "image": "xxxxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/custom-fluent-bit:2.16.1-1.0.0",
      "name": "log_router",
      "firelensConfiguration": {
        "type": "fluentbit"
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/aws/ecs/test/my-service",
          "awslogs-stream-prefix": "test-firelens",
          "awslogs-region": "us-east-1"
        }
      },
      "memoryReservation": 100
    }
  ],
  "family": "test-my-service",
  "executionRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/ecs-firelens-execution-role",
  "placementConstraints": []
}

zhonghui12 commented 3 years ago

The feature is now released in the EC2 Agent: https://github.com/aws/amazon-ecs-agent/releases/tag/v1.55.0. On the Fargate side, we will continue working on it so that it is supported soon.

Note: This is actually the solution for another request: https://github.com/aws/containers-roadmap/issues/1484. Sorry for the wrong information.

zhonghui12 commented 3 years ago

Hello all, we are working on this feature and would like to gather some real user data here. What real-world values would you expect to set for this? We are considering setting 256 MB of memory as the max value in Fargate and wonder if that would work for your use case. Please leave a comment here if possible. Thanks!

Note: On the EC2 Agent, it has been released and has no limit right now.

Note: This is actually part of another request: https://github.com/aws/containers-roadmap/issues/1484. Sorry for the wrong information.

zhonghui12 commented 3 years ago

Sorry for the confusion above. The release update above is for a Fluentd log driver option, so it is not related to this request. For this request, mem_buf_limit is a Fluent Bit config option. I've opened a new issue to track the request I am working on: https://github.com/aws/containers-roadmap/issues/1484.

Thanks for your understanding.

farazhv commented 2 years ago

In this AWS doc, the option log-driver-buffer-limit is used with the image aws-for-fluent-bit:stable. It says it's a Fluentd buffer limit, but it uses the Fluent Bit image. Does this option internally use mem_buf_limit?

farazhv commented 2 years ago

@PettitWesley Can we use the Throttle filter in fluent-bit if we set the retry limit to no_retries since that would be a simpler alternative to implement than creating a custom docker image?

PettitWesley commented 2 years ago

@farazhv the log driver buffer limit is entirely different. Mem_Buf_Limit is for the buffer inside of Fluent Bit. The log driver buffer is before that. In FireLens, there are a series of buffers: https://github.com/aws-samples/amazon-ecs-firelens-under-the-hood/blob/mainline/generated-configs/fluent-bit/generated_by_firelens.conf

app stdout/stderr => container runtime buffer (1) => fluentd log driver buffer (2) => Fluent Bit forward input => Fluent Bit internal buffer (3) => log destination

log-driver-buffer-limit is #2 there. We don't support configuring #1 right now.

https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/log-driver-buffer-limit

PettitWesley commented 2 years ago

@farazhv

Can we use the Throttle filter in fluent-bit if we set the retry limit to no_retries since that would be a simpler alternative to implement than creating a custom docker image?

I am not sure what you are looking for here. What is your use case/goal?

PettitWesley commented 2 years ago

For the feature request in this issue, I am now thinking we may just contribute this instead: https://github.com/fluent/fluent-bit/discussions/5711

farazhv commented 2 years ago

@farazhv

Can we use the Throttle filter in fluent-bit if we set the retry limit to no_retries since that would be a simpler alternative to implement than creating a custom docker image?

I am not sure what you are looking for here. What is your use case/goal?

Firelens runs as a sidecar to the application container in a Fargate task. Logs emitted to stdout by the application container are sent by Firelens to CloudWatch and DataDog. My goal is to prevent an OOM from taking down the task. It is acceptable to lose application logs if I have a guarantee the task won’t go down due to an OOM in Firelens.

PettitWesley commented 2 years ago

@farazhv I think what you want is this tutorial: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention

Also remember that you can set the FireLens container as non-essential so that if it fails then it won't take down the task.

Also check out our health check guide: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/health-check

farazhv commented 2 years ago

FluentBit has a limited set of buffers constrained by its memory. With memory available, it will output the log messages it receives. However, if it isn't able to output messages or if the inflow of messages is very high, then the buffers in it will reach capacity resulting in an OOM crash.

If I revise my original use case to introduce resilience, such that FluentBit drops messages in case it reaches a hard memory limit, would it be safe to configure the throttle filter and disable retries?

This is based on the assumption that the throttle filter will prevent the buffers from reaching capacity if the inflow rate is very high, and disabling retries will prevent them from reaching capacity if the output is unsuccessful.

PettitWesley commented 2 years ago

would it be safe to configure the throttle filter and disable retries?

@farazhv I think this would work. I have not actually tested it or seen it used in production though, so this is a hypothesis, not something that is proven. But the thinking makes sense: if Fluent Bit, via the throttle filter, is limited in the rate at which it can accept logs, and no retries are configured, meaning that issues at the output do not really lead to backpressure, then the memory usage should not be able to increase much.
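As a sketch of what that untested idea could look like in a Fluent Bit config: the Rate, Window, and Interval values below are arbitrary placeholders, and the output reuses the environment variable names from the tutorial earlier in this thread. Treat it as a starting point for experimentation, not a verified OOM fix.

```
[FILTER]
    Name     throttle
    Match    *
    # Placeholder values: allow an average of 1000 records per 1s interval,
    # averaged over a window of 5 intervals; records above that rate are dropped
    Rate     1000
    Window   5
    Interval 1s

[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            ${LOG_REGION}
    log_group_name    ${LOG_GROUP}
    log_stream_prefix ${STREAM_PREFIX}
    # Do not schedule retries, so failed chunks are dropped instead of held in memory
    Retry_Limit       no_retries
```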

adrian-skybaker commented 2 years ago

UPDATE: We will likely work with Fluent Bit upstream to contribute this instead: https://github.com/fluent/fluent-bit/discussions/5711

Fluentbit already supports Mem_Buf_Limit on INPUT today.

Rather than extend fluentbit itself to have a new parameter, it seems like the simpler gap to plug is that it's not possible to control the [INPUT] that firelens generates, both for this setting and others ("We should determine if there are any other input options as well which might need to be configured.").

This limitation also rears its head with the init process for Fluent Bit (https://github.com/aws/aws-for-fluent-bit/blob/mainline/use_cases/init-process-for-fluent-bit/README.md): you can import additional fluentbit config files from S3, but as far as I can see you're stuck with controlling the generated INPUT via firelens options.

PettitWesley commented 2 years ago

@adrian-skybaker Which input settings are you interested in setting? Which are highest priority? Which are critical?

IMO, the critical [INPUT] are just Mem_Buf_Limit and storage.type. If we had a global [SERVICE] level settings for those, then that should satisfy most use cases for setting input parameters I think.

CC @matthewfala

adrian-skybaker commented 2 years ago

IMO, the critical [INPUT] are just Mem_Buf_Limit and storage.type. If we had a global [SERVICE] level settings for those, then that should satisfy most use cases for setting input parameters I think.

Yes I think so. Mem_Buf_Limit is the one I'm trying to set at the moment (as part of trying to retire a custom firelens image). We don't set any others currently, but I'm only one data point : )

However... it still seems unfortunate that if you want absolute control over this input, the only choice will still be a custom image, even with the new init process that allows supplementary config from S3 includes.

Perhaps ultimately it's just that I'm trying to work around the lack of an S3 custom config source for Fargate, but an option to completely suppress this INPUT would mean I could redeclare it myself with full control.

adrian-skybaker commented 2 years ago

If we had a global [SERVICE] level settings for those, then that should satisfy most use cases for setting input parameters I think.

That's all well and good, but fluentbit doesn't support that today. Whereas it does support setting these on INPUT, it's just not controllable via firelens. You can also imagine a scenario where even if it was supported, I might want a specific value set for stdin, but a different service level setting (eg for several other tail inputs).

But perhaps this all comes back to the same limitation that once you want to have control over this input, you have to stick with a custom image (which is a very clunky way of passing some .conf files).

PettitWesley commented 2 years ago

@adrian-skybaker I agree, the ideal solution is that you could pass arbitrary parameters to the generated input. Even in that case, I still think that for Fluent Bit in general there are user experience reasons to have global [SERVICE] level settings for these. A lot of our users want a simpler experience for configuring logging and I think both in EKS and ECS that a single Fluent Bit setting for the type and size of all buffers would be convenient.

adrian-skybaker commented 2 years ago

A lot of our users want a simpler experience for configuring logging

Yes I agree.

This feedback is a bit off-topic for this issue, but after using https://github.com/aws/aws-for-fluent-bit/blob/mainline/use_cases/init-process-for-fluent-bit/README.md for a day or so, my view is that you end up with quite a messy hybrid trying to combine firelens-config-generated fluentbit conf with hand-crafted fluentbit conf, with several surprises and limitations.

IMO a simpler mental model is either 100% firelens config, with higher level options like the global buffer setting you mention, existing streamlined cloudwatch etc, or 100% self-managed fluentbit conf files (supported by some niceties like helpful plugins and env vars being available). Of course the latter is available today, it just requires your own container.

rnlduaeo commented 1 year ago

Any updates on custom config for ECS on Fargate? I tried creating a custom FireLens image to prevent OOM kills (setting the storage type to filesystem) and using it with my application container. Logs should be forwarded to Kinesis streams. However, after setting up the custom config, the logs are not sent anywhere, even though the task starts successfully. When I went back to the default config with the original FireLens image, logs were successfully forwarded to Kinesis streams. Any idea how to debug this?

PettitWesley commented 1 year ago

@rnlduaeo Can you please fully describe your issue and submit task def, Fluent bit config, and logs to an issue here: https://github.com/aws/aws-for-fluent-bit/issues

PettitWesley commented 1 year ago

For others, here is our oomkill guide: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention

This guide explains all of the different buffer settings, and ways they can be used. Please read.

acm19 commented 9 months ago

Hi @PettitWesley, thank you for sharing this guide. It looks like it basically highlights the importance of being able to at least configure Mem_Buf_Limit and storage.type for a production-ready setup. The only way to do that is to manage your own image, which adds complexity and somewhat defeats the purpose of using FireLens. You also might need to add ec2_instance_id, ecs_cluster, ecs_task_arn and ecs_task_definition to your custom filters, probably via env vars.

My point is, not allowing these settings at the input level makes this solution not production ready, as we end up with 3 options if we want to ship logs from Fargate:

Run Firelens as designed and be exposed to OOM, log loss, etc.
Limit the memory of the shipper and mark the container as non-essential; I guess it will then need manual intervention to fix it, and there will also be log loss.
Manage our own images, increasing complexity and management burden, and give up all the documentation and standardisation around Firelens.

I understand that it is a pain to allow certain configs in the INPUT section and not others, so I wonder: have you considered allowing users to override the whole INPUT? What I mean is, the FirelensConfiguration section could have an extra flag similar to enable-ecs-log-metadata, something like override-input: true. In that case, you wouldn't include the INPUT section in the generated fluent.conf but would let the user define it in the config-file-value file.

PettitWesley commented 8 months ago

@acm19 I understand the difficulties you are facing.

The simplest and fastest solution to implement for this problem (which also means it has the best chance of actually getting released) would be to modify the ECS init AWS for Fluent Bit image: https://github.com/aws/aws-for-fluent-bit/tree/develop/use_cases/init-process-for-fluent-bit

It's an image that we vend; it's just another tag, as explained in that link. Please also view the 3 examples we have for it here:

https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline?tab=readme-ov-file#aws-for-fluent-bit-init-tag-examples

The init tag allows you to specify additional configuration files as env vars in your task definition. The image will use these config files, which can be built into the image, mounted into the container at runtime, or pulled from S3. Init also supports injecting ECS metadata as an env var, so you get that support.

Currently, the init tag will also always include the main FireLens generated input config file: https://github.com/aws/aws-for-fluent-bit/blob/develop/init/fluent_bit_init_process.go#L388

I'm thinking I could simply add a new env var, like aws_fluent_bit_init_ignore_firelens_config, which will make it ignore the main config file. With that, you'd then have the ability to fully control your config and still inject ECS metadata.

What do you think?

PettitWesley commented 8 months ago

Should the value of the aws_fluent_bit_init_ignore_firelens_config var matter?

Options:

The presence of the env var disables including the generated firelens config, and its value doesn't matter. You can choose On, Off, true, True, False, etc. and it will still turn on. Folks might be confused if they try to disable it by setting it to false. It's also convenient, though, that you don't have to specify a specific string like true.
The name and the value must match. The values can be true, or on, case insensitive.
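To make option 2 concrete, here is a minimal sketch of the check in Python. The init process itself is written in Go, and the env var is only a proposal at this point, so the function name and accepted-value set below are illustrative assumptions rather than the actual implementation.

```python
import os

# Proposed env var name from this thread; not an existing feature
ENV_VAR = "aws_fluent_bit_init_ignore_firelens_config"

# Option 2: only these values (case insensitive) enable the behavior
TRUTHY_VALUES = {"true", "on"}


def ignore_firelens_config(env=None) -> bool:
    """Return True only if the variable is set to 'true' or 'on', in any case."""
    env = os.environ if env is None else env
    value = env.get(ENV_VAR, "")
    return value.strip().lower() in TRUTHY_VALUES


print(ignore_firelens_config({ENV_VAR: "On"}))
print(ignore_firelens_config({ENV_VAR: "false"}))
```

Under option 1 the check would instead be whether the variable is present at all, which is why setting it to false would still turn the behavior on.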

acm19 commented 8 months ago

Hi @PettitWesley, thank you for your reply. I see some major disadvantages of using this image in a production environment:

It's big compared to the fluentbit image, and it installs AWS plugins that we might not need. I ran some tests a few months back and the AWS vended versions would consume roughly twice as much memory for applications with very low log generation (not rigorous testing).
The security surface is also bigger if we use this image.
The Fluentbit version doesn't keep up to date with the latest version.
It's hard to keep the tag up to date automatically (via Renovate for example); maybe there's a parameter store with the latest version. But ideally, changes should be reflected in commits.

So, even though it'd be very useful for testing configurations quickly, I'd probably consider it more sensible to manage a custom image for production applications.

Should the value of the aws_fluent_bit_init_ignore_firelens_config var matter?

I personally prefer option 2.
