aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS Fargate Logging] [request]: Document Message Line Limits and Add Support for >16kb #1317

Open carusology opened 3 years ago

carusology commented 3 years ago


Tell us about your request

EKS Fargate Logging currently appears to support a maximum length of 16kb per logged line. I request that this maximum length:

  1. Be documented in a way that is both accurate and updated as it changes over time.
  2. Be configurable to a value larger than 16kb so that large JSON messages do not get split into two or more messages. A maximum of 32kb or 64kb would seem more appropriate.

Which service(s) is this request for?

This is for EKS Fargate Logging. Specifically, the behavior described in this document.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I have a Spring Boot service written in Kotlin, and I wanted to log its output in a "json lines" format (one JSON object per line) by leveraging the common JsonLayout Log4j configuration. When exceptions are thrown and logged within the service, the stack trace is usually large enough that the resulting block of JSON exceeds 16kb. The EKS Fargate Logging worker splits this message in two, leaving two strings that cannot be parsed as JSON and preventing the message from being filtered downstream in a log viewing tool such as CloudWatch or Kibana. It's hard to find these logs within logging tools because you can't filter on their content via well-formed fields, since the JSON never got parsed. Even if you do find the message, you have to manually stitch the pieces back together to find out what happened.

Are you currently working around this issue?

I swapped our log4j configuration from JsonLayout to JsonTemplateLayout. The latter has a configurable maxStringLength attribute and can "stringify" stack traces so they get emitted as a single string. When I set maxStringLength to 10000 and configure stack traces with stringified: true, the stack traces are truncated before they become large enough to trigger the splitting behavior. Since none of the other fields seem to total more than ~6000 characters combined, the splitting of large messages has stopped.
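
For illustration, a minimal, untested log4j2.xml sketch of that workaround is below. The appender, template URI, and field name are assumptions on my part; the two relevant knobs are maxStringLength on the layout and the stringified stack trace resolver.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <!-- maxStringLength caps any single string field (notably the stringified
           stack trace) so the emitted JSON line stays well under 16kb. -->
      <JsonTemplateLayout eventTemplateUri="classpath:JsonLayout.json"
                          maxStringLength="10000">
        <!-- Render the stack trace as one truncatable string instead of an array. -->
        <EventTemplateAdditionalField
            key="stack_trace"
            format="JSON"
            value='{"$resolver": "exception", "field": "stackTrace", "stackTrace": {"stringified": true}}'/>
      </JsonTemplateLayout>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```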

Additional context

According to AWS documentation, EKS Fargate Logging uses Fluent Bit and generates its own [Input] blocks (Source, emphasis mine):

Validation Strategy

The main sections included in a typical Fluent Conf are Service, Input, Filter, and Output. Service and Input are generated by Fargate. Fargate only validates the Filter, Output, and Parser specified in the Fluent Conf. Any sections provided other than Filter, Output, and Parser are ignored.

I believe these messages are running into the Docker daemon's internal/hardcoded 16kb limit for logged messages before it flushes. The Docker maintainers expect log parsing tools, such as Fluent Bit, to stitch these piecemeal messages back together again. Fluent Bit actually has an option to do this within the Input blocks using docker_mode (Source):

Docker Mode Configuration Parameters

Docker mode exists to recombine JSON log lines split by the Docker daemon due to its line length limit. To use this feature, configure the tail plugin with the corresponding parser and then enable Docker mode:

| Key | Description | Default |
| --- | --- | --- |
| Docker_Mode | If enabled, the plugin will recombine split Docker log lines before passing them to any parser as configured above. This mode cannot be used at the same time as Multiline. | Off |

So I'm guessing the Input blocks generated by EKS Fargate Logging do not have docker_mode enabled. Even if it were enabled, we would eventually run into limits around the size of Buffer_Chunk_Size (32kb by default) as well. I have not observed our service generating logs over ~20kb though, so that limit would at least be sufficient for us.
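
For reference, this is roughly what a docker_mode-enabled tail input looks like in a self-managed Fluent Bit; it is not something we can set on Fargate since the [Input] section is generated for us, and the path, tag, and buffer values here are illustrative.

```
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker
    Docker_Mode       On
    # Even with Docker_Mode On, a reassembled line is still bounded by the
    # buffer settings below (32k is the documented default).
    Buffer_Chunk_Size 32k
    Buffer_Max_Size   32k
```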

Attachments

I've attached three things:

  1. An aws-logging.yaml file that maps to the ConfigMap used to parse EKS Fargate logs.
  2. An example.json JSON log file that the service emitted which is over 16kb.
  3. The end result of applying the EKS Fargate ConfigMap to a JSON log over 16kb: the message broken into example-split-first-half.json and example-split-second-half.json.

examples.zip
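
For anyone skimming the thread without downloading the zip, here is a minimal sketch of what such an aws-logging ConfigMap typically looks like. It is illustrative only, not the attached file; the region and log group names are made up, and per the validation strategy quoted above only Filter, Output, and Parser sections are honored.

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name              cloudwatch_logs
        Match             *
        region            us-west-2
        log_group_name    my-fargate-app-logs
        log_stream_prefix fargate-
        auto_create_group true
```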

PettitWesley commented 3 years ago

I believe the length limit comes from Docker. The standard solution is to concatenate the split log lines back into single multiline events using something like fluentd's concat plugin: https://github.com/fluent-plugins-nursery/fluent-plugin-concat
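
For illustration, a typical fluent-plugin-concat filter looks something like the sketch below. This is not something that can be applied inside Fargate's managed pipeline; the match pattern, record key, and regex are assumptions based on one JSON object per line.

```
<filter kube.**>
  @type concat
  key log
  # A new record starts whenever the log field begins a fresh JSON object.
  multiline_start_regexp /^\{/
  # Join the split pieces without inserting anything between them.
  separator ""
</filter>
```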

Fluent Bit, which is used in EKS Fargate, has multiline processing support in the tail plugin, but we do not currently allow customers to customize any input options. I'm not certain, though, whether its multiline support covers this log truncation use case.

carusology commented 3 years ago

I agree that this seems to be from Docker.

As for Fluent Bit configuration, can docker_mode be set to on in these auto-generated input options? That seems like it would solve the problem - at least up to the tail input's buffer size.

PettitWesley commented 3 years ago

@carusology I checked and it looks like we are setting docker_mode to On in the built-in input. So now I am confused by this issue...

I think we're using the default buffer size, which is 32KB. So that should be when you see messages truncated, not at 16 KB...

PettitWesley commented 3 years ago

Just to double-check: how did you check that it's truncating at 16KB?

carusology commented 3 years ago

I checked and it looks like we are setting docker_mode to On in the built-in input. So now I am confused by this issue...

Nuts! I was hoping it was that simple. 😞 I, too, am confused about what is causing this truncation. Without visibility into the source code, that seemed like a probable cause based upon the behavior I was experiencing.

Just to double-check: how did you check that it's truncating at 16KB?

Fair point. No hacked test here - I ran this using EKS Fargate Logging. The output I've included is literally what I got from CloudWatch / Kibana (the latter is downstream of a Kinesis output) when an unhandled exception occurred in a Spring Boot app.

Check out my examples.zip file from the report. You'll see my EKS Fargate Logging ConfigMap and the example.json file that I got from a normal JsonLayout config. I also included what a similar message looks like when it gets split into two. The source example.json message is ~20kb, but it was nonetheless split. The two split halves total 23kb even with all the other decoration Fluent Bit applies via my EKS Fargate Logging configuration.

You could reproduce this by emitting the contents of my example.json file directly from a container whose EKS Fargate Logging configuration has an [Output] to CloudWatch.
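
A minimal repro along those lines might look like the sketch below. It is hypothetical: it assumes example.json is baked into an image at /example.json and that the namespace is matched by a Fargate profile with the logging ConfigMap in place.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-split-repro
  namespace: my-fargate-namespace   # assumed: selected by a Fargate profile
spec:
  restartPolicy: Never
  containers:
    - name: emit
      image: my-registry/example-json:latest   # assumed: image containing /example.json
      # Print the >16kb JSON document as a single line, then idle long enough
      # to inspect what arrives in CloudWatch.
      command: ["sh", "-c", "cat /example.json && sleep 300"]
```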

rsumukha commented 3 years ago

EKS Fargate uses containerd instead of Docker. Setting docker_mode to on won't help, because Fluent Bit then expects each log line to be in Docker's JSON format, whereas containerd writes raw log lines.
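
For context, an illustrative comparison of the two on-disk formats (timestamps and content made up): the Docker json-file driver wraps each chunk in a JSON object, which is what docker_mode knows how to reassemble, while containerd/CRI writes space-delimited lines and marks continuation chunks with a P (partial) flag and the final chunk with F (full), so reassembly needs a CRI-aware parser instead.

```
# Docker json-file driver: each (possibly split) chunk is itself a JSON object.
{"log":"{\"message\":\"...first 16kb of the JSON line...","stream":"stdout","time":"2021-06-01T00:00:00.000000000Z"}

# containerd / CRI: raw lines prefixed with timestamp, stream, and a P/F flag.
2021-06-01T00:00:00.000000000Z stdout P {"message":"...first chunk of the JSON line...
2021-06-01T00:00:00.000000001Z stdout F ...remainder of the JSON line...}
```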