aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

Multiline log guidance #100

Closed HaroonSaid closed 1 year ago

HaroonSaid commented 4 years ago

We have the following configuration

{
      "essential": true,
      "name": "log_router",
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true"
        }
      },
      "memoryReservation": 50,
      "image": "906394416424.dkr.ecr.${AWS_REGION}.amazonaws.com/aws-for-fluent-bit:latest"
    },

We want to have multiline logs for stack traces etc.
How should I configure Fluent Bit?

PettitWesley commented 3 years ago

Fluent Bit unfortunately does not yet have generic multiline logging support that can be used with FireLens. We are planning to work on it. For now, you must use Fluentd: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluentd/multiline-logs
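For reference, the linked Fluentd example is based on the fluent-plugin-concat filter. A minimal sketch of that approach looks like the following; the firstline regex is an assumption about your log format, not something from the linked sample:

```
<filter **>
  @type concat
  key log
  # Assumption: a new record starts with a timestamp; all other lines
  # are appended to the previous record.
  multiline_start_regexp /^\d{4}-\d{2}-\d{2}/
  flush_interval 5
</filter>
```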

belangovan commented 3 years ago

@zhonghui12, @PettitWesley we are using a FireLens configuration with aws-for-fluent-bit for multi-destination log routing, with CloudWatch as one of the destinations. We need multiline log grouping to make the most of our logs. If there is any custom parser we can use to achieve it, that would also be fine.

PettitWesley commented 3 years ago

@belangovan there has been no change in guidance since my last comment on this issue. Fluent Bit still only has multiline support that works when tailing a log file. It does not have generic multiline support that works with FireLens. We are planning to work on that some time in the next few months. Until then, you have to use Fluentd for multiline.

HaroonSaid commented 3 years ago

Is this a feature you are planning to work on soon?

Just want to know how to plan for our organization.

Do we switch to Fluentd or wait? If we wait, how long?

PettitWesley commented 3 years ago

@HaroonSaid We have begun investigation for this project. We hope to get it launched within 2 months, however, there are no guarantees.

corleyma commented 3 years ago

@PettitWesley Any updates re: whether this project is launching as intended? Debating whether we have to change an internal logging system to support fluentd or if we can wait for fluentbit multiline support to land.

PettitWesley commented 3 years ago

@corleyma The upstream maintainers are working on it apparently- I've been told that it should be ready/launched sometime in May.

silvervest commented 3 years ago

@PettitWesley May... this year? Any update on this? It'd be a very useful feature for us.

PettitWesley commented 3 years ago

@silvervest Yeah it was supposed to be May of this year. Progress has been made upstream but the launch is delayed till sometime in June.

PettitWesley commented 3 years ago

This is launching very soon: https://github.com/fluent/fluent-bit/issues/337#issuecomment-882953961

aaronrl95 commented 3 years ago

Just to clarify, is the multi-line support now available for use in this image? Or are we still awaiting that implementation?

hossain-rayhan commented 3 years ago

Hi @aaronrl95, it was included in v2.18.0.

aaronrl95 commented 3 years ago

Ah great, thank you. Could you point me to the documentation around implementing that feature in our firelens configuration? I'm struggling to find any

hossain-rayhan commented 3 years ago

You can follow this Firelens example.
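For anyone else landing here, the linked example boils down to pointing the multiline filter at the log key in the extra config file. A sketch; the built-in java multiline parser ships with recent Fluent Bit versions, and you would swap in your own parser name as needed:

```
[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      java
```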

aaronrl95 commented 3 years ago

@hossain-rayhan thank you for that, that's just what I'm looking for

vinaykrish25aws commented 3 years ago

@hossain-rayhan Is this solution also applicable to JSON-format logs produced by a Docker container?

hossain-rayhan commented 3 years ago

> @hossain-rayhan Does this solution also applicable for JSON format logs produced by Docker container ?

@zhonghui12 or @PettitWesley can you answer this?

zhonghui12 commented 3 years ago

> @hossain-rayhan Does this solution also applicable for JSON format logs produced by Docker container ?

> @zhonghui12 or @PettitWesley can you answer this?

I assume that if the JSON-format logs are split into multiple lines, they can be concatenated, as there is no obvious limit here: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace. But maybe @PettitWesley can give a more certain answer.

Or maybe we should help to test it out.

StasKolodyuk commented 3 years ago

@hossain-rayhan @zhonghui12 @PettitWesley hi guys, I've been trying to use multiline support to concat partial messages split by containerd (AWS Fargate), but it didn't work. I've been using the approach described by @hossain-rayhan with the following config:

[SERVICE]
    Flush 1
    Grace 30
    Log_Level debug

[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      cri, docker

Could you please take a look, thanks!

More details on my setup and what I'm trying to achieve: I have a Spring Boot app that logs to stdout using logstash-logback-encoder in JSON format (one JSON log entry per line). There's a JSON field called "stack_trace" that may be very long. When a log line is longer than 16k characters (which usually happens for a stack trace), containerd (AWS Fargate 1.4 runtime) splits it into several parts. Fluent Bit then receives those JSON parts. At that point I'd like Fluent Bit to merge them and parse them as JSON. However, as I said, this is what I can't get working right now.

PettitWesley commented 3 years ago

@StasKolodyuk you need to create a custom multiline parser, I think. I don't know exactly how to solve this use case with the new multiline support, but I suspect it should be possible with a custom parser and a custom regex.

https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace

PettitWesley commented 3 years ago

@vinaykrish25aws Yes, the new filter will work with JSON logs from Docker. In that case, the log content is in the log key, and you specify that key in the filter:

    multiline.key_content log

If the content of that key is itself nested JSON that needs to be recombined, that's a more complicated use case which might need a custom parser and/or additional parsing steps.
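For such cases, a custom parser can be declared and referenced from the filter. A sketch under the assumption that records start with a timestamp and continuation lines are stack frames or "Caused by:" lines; the regexes are illustrative, not from this thread:

```
[MULTILINE_PARSER]
    name          multiline-regex-test
    type          regex
    flush_timeout 1000
    # rule <state>  <regex>  <next-state>
    rule   "start_state"  "/^\d{4}-\d{2}-\d{2}/"     "cont"
    rule   "cont"         "/^(\s+at\s|Caused by:)/"  "cont"

[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      multiline-regex-test
```

The MULTILINE_PARSER section belongs in a parsers file loaded via parsers_file in the SERVICE section; the filter goes in the main config.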

opteemister commented 3 years ago

Hi, I have a similar problem. We also have JSON logs split by Docker, running on an AWS Fargate cluster. I don't think the JSON really matters here, because it is just a string. But even with the multiline filter, Fluent Bit can't concatenate such logs. I double-checked that our logs have the log key and that all configuration is the same as in the documentation.

shijupaul commented 3 years ago

The following configuration is not working for me; it does not merge a Java stack trace into a single entry. Any thoughts?

Dockerfile (screenshot)

parsers_multiline.conf (screenshot)

extra.conf (screenshot)

Section from Task Definition (screenshot)

PettitWesley commented 3 years ago

@shijupaul Unfortunately, since this feature is new, we are still learning and understanding it ourselves, and there are very few working examples, so right now everyone is figuring it out.

So if you or anyone in this thread gets a working example for a use case you think is reasonably common, please do share it; it will benefit the community. I'm also slowly improving our FireLens/Fluent Bit FAQ and examples, and this data can be used for that.

Can you share what these Java stack traces look like? I also recommend that you (and everyone) test your own logs against the regular expressions you write in the multiline parser using the Rubular website: https://rubular.com/

If the regexes don't match your logs there, then that's the problem. That should be your first debugging step.
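The same sanity check can be done offline with a few lines of Python. This is a sketch of what the start-state rule should accomplish against a typical Java trace; the timestamp format and sample lines are assumptions about the log format, not data from this thread:

```python
import re

# Assumption: records start with "YYYY-MM-DD HH:MM:SS"; exception messages,
# "at ..." frames, and "Caused by:" lines simply fail to match and so are
# treated as continuations of the current record.
START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

sample = [
    "2021-08-18 17:30:59 ERROR Something failed",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Service.run(Service.java:42)",
    "Caused by: java.io.IOException: disk",
    "    at com.example.Disk.read(Disk.java:7)",
    "2021-08-18 17:31:00 INFO next record",
]

# Fold lines into records: a START match opens a new record, everything
# else is appended to the current one (what the multiline filter should do).
records, current = [], []
for line in sample:
    if START.match(line) and current:
        records.append("\n".join(current))
        current = []
    current.append(line)
if current:
    records.append("\n".join(current))

print(len(records))  # 2: the whole stack trace folds into the first record
```

If this grouping looks wrong for your own sample lines, the regex is the problem, exactly as suggested above.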

lbunschoten commented 3 years ago

Hello 👋 I thought I'd share my attempts here as well, as they might be useful to someone. I've been trying to get this to work for a couple of days now, so far without any luck. I have a pretty much identical setup to @shijupaul (minus the grep filter). I've been playing around with these regexes quite a bit, but they don't seem to have any effect at all: even if I put in a regex like /.*/ for both rules, there's no difference in the end result. To be honest, I'm getting the feeling the problem is elsewhere.

To verify my hypothesis, I have been trying a couple of things:

I also ran it locally using fluent-bit -c multiline-parser.conf. I tried to mimic the fargate config, but used a tail input instead:

[SERVICE]
    Parsers_File parsers.conf
    Flush 1
    Grace 30

[INPUT]
    name              tail
    path              log.txt
    read_from_head    true

[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      multiline-regex-test

[FILTER]
    Name                  parser
    Match                 *
    Key_Name              log
    Parser                json
    Reserve_Data          True

[OUTPUT]
    name                  stdout
    match                 *

The interesting thing is that there I do see an effect: I can see multiple log lines being combined. I have a couple of theories now:

Any tips or tricks are appreciated! In the meantime, I'll keep debugging

f0o commented 3 years ago

@lbunschoten @PettitWesley This is what I experienced as well...

Correct me if I'm wrong, but I believe the issue is the source of the logs: our images only receive them as forwarded messages from the emitter (https://github.com/aws/aws-for-fluent-bit/blob/mainline/fluent-bit.conf#L1-L4).

This might make it pointless to try to concatenate via metadata (like CRI's logtag or Docker's partial_message), because those could be filtered out or simply not forwarded to us in the first place.

That would match the behavior we've seen here.

lbunschoten commented 3 years ago

Yeah, I can see how those CRI and docker metadata options might have problems with a broken JSON structure, but you would still expect the regex solution to work, right?

I am going to try to use this forward plugin locally as well now, to see if it may be related to the input

f0o commented 3 years ago

@lbunschoten to be honest I have no idea how the regex matching is supposed to work, but I agree that it should work with regexes.

As for the missing metadata it could be related to https://github.com/fluent/fluent-bit/issues/1072 - I mean, if that's even the case and if they even use fluentbit at the core-level...

At this stage it's just a lot of speculation and assumptions and I feel like we're all just wildly testing out things blindly to hope for a different result haha

f0o commented 3 years ago

@PettitWesley according to https://github.com/aws-samples/amazon-ecs-firelens-under-the-hood/blob/mainline/generated-configs/fluent-bit/README.md the logs come from the Docker fluentd log driver, but Fargate doesn't use Docker anymore:

> One of the changes we are introducing in platform version 1.4 is replacing Docker Engine with Containerd as Fargate's container execution engine.

https://aws.amazon.com/blogs/containers/under-the-hood-fargate-data-plane/

So what's being used there?

opteemister commented 3 years ago

@lbunschoten I had a very similar experience and similar thoughts. Here is a related comment in a different thread.

So Fargate -> aws-for-fluent-bit -> some output does not concatenate logs, even with a /.*/ pattern. But Fargate -> aws-for-fluent-bit -> another fluent-bit (with the same version and configs) -> some output concatenates all logs.

I was also able to see the raw logs coming from Fargate 1.4 via FireLens: I just forwarded the raw logs to the output in the aws-for-fluent-bit config. The log key was there, so from that standpoint everything looks fine, but I'm probably missing something.

lbunschoten commented 3 years ago

I just tried to get it to work locally using the forward plugin, but I didn't manage :( If I swapped the forward plugin for the tail plugin, it worked just fine, so I think it's the combination of the forward input with the multiline parser, but it might also just be my limited knowledge of Fluent Bit.

@opteemister It's very interesting (and odd) that it did work when you ran fluent-bit twice. Do you still happen to have the configs, and would you mind sharing them?

opteemister commented 3 years ago

My configs were pretty much the same as in the multiline parser example and in this comment.

I was just playing with different custom regexp rules, and also tried without parsers_multiline.conf at all, using only the default docker and cri multiline.parsers.

f0o commented 3 years ago

@PettitWesley I tried using a custom fluentd (instead of bit) image but I keep getting:

Stopped reason InternalError: unable to generate fireLens config file: unable to generate fireLens config content: unable to generate fluent config output section: unable to apply log options of container log-split to fireLens config: missing output key @type which i...

It's very annoying that the ECS console cuts off the error right where you need it the most...

But that being said, I did basically https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluentd/multiline-logs.

What's missing? What output key?

Shouldn't the output be generated from:

            LogConfiguration:
              LogDriver: awsfirelens
              Options:
                Name: forward
                Host: My-Fluentd-Host
                Port: "24225"

//EDIT:

Looking at https://github.com/aws/amazon-ecs-agent/blob/master/agent/taskresource/firelens/firelensconfig_unix.go#L226 made me aware that I should call the option @type instead of Name... Confusing, and I couldn't find it in the docs (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_firelens.html), but OK.

//EDIT2: Nope, still the same error...

            LogConfiguration:
              LogDriver: awsfirelens
              Options:
                "@type": forward
                Host: fluentd-host
                Port: "24225"

//EDIT3: Removing the Options from the block and hardcoding an output into the extra.conf for fluentd didn't change anything either, same error... I don't think the parser is smart enough to reconstruct the <server> blocks needed, so I guess @type forward is not supported?

Is there any way to provide it something to keep it quiet and just run my supplied config?

//EDIT4: Nailed it.

            LogConfiguration:
              LogDriver: awsfirelens
              Options:
                "@type": stdout

//EDIT5: fluent/fluentd:latest breaks FireLens because it has a magic entrypoint that creates the user etc., so mounting the socket for fluentd fails with the ECS console error: CannotStartContainerError: ResourceInitializationError: unable to create new container: mount callback failed on /tmp/containerd-mount654170867: no users found

f0o commented 3 years ago

Right, so I can't seem to get Docker or CRI partial tags when I debug through fluentd (piping everything from the socket outwards).

So I guess that's the culprit here...

Maybe @PettitWesley can peek behind the scenes at how the forwarder to FireLens handles the logs, i.e. the process that pushes them into the socket for the FireLens container to read.

lbunschoten commented 3 years ago

I've been able to verify locally what @opteemister said: having two fluent-bit services running in a row does indeed "fix" the concatenation of the logs. That's not really what I'd like to run in production, though (if it is even possible).

Perhaps this gives @PettitWesley a clue as to what the problem might be. No pressure ;)

PettitWesley commented 3 years ago

@f0o

> @PettitWesley according to https://github.com/aws-samples/amazon-ecs-firelens-under-the-hood/blob/mainline/generated-configs/fluent-bit/README.md the logs from the docker fluentd log-driver but Fargate doesnt use Docker anymore

On Fargate we actually still use the docker code in a wrapper: https://github.com/aws/amazon-ecs-shim-loggers-for-containerd

PettitWesley commented 3 years ago

A lot of comments here... someone from my team will take a look.

A reminder that this is how FireLens works: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

PettitWesley commented 3 years ago

@f0o You don't have to specify the log configuration options; you can fully specify the output in the extra config file. Then the log configuration is just:

LogConfiguration:
  LogDriver: awsfirelens

I recommend this style.
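Concretely, that style looks something like the following sketch; the host and port are placeholders, and the output section lives in the extra config file baked into the custom Fluent Bit image:

```
# Task definition log configuration, nothing else needed:
#   LogConfiguration:
#     LogDriver: awsfirelens
#
# extra.conf inside the custom aws-for-fluent-bit image:
[OUTPUT]
    Name   forward
    Match  *
    Host   fluentd-host
    Port   24225
```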

paul5-elsevier commented 3 years ago

@PettitWesley I think something is going wrong when the effective configuration is created by AWS FireLens and applied to the sidecar container.

The Fluent Bit container alone works fine and parses the log correctly, including the stack trace. To test this behaviour I used the following configuration:

Dockerfile (screenshot)

fluent-bit.conf (screenshot)

parsers_multiline.conf (screenshot)

When I run the container and inspect the logs, I can see that the stack trace is processed correctly.

However, with the configuration mentioned in my previous post, under FireLens it isn't: each line in the stack trace gets pushed as a separate entry. See the screenshot below.

(screenshot)

Our application is a standard Spring Boot application and the stack traces it creates are standard; the setup also collects the logs and pushes them to Elasticsearch.

Finally, I got it working with Fluentd: the stack trace is grouped correctly and a single entry is pushed to ES.

PettitWesley commented 3 years ago

@paul5-elsevier Did you create a container image that outputs the test.log file line by line to stdout for use in the FireLens task definition? Can you share its Dockerfile with me, and I'll try to reproduce it myself.

paul5-elsevier commented 3 years ago

@PettitWesley Both @shijupaul and @paul5-elsevier are my accounts.

The test.log file was used to test fluent-bit in isolation. In our deployed environment, our app container writes logs to stdout or stderr and is configured to use Elasticsearch as the destination. (screenshot)

Our sidecar container is configured to use Fluent Bit and has the following configuration. (screenshot)

Let me know if you need any more information.

The changes I have tried are pushed to my fork (https://github.com/paul5-elsevier/amazon-ecs-firelens-examples), under the branch feature/multiline-processing.

marksumm commented 3 years ago

@PettitWesley I agree with @paul5-elsevier: whatever breaks the multiline filter in aws-for-fluent-bit only happens when running in AWS.

The same parser configuration running locally with the latest aws-for-fluent-bit image from ECR works as expected with either the tail or forward inputs. In the latter case, I used a second instance of the aws-for-fluent-bit container to tail a log and output it using the forward protocol. The only time I saw anything similar locally was when I used the head input, which seems to struggle with very long lines, even if the buffer size is increased.

I also configured aws-for-fluent-bit running in AWS to dump the incoming log data to stdout, so that it ends up in CloudWatch. From there, I can see that each incoming message has its contents stored in a field called log, and that split messages are simply represented as two consecutive messages whose log contents should be concatenated, which is exactly what my multiline filter is intended to do. Given that the log data is visible in its entirety (albeit split into two messages), it is not being truncated on the way into aws-for-fluent-bit, so I am left wondering about the effective runtime configuration used by aws-for-fluent-bit when running in AWS.

opteemister commented 3 years ago

I'm not sure whether it is related or not, but after this comment above:

> On Fargate we actually still use the docker code in a wrapper: https://github.com/aws/amazon-ecs-shim-loggers-for-containerd

I noticed that there are hard limits on log length in the wrapper too (it uses the same sizes as Docker for splitting logs).

marksumm commented 3 years ago

@opteemister Which limits are you referring to? I noticed the default max-buffer-size of 1m, but given that splitting seems to happen at the 16 KiB container runtime limit, that buffer size should already be sufficient. My best guess right now is that the regular expressions in multiline parser rules are somehow mangled when the user-supplied config is injected into aws-for-fluent-bit running in AWS, causing them to never match. When running locally, I have the luxury of being able to completely replace the runtime configuration.

opteemister commented 3 years ago

Here are several places I found:
https://github.com/aws/amazon-ecs-shim-loggers-for-containerd/blob/master/logger/common.go#L44
https://github.com/aws/amazon-ecs-shim-loggers-for-containerd/blob/master/logger/common.go#L50

Line 50 could possibly be related, but I'm still not sure. If the flow is container logs -> wrapper (FireLens) -> fluent-bit, then it shouldn't be related, because all of the limits apply before fluent-bit. But if it is (somehow) container logs -> fluent-bit -> wrapper, then it could be.

Your assumption that the regular expressions in multiline parser rules are somehow mangled makes sense, though. I reached the same conclusion, but didn't have any hard evidence for it.

I was also wondering why the wrapper can't concatenate the logs by itself: https://github.com/aws/aws-for-fluent-bit/issues/25#issuecomment-907748568

marksumm commented 3 years ago

@opteemister Thanks for the pointers. Since my last message, I modified the command of the aws-for-fluent-bit container running in AWS to output all of the generated config, and I can see that the user-supplied config is simply inserted via @INCLUDE after some inputs are defined and metadata fields are added using a record_modifier filter. The strange thing is that if I replicate this configuration locally, it still works as expected. The only thing I changed was replacing an ES output with a stdout one.
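For context, the generated config described above has roughly this shape. This is a simplified sketch based on the under-the-hood repo linked earlier, with placeholder values; it is not the literal generated file:

```
[INPUT]
    Name        forward
    unix_path   /var/run/fluent.sock

[FILTER]
    Name        record_modifier
    Match       *
    Record      ecs_cluster    <cluster-name>
    Record      ecs_task_arn   <task-arn>

@INCLUDE /fluent-bit/etc/external.conf
```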

jawon-benchling commented 3 years ago

Hi, reading through this bug, is it fair to say that multiline log parsing on FireLens + Fluent Bit just doesn't work? I'm trying it out on our services with these configurations (we forward to Sumo Logic): https://gist.github.com/jawon-benchling/1b991f01c533aaf8d9505f26e265c850

Not having multiline log parsing is a dealbreaker for us.

Thank you!

PettitWesley commented 3 years ago

@marksumm

> From there, I can see that each incoming message has its contents stored in a field called log and split messages are simply represented as two consecutive messages where the contents of log should be concatenated, which is exactly what my multiline filter is intended to do. Given that the log data is visible in its entirety (albeit split into two message) it seems that it is not being truncated on the way into aws-for-fluent-bit, so I am left wondering about the effective runtime configuration being used by aws-for-fluent-bit when running in AWS.

As others noted, this is probably because most container runtimes that I know of (both Docker and containerd with the shim loggers that we use in Fargate) truncate logs at 16KB. Is this what you are seeing?
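To make the split-vs-truncate distinction concrete, the runtime behavior can be illustrated with a short sketch. The 16 KiB figure is the runtime default discussed in this thread; the functions are illustrative, not the shim loggers' actual code:

```python
CHUNK = 16 * 1024  # both Docker and the containerd shim loggers emit ~16 KiB max per event

def runtime_split(line: str) -> list[str]:
    """Mimic a container runtime splitting one long stdout line into events."""
    return [line[i:i + CHUNK] for i in range(0, len(line), CHUNK)]

long_line = "x" * (40 * 1024)  # e.g. a JSON log entry with a huge stack_trace field
events = runtime_split(long_line)

print(len(events))                   # 3 events: 16 KiB + 16 KiB + 8 KiB
print("".join(events) == long_line)  # True: nothing is lost, only split
```

This is why simple concatenation of consecutive events can recover the original line; the hard part is knowing which events belong together, which is what the partial_message metadata or a content-based multiline rule provides.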

If so, then I'm not certain if the new multiline feature can help re-concatenate these split logs. We are tracking this internally though as feature gap. And we have this old issue: https://github.com/aws/aws-for-fluent-bit/issues/25

As part of this we need to fix the fact that Fargate PV 1.4 does not set the partial message indicator.

marksumm commented 3 years ago

@PettitWesley I'm afraid you've missed the point: I already mentioned the 16 KiB limit. The limit causes messages to be split into multiple parts, which is not truncation. Ideally, this would be handled transparently on the AWS side. However, given that this does not currently happen, several of us have attempted to work around the issue by using the multiline parsing feature of Fluent Bit. Please note that this doesn't need to rely on any special container runtime metadata, as simple awareness of the content of a typical log message (for example, that the first line starts with a timestamp) is enough to form the basis of a regular expression parser.

I already proved that both parts of a split message are arriving at Fluent Bit while running in AWS, with each part stored in the "log" field of a separate event. Fortunately, the Fluent Bit multiline parser is able to operate on fields as well as raw messages.

Now, the example multiline parser configuration from Fluent Bit was already copied into the AWS FireLens examples, which implies that it should work. However, I doubt very much that it does: any multiline parser configuration I successfully test locally mysteriously stops working when deployed to AWS.

PettitWesley commented 3 years ago

@marksumm

Please note that this doesn't need to rely on any special container runtime metadata, as simple awareness of the content of a typical log message (for example, if the first line starts with a timestamp) is enough to form the basis of a regular expression parser.

Good point.

I wanted to note, though, that the partial_message flag set by the runtime is the most fully generic approach, which will solve all use cases, and I am attempting to get that prioritized.

Separately, there is still the issue that the new multiline feature doesn't work in ECS FireLens. I've added reproducing this to my TODO list.

Apologies for the inconvenience everyone is experiencing with this; I know many have been waiting for generic multiline support for a long time. AWS will work with the upstream community to get it fully working.