aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image

Crashes with "Could not find sequence token in response: response body is empty" #161

Open alexmbird opened 3 years ago

alexmbird commented 3 years ago

Hi folks,

We've been running the 2.12.0 release to ship our logs to CloudWatch with the new cloudwatch_logs plugin. We'd been waiting for the fix for renewing STS tokens, so this is our first outing with it.

After running reliably for several hours, several of our pods have crashed with:

[2021/03/09 22:21:54] [ info] [output:cloudwatch_logs:cloudwatch_logs.5] Created log stream fluent-bit-z64wb.application.<redacted>.log
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/09 22:22:55] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/09 22:22:56] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[2021/03/09 22:23:02] [ warn] [engine] failed to flush chunk '1-1615328565.648760323.flb', retry in 8 seconds: task_id=2, input=tail.5 > output=cloudwatch_logs.5 (out_id=5)

After that it exits with an error status and Kubernetes replaces the pod.

Curiously, several replicas of fluentbit failed with the same error at once. This makes me wonder if the CloudWatch API was briefly unavailable. But if so, I'd expect the behaviour to be that it retries rather than taking down the whole fluentbit replica.

PettitWesley commented 3 years ago

Hey @alexmbird a few questions:

  1. Can you share your full config? The bad file descriptor bit looks a lot like another issue that you can see referenced.
  2. When you say other replicas failed with the same error, is the full output the same? Can you share it if there are any differences? I am wondering if the sequence token failure is the issue or if it's a red herring. From the output, it looks like it is correctly retrying after that error.
  3. There's no more error output? It just crashes after that? No stacktrace and no signal information?
alexmbird commented 3 years ago

Hey there,

Can you share your full config

Sure. I'd rather not post it publicly, so do you have an email address I can use?

When you say other replicas failed with the same error, is the full output the same?

The outputs are similar but not quite identical. Here are ~~two~~ three more examples:

$ kubectl -n cluster-fluentbit logs --previous fluent-bit-csdp4
[2021/03/09 21:46:00] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/09 21:46:00] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 21:46:00] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 21:46:00] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/09 21:46:01] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 21:46:01] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 21:46:01] [error] [input chunk] error writing data from tail.5 instance
[2021/03/09 21:46:01] [ warn] [engine] failed to flush chunk '1-1615326351.539869750.flb', retry in 7 seconds: task_id=1, input=tail.5 > output=cloudwatch_logs.5 (out_id=5)

$ kubectl -n cluster-fluentbit describe po/fluent-bit-csdp4
...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    135

This one explicitly mentions a ServiceUnavailableException...

$ kubectl -n cluster-fluentbit logs --previous fluent-bit-8cpsn
[2021/03/10 12:09:04] [error] [aws_client] connection initialization error
[2021/03/10 12:09:04] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send log events
[2021/03/10 12:09:04] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send log events
[2021/03/10 12:09:04] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send events
[2021/03/10 12:09:05] [ warn] [engine] failed to flush chunk '1-1615378132.172180559.flb', retry in 6 seconds: task_id=1, input=tail.6 > output=cloudwatch_logs.4 (out_id=4)
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inode=229642033 handle rotation(): /var/log/containers/<redacted>.log => /var/lib/docker/containers/<redacted>.log.1
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inotify_fs_remove(): inode=229642033 watch_fd=25
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inotify_fs_add(): inode=229642033 watch_fd=28 name=/var/lib/docker/containers/<redacted>-json.log.1
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inotify_fs_add(): inode=229642035 watch_fd=29 name=/var/log/containers/<redacted>.log
[2021/03/10 12:14:50] [ info] [input:tail:tail.7] inotify_fs_remove(): inode=229642033 watch_fd=28
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] PutLogEvents API responded with error='ServiceUnavailableException', message='The server failed to fulfill the request. Please try again.'
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/10 13:17:53] [ warn] [engine] failed to flush chunk '1-1615382259.778215743.flb', retry in 11 seconds: task_id=4, input=tail.7 > output=cloudwatch_logs.5 (out_id=5)

$ kubectl -n cluster-fluentbit describe po/fluent-bit-8cpsn
...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    135

And this one is broadly similar...

$ kubectl -n cluster-fluentbit logs --previous fluent-bit-njw9l
[2021/03/10 22:02:51] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=299900464 watch_fd=35
[2021/03/10 22:07:33] [ info] [input:tail:tail.5] inode=292557536 handle rotation(): /var/log/containers/<redacted>.log => /var/lib/docker/containers/<redacted>-json.log.1
[2021/03/10 22:07:33] [ info] [input:tail:tail.5] inotify_fs_add(): inode=292557537 watch_fd=37 name=/var/log/containers/<redacted>.log
[2021/03/10 22:08:22] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=292557536 watch_fd=34
[2021/03/10 22:10:10] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Could not find sequence token in response: response body is empty
[2021/03/10 22:10:11] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty

$ kubectl -n cluster-fluentbit describe po/fluent-bit-njw9l
...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    135

There's no more error output? It just crashes after that? No stacktrace and no signal information?

Affirmative - the output ends abruptly.

juanmanuelito commented 3 years ago

Hi Folks

I'm having this error too. Do you know if there is another version that works? I'm running the latest. Also, is there any update on this issue?

PettitWesley commented 3 years ago

@alexmbird Sorry that I missed your update.

Sure. I'd rather not post it publicly so have you an email address I can use?

Thing is... I'd also rather not share my work email publicly. Can you DM me on twitter? https://twitter.com/TheWesleyPettit I'm also on Fluent Slack group.

PettitWesley commented 3 years ago

@alexmbird @juanmanuelito Is this issue consistent and repeatable or is it occasional and random? Are you continually getting sequence token errors with this version? Have you tried other versions?

This issue is weird... we didn't make any code changes recently that affect sequence tokens. I also can't repro it myself.

I have a third report of this exact same issue via an internal system... in that case the customer found that once they tested it again the issue did not return. And they have not experienced it again. The occurrence was around the same time as when @alexmbird saw it.

[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty

This is weird. And it's not clear that it's anything wrong with the plugin code. I am wondering if the CloudWatch Logs API was undergoing some maintenance at this time and there was a brief accidental outage in sequence token processing. That would explain this other report that I have where it was an issue one day, and then the next everything was fine and it never occurred again.

alexmbird commented 3 years ago

Hi @PettitWesley. I can confirm it's still happening to us - I'd say a couple of dozen crashes a day across our cluster, which rather spams our alerts channel!

I've tried to DM you on Twitter but I get this:

(screenshot of the Twitter DM error omitted)

I am @alexhewson2 if you'd like to try.

PettitWesley commented 3 years ago

@alexmbird Ok, never mind on Twitter, I don't use it enough and don't know how it actually works. I checked and I realized I did create an email address for aws fluent bit, which I am pretty sure is open to receive external email: aws-for-fluent-bit@amazon.com. Try that.

I can confirm it's still happening to us

Just to be clear- the sequence token error messages are recurring? Or just that fluent bit is crashing?

alexmbird commented 3 years ago

I've sent a copy of the config to aws-for-fluent-bit@amazon.com.

Yes, I'm still getting errors about sequence tokens and shortly after fluentbit crashes. Here's the output from the most recent one, which happened two hours ago.

$ kubectl -n cluster-fluentbit logs  --previous fluent-bit-rvmk6
...
[2021/03/26 07:07:47] [ warn] [output:cloudwatch_logs:cloudwatch_logs.2] [size=367370] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:07:51] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/26 07:07:51] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/26 07:07:51] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/26 07:07:51] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/26 07:07:58] [ warn] [output:cloudwatch_logs:cloudwatch_logs.2] [size=323578] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:07:58] [ warn] [output:cloudwatch_logs:cloudwatch_logs.2] [size=302934] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:08:01] [ warn] [output:cloudwatch_logs:cloudwatch_logs.5] [size=323578] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:08:01] [ warn] [output:cloudwatch_logs:cloudwatch_logs.5] [size=302934] Truncating event which is larger than max size allowed by CloudWatch
[lib/chunkio/src/cio_file.c:961 errno=22] Invalid argument
[2021/03/26 07:08:01] [error] [storage] [cio file sync] error adjusting size at:  tail.6/1-1616742448.432613720.flb
[2021/03/26 07:08:01] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[2021/03/26 07:08:02] [ warn] [output:cloudwatch_logs:cloudwatch_logs.5] [size=267670] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:08:02] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=108007185 watch_fd=172
[2021/03/26 07:08:02] [engine] caught signal (SIGSEGV)

I've examined the output from this pod in more detail and it looks like the "Could not find sequence token" error happened a few times during its lifetime, not just immediately before it crashed. It might therefore be that we're looking at two separate issues.

PettitWesley commented 3 years ago

@alexmbird the sequence token thing is very strange and I am doubtful it's the cause of the crash. I am also struggling to figure out how Fluent Bit could cause this problem; it seems more likely it's the CW API. I am wondering if the crash is actually caused by this: https://github.com/fluent/fluent-bit/issues/2933

I have two ideas to try:

  1. Run Fluent Bit with Valgrind. This should give us a useful stacktrace on shutdown to show why it's dying. I uploaded an image which any AWS account can pull here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/valgrind-2_12_0:latest
  2. Try removing the file buffering from your config. It's possible that causes the issue.
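For idea 2, a rough sketch of what that change could look like on a tail input - the input below is a generic placeholder, not taken from your actual config:

[INPUT]
    Name              tail
    Tag               application.*
    Path              /var/log/containers/*.log
    Parser            docker
    Mem_Buf_Limit     50MB
    # switch from filesystem buffering back to in-memory buffering
    # storage.type    filesystem
    storage.type      memory

If storage.path in the [SERVICE] section is only there to back this input's filesystem buffering, it can be removed as well.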
ib-ak commented 3 years ago
[2021/03/31 16:38:18] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 16:38:51] [ warn] [net] getaddrinfo(host='logs.us-west-2.amazonaws.com'): Name or service not known
[2021/03/31 16:38:51] [error] [aws_client] connection initialization error
[2021/03/31 16:38:56] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/03/31 16:38:56] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/03/31 16:38:56] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events
[2021/03/31 16:38:56] [ warn] [engine] failed to flush chunk '1-1617208732.125323519.flb', retry in 10 seconds: task_id=7, input=tail.4 > output=cloudwatch_logs.1 (out_id=1)
[2021/03/31 16:39:01] [ info] [input:tail:tail.0] inotify_fs_add(): inode=38951155 watch_fd=20 name=/var/log/containers/cert-manager-cainjector-8c8c66cf5-kjhnf_cert-manager_cert-manager-cc057e295d360d5eee8052729576b97e5743e0f20c1d92d85a684547f6b99109.log
[2021/03/31 16:39:01] [ info] [input:tail:tail.0] inotify_fs_add(): inode=19266564 watch_fd=21 name=/var/log/containers/coredns-6548845887-8wsr8_kube-system_coredns-de13282d1e52792a0c259964f66ba4fd13b06d1d11dea4f9ae3e5aa826cb6118.log
[2021/03/31 16:39:01] [ info] [input:tail:tail.0] inotify_fs_add(): inode=13805768 watch_fd=22 name=/var/log/containers/monitoring-kube-prometheus-operator-56c844c5b8-r8779_monitoring_kube-prometheus-stack-aff30feca6eac1b631e339d1469f710b254eae6cfe3ce3846121bd5f9a0cb681.log
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.cert-manager-cainjector-8c8c66cf5-kjhnf_cert-manager_cert-manager-cc057e295d360d5eee8052729576b97e5743e0f20c1d92d85a684547f6b99109.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.cert-manager-cainjector-8c8c66cf5-kjhnf_cert-manager_cert-manager-cc057e295d360d5eee8052729576b97e5743e0f20c1d92d85a684547f6b99109.log
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.coredns-6548845887-8wsr8_kube-system_coredns-de13282d1e52792a0c259964f66ba4fd13b06d1d11dea4f9ae3e5aa826cb6118.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.coredns-6548845887-8wsr8_kube-system_coredns-de13282d1e52792a0c259964f66ba4fd13b06d1d11dea4f9ae3e5aa826cb6118.log
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.monitoring-kube-prometheus-operator-56c844c5b8-r8779_monitoring_kube-prometheus-stack-aff30feca6eac1b631e339d1469f710b254eae6cfe3ce3846121bd5f9a0cb681.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.monitoring-kube-prometheus-operator-56c844c5b8-r8779_monitoring_kube-prometheus-stack-aff30feca6eac1b631e339d1469f710b254eae6cfe3ce3846121bd5f9a0cb681.log
[2021/03/31 16:39:12] [ info] [input:tail:tail.0] inotify_fs_add(): inode=23318123 watch_fd=23 name=/var/log/containers/argocd-application-controller-2_argocd_application-controller-3b3867761391ee260ebdbbf279acaf130c19e44ef43388ef1c4f91cc8cea523c.log
[2021/03/31 16:39:16] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.argocd-application-controller-2_argocd_application-controller-3b3867761391ee260ebdbbf279acaf130c19e44ef43388ef1c4f91cc8cea523c.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:16] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.argocd-application-controller-2_argocd_application-controller-3b3867761391ee260ebdbbf279acaf130c19e44ef43388ef1c4f91cc8cea523c.log
[2021/03/31 16:39:32] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Could not find sequence token in response: response body is empty
[2021/03/31 16:39:32] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Could not find sequence token in response: response body is empty
[2021/03/31 17:44:37] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 17:55:13] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 18:21:22] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 18:46:33] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty

Looks like it's unable to recover from the connection error.

PettitWesley commented 3 years ago

@abhinavkhanna-sf If your company has a Support relationship with AWS, please consider opening a ticket for the sequence token issue.

I do not currently have any reason to believe it is a bug in Fluent Bit or a connection issue. If Fluent Bit was dropping HTTP responses, we would expect to see that problem in other plugins as well, all core plugins use the same http stack.

PettitWesley commented 3 years ago

Also, one option worth considering is trying out the old cloudwatch plugin and seeing if you still see the same error. That's the old Go plugin; it's not as efficient, but if you have time to test, I am curious whether the same issue occurs with it. If it does, then we know it's nothing in the code (the Go plugin uses an entirely different http client) and that it is truly an issue in the CW API.

Just change the name of your output:

[OUTPUT]
    Name cloudwatch
...
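As a rough sketch, the switched output might look like the following - the region, group, and prefix values are placeholders, so carry over whatever your current cloudwatch_logs output uses:

[OUTPUT]
    Name              cloudwatch
    Match             application.*
    region            us-west-2
    log_group_name    /aws/containerinsights/my-cluster/application
    log_stream_prefix my-node-
    auto_create_group true

The common options (region, log_group_name, log_stream_prefix, auto_create_group) are shared between the two plugins, so in many cases only the Name needs to change.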
alexmbird commented 3 years ago

@PettitWesley I'm trying to run your Valgrind build on our dev environment. Early results look like this:

kubectl -n cluster-fluentbit get po
NAME               READY   STATUS             RESTARTS   AGE
fluent-bit-9bfvq   0/1     CrashLoopBackOff   10         23m
fluent-bit-c9ssg   0/1     CrashLoopBackOff   10         23m
fluent-bit-fd4xc   0/1     CrashLoopBackOff   10         23m
fluent-bit-frz8s   1/1     Running            0          23m
fluent-bit-jpx54   0/1     Running            11         23m
fluent-bit-k2tn4   0/1     CrashLoopBackOff   9          23m
fluent-bit-lmp4g   0/1     OOMKilled          11         23m
fluent-bit-pvpqc   0/1     Running            11         23m
fluent-bit-t6x4h   1/1     Running            0          23m
fluent-bit-wzxrw   1/1     Running            0          23m
fluent-bit-zfn8b   0/1     CrashLoopBackOff   10         23m
fluent-bit-znb6x   1/1     Running            0          23m

I'll get back to you with more once I've made sense of these CrashLoopBackOffs

alexmbird commented 3 years ago

I've emailed you a log from one of the pods. Curiously it isn't printing a stacktrace before exiting, though there's plenty of output from Valgrind at the start to prove it's working.

Following the second suggestion - I've switched back to the regular image and disabled disk buffering entirely with storage.type memory. But that's not a satisfactory long-term solution for us: our applications log about 1TB of data a month, and given the brief CloudWatch outages we're already seeing, it's vital that the clusters are able to store everything without dropping data.

alexmbird commented 3 years ago

FluentBit has run without file buffering for two days without a single crash. I think we've isolated the problem.

Interestingly...

$ kubectl -n cluster-fluentbit logs -l app.kubernetes.io/instance=fluentbit | grep -ci "Could not find sequence token"
0

This is across 30 pods that have been running for two days. I take it to mean either whatever was wrong with CloudWatch's API has been fixed, or that the error was somehow caused by the use of file buffering.

Given our logging volumes, disabling file buffering isn't a good long-term solution. I'm already seeing errors like:

[2021/04/03 08:51:31] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1617439891.605614905.flb in plugin cloudwatch_logs.4

...which means log messages really are being thrown away. Not what we need from our production logging system! A quick grep reveals that error has occurred 70 times in the last few days. I'm not sure what part of FluentBit is responsible for handling these file buffers, so is it something you need to look into or should I open an issue in the main repo?
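For context, the filesystem buffer is governed by Fluent Bit's storage settings, so one mitigation is to give the buffer volume more room and put an explicit cap on each output's queue. A minimal sketch, with illustrative paths and sizes rather than values from our actual config:

[SERVICE]
    # directory where filesystem chunks are written; the volume mounted here
    # must have enough free space for the expected backlog
    storage.path              /var/fluent-bit/state/flb-storage/
    # limit how many backlogged chunks are loaded back into memory at startup
    storage.backlog.mem_limit 5M

[OUTPUT]
    Name                      cloudwatch_logs
    Match                     application.*
    region                    us-west-2
    log_group_name            /aws/containerinsights/my-cluster/application
    log_stream_prefix         my-node-
    # cap the disk space this output's queued chunks may consume
    storage.total_limit_size  500M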

PettitWesley commented 3 years ago

@alexmbird Thank you for the update.

I'm not sure what part of FluentBit is responsible for handling these file buffers, so is it something you need to look into or should I open an issue in the main repo?

Yeah this is not a part of the code that we (AWS) have experience in. Opening an issue upstream is best.

anvouk commented 3 years ago

Same problem here.

Out of 10 fluentbit pods, 8 were stuck with this error. Restarting the pods magically fixed everything, but only for a while.

wtchangdm commented 3 years ago

Hello guys, I recently encountered this issue and haven't found a solution.

If you have any information, please consider posting it at https://github.com/fluent/fluent-bit/issues/3477 as well, thanks!

PettitWesley commented 3 years ago

Does anyone see "Sequence token was invalid, will retry" in their debug log output?

So this sequence token error message happens when the plugin receives a 200 response from the CWL API, but the response body is empty and the next sequence token is not found. This is treated as a success case, since the logs were successfully sent and the API returned 200. The next sequence token would be used in the next request to PutLogEvents. If it uses an older one, then the API should return InvalidSequenceTokenException and it should emit the above debug message.

I have managed to reproduce the Could not find sequence token in response: response body is empty error message, but I don't see the above message for InvalidSequenceTokenException on the next PutLogEvents call. Which is super weird and makes no sense to me.

PettitWesley commented 3 years ago

So I have made some progress in understanding the root cause of this issue. No promises, but we might have a fix by next week.

PettitWesley commented 3 years ago

A stop gap solution is up: https://github.com/fluent/fluent-bit/pull/3605

PettitWesley commented 3 years ago

My fix for this has been released in Fluent Bit 1.7.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.16.0

psyhomb commented 3 years ago

After deploying version 2.16.0 I'm seeing a lot of these errors in the log. Is this something I should be worried about, or is this expected behavior? As far as I can tell, log events are successfully forwarded to the configured destinations.

[2021/06/19 10:22:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/06/19 10:22:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/06/19 10:22:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2021/06/19 10:22:21] [ warn] [engine] failed to flush chunk '1-1624098140.919796085.flb', retry in 7 seconds: task_id=0, input=forward.0 > output=cloudwatch_logs.0 (out_id=0)

[2021/06/19 10:30:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/06/19 10:30:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/06/19 10:30:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2021/06/19 10:30:21] [ warn] [engine] failed to flush chunk '1-1624098621.197417164.flb', retry in 10 seconds: task_id=0, input=forward.0 > output=cloudwatch_logs.0 (out_id=0)

Thanks

PettitWesley commented 3 years ago

@psyhomb This is expected. The results of my investigation showed that the old sequence token errors should be failures, since they represented cases where we did not get a full response from the CW API, which means we can't trust that the logs were actually received.

So now I make it fail unless we get a full response with a request ID.

I'm not certain why this happens in the first place- why we get a 200 response but it seems to be incomplete. I noticed in my own testing that this 200 response where we can't get the sequence token is never followed by an invalid sequence token exception in the next call. Which almost certainly indicates that CW didn't actually uptake the last batch of logs.

There is some longer term investigation needed here on the deeper root cause, which is something like one of the following options:

  1. CW API somehow returns an incomplete response sometimes
  2. Fluent Bit is somehow incorrectly processing the response and thinks it's valid when it isn't.

However, for the short term, the fix that was just released should ensure there is no log loss.

If anyone sees duplicated logs and this new error message, please let me know, that would be interesting. (This did not occur in my testing).

PettitWesley commented 3 years ago

Also, with these bug reports, can folks please quantify what "a lot" of errors means? In my own testing, I never saw either of these errors more than roughly once per day per task. And that was running a bunch of tasks, all shipping ~1000 TPS logs to CW.

psyhomb commented 3 years ago

Also, with these bug reports, can folks please quantify what "a lot" of errors means? In my own testing, I never saw either of these errors more than roughly once per day per task. And that was running a bunch of tasks, all shipping ~1000 TPS logs to CW.

Sure thing, I noticed ~88 of these errors in one hour for a single task. But after that everything went back to normal; I don't see any more errors so far.

psyhomb commented 3 years ago

@psyhomb This is expected. The results of my investigation showed that the old sequence token errors should be failures, since they represented cases where we did not get a full response from the CW API [...] However, for the short term, the fix that was just released should ensure there is no log loss.

Thanks for the detailed explanation. 👍🏼

ShawnZong commented 3 years ago

@psyhomb This is expected. [...] However, for the short term, the fix that was just released should ensure there is no log loss. If anyone sees duplicated logs and this new error message, please let me know, that would be interesting.

In Fargate, we tried to adopt EMF to CloudWatch, but we encountered the same issue; these errors pop up 10-15 times per hour. We've used the optimized configuration which sets Flush to 1s [1] - could that be the issue?

In our other service, which receives more requests, these errors appear 40-50 times per hour.

[1] https://github.com/aws/aws-for-fluent-bit/blob/mainline/configs/minimize-log-loss.conf

psyhomb commented 3 years ago

Be careful when using low flush intervals because you can easily hit CW Logs limits that way. https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html

5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.

PettitWesley commented 3 years ago

@ShawnZong Can you share the exact log messages?

ShawnZong commented 3 years ago

@PettitWesley Hi, these are the log messages: (screenshot of the errors omitted)

PettitWesley commented 3 years ago

@ShawnZong thanks. If you can enable debug logging and show me those logs, that might be interesting since I added this debug line: https://github.com/fluent/fluent-bit/blob/master/plugins/out_cloudwatch_logs/cloudwatch_api.c#L1363

I apologize that this issue continues... I have been entirely unable to figure out a better solution to it than the current stopgap (which prevents log loss but does lead to errors). AFAICT, this issue does seem to be correlated with getting close to CW Log Stream ingestion limits. So anything which helps you stay farther away from those limits might help. That might mean increasing the flush interval. Depends on your exact use case and throughput and many details.
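For anyone willing to capture those debug logs, the log level is set in the [SERVICE] section; a minimal sketch with illustrative values:

[SERVICE]
    Flush        5
    # emit debug-level lines, including the extra diagnostics around PutLogEvents responses
    log_level    debug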

ShawnZong commented 3 years ago

@PettitWesley Hi, the debug log shows that the response body is empty: (screenshot omitted)

ShawnZong commented 3 years ago

@PettitWesley I've investigated for a while, and I think it does look like we're getting close to something related to the CW Log Stream ingestion limits. I've tested Fluent Bit with and without this optimized configuration [1]. It turns out that when I use the optimized configuration, which sets the flush timeout to 1s and Grace to 30s, it always causes the x-amzn-RequestId header not found error. However, if I just use the default settings, which have a Flush timeout of 5s, the header error disappears [2]. Btw, I think the error in [2] is normal? I was load testing at the time, and too many log events being sent at once could cause network issues.

I've created an AWS support ticket, and Yuriy M. from the CloudWatch team said she would have a look at this issue, though she couldn't explain the cause for now. Nevertheless, she reminded me to check the CloudWatch quotas:

The following limits apply to the PutLogEvents API call [3][4]:

 1. 5 requests per second per log stream. Additional requests are throttled. This quota cannot be changed.
 2. The maximum batch size of a PutLogEvents request is 1MB.
 3. Total of 800 transactions per second per account per Region (except for the following Regions, where the quota is a total of 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland)).

In Fluent Bit, you can avoid throttling issues by sending log events to a unique log stream per task. Furthermore, it would be best to experiment with increasing the Flush setting in the Service section to be greater than 5 seconds (the default); this might increase the number of events sent per flush, which could decrease the total number of calls to CloudWatch [5]. A sketch of both suggestions follows the references below.

References:
[1] https://github.com/aws/aws-for-fluent-bit/blob/mainline/configs/minimize-log-loss.conf
[2] (screenshot omitted)
[3] https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html
[4] https://docs.aws.amazon.com/general/latest/gr/api-retries.html
[5] https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file
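To make the two suggestions concrete, here is a minimal sketch - the region, group name, and stream prefix are placeholders, not values from our setup:

[SERVICE]
    # default is 5; a larger value batches more events per PutLogEvents call
    Flush             5

[OUTPUT]
    Name              cloudwatch_logs
    Match             application.*
    region            us-west-2
    log_group_name    /my/application/logs
    # a per-task (or per-node) prefix keeps each sender on its own log stream,
    # which helps stay under the 5 requests/second per-stream quota
    log_stream_prefix my-task-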

PettitWesley commented 3 years ago

@ShawnZong Thank you! And I agree, my own testing strongly suggested that this is caused by throttling. One other piece of evidence for this is that I am not aware of an equivalent issue with any of the other Fluent Bit core plugins- there are other plugins that need to check the response body. If this was a bug in Fluent Bit's networking, I'd expect to see other plugins having equivalent issues, since all core plugins share the same networking code.

It is still strange though that Fluent Bit either actually receives or thinks it has received a partial http response with a 200 code but no response body.

However, if I just use the default setting which has a Flush timeout of 5s, the header error disappeared[2]. Btw I think the error in [2] is normal? because I was load testing it, and it could bring network issue if too many log events want to send at the same time?

Yes this is my understanding as well. I have been told by the CW API team that one of their throttling mechanisms is simply to block connections from the same IP. This then leads to Fluent Bit outputting a connection error message.

wtchangdm commented 3 years ago

FWIW, I also changed from Flush 1 to Flush 5 recently, and I did notice far fewer (if not completely gone) warnings like "Recieved code 200 but response was invalid..." than before.

Even with a high volume of logs I only saw "[error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded for logStreamName", but that's a different situation.

PettitWesley commented 3 years ago

Hey folks, I had another idea on this one, if anyone is willing please try out the following image and let me know what you see in the logs:

144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:latest

This image has the repo policy here and can be pulled from any AWS account.

I think this patch might reduce the frequency with which you see these errors.

byrneo commented 3 years ago

I think this patch might reduce the frequency with which you see these errors.

Hi @PettitWesley - we are facing this issue in a couple of our EKS clusters also. I've just updated to your image in one cluster and will keep an eye out. Will keep you posted...cheers.

byrneo commented 3 years ago

Hmmm, seeing pod restarts due to OOM with this one (I'm using the same configs per the AWS/EKS docs), just with the patched image:

 State:          Running
      Started:      Thu, 09 Sep 2021 12:01:13 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 09 Sep 2021 11:17:06 +0100
      Finished:     Thu, 09 Sep 2021 12:01:12 +0100
    Ready:          True
    Restart Count:  4
    Limits:
      memory:  200Mi
    Requests:
      cpu:     500m
      memory:  100Mi
psyhomb commented 3 years ago

To be honest, we have had enough of all the different issues with Fluent Bit; we switched to Filebeat OSS and so far everything is working flawlessly.

ShawnZong commented 3 years ago

@byrneo this might be the reason? https://docs.fluentbit.io/manual/administration/buffering-and-storage (screenshot of the buffering docs omitted)

byrneo commented 3 years ago

hey, thanks @ShawnZong - indeed, I am using the AWS configs referenced from here

They do appear to have set: Mem_Buf_Limit 5MB

Just mentioning it in the context of having tried @PettitWesley's patched image (I don't recall hitting OOM before)

PettitWesley commented 3 years ago

@byrneo Can you share your config? That mem_buf_limit is a bit low for most application logging use cases... I think in the container insights config map the inputs for system logs only have this low limit...

Let me also double check that image for memory leaks and re-build you a new image from an older stable version with my patch...

byrneo commented 3 years ago

@byrneo Can you share your config? That mem_buf_limit is a bit low for most application logging use cases... I think in the container insights config map the inputs for system logs only have this low limit...

Sure @PettitWesley - just note, I didn't modify any of the limits in the configMap (I just used the ones linked in the Container Insights docs)

<snip>
application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/fluent-bit*
        Parser              docker
        DB                  /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  cwagent_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME                   hostname
        Rename              _SYSTEMD_UNIT               systemd_unit
        Rename              MESSAGE                     message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}
<snip>

I can post the entire config if needed

Let me also double check that image for memory leaks and re-build you a new image from an older stable version with my patch...

Cool - thanks!

PettitWesley commented 3 years ago

@byrneo Sorry, the latest tag I pushed to that ECR repo was built off of the latest commits in the 1.8 branch which may not be completely stable.

I've built the same patch on top of 3 different versions which my team is using for performance testing right now, so we are fairly confident these are stable and do not leak memory. (I should also note that high memory usage or spikes in memory != memory leak; you might just have too many logs and not enough allocation for the Fluent Bit pod.) Let me know if you (or anyone else) want me to build this patch on top of any other versions:

144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.8.2
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.7.9
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.7.5
byrneo commented 3 years ago

Thanks @PettitWesley - would it be possible to patch on top of Fluent Bit 1.8.6 (which seems to be used by https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.19.1)?

PettitWesley commented 3 years ago

@byrneo Please try:

144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.8.6
byrneo commented 3 years ago

hi @PettitWesley - I gave that image a try today. No OOM terminations on any of the instances (this is on a 3x node cluster), but I did have one pod restart after a few hours of operation:

   State:          Running
      Started:      Mon, 13 Sep 2021 13:34:18 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    139
      Started:      Mon, 13 Sep 2021 10:39:00 +0100
      Finished:     Mon, 13 Sep 2021 13:34:16 +0100

as well as these (from the same instance):

[2021/09/13 14:29:29] [ warn] [http_client] malformed HTTP response from logs.us-east-1.amazonaws.com:443 on connection #174
[2021/09/13 14:29:29] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/09/13 14:29:29] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/09/13 14:29:29] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2021/09/13 14:29:29] [ warn] [engine] failed to flush chunk '1-1631543369.515077516.flb', retry in 9 seconds: task_id=3, input=tail.0 > output=cloudwatch_logs.0 (out_id=0)
[2021/09/13 14:35:00] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/09/13 14:52:04] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096

Of the other two instances, no warnings seen from one, and just this single warning from the other:

[2021/09/13 11:44:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=126888450 watch_fd=40
[2021/09/13 11:48:14] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/09/13 11:49:42] [ info] [input:tail:tail.0] inotify_fs_add(): inode=197138454 watch_fd=42 name=/var/log/containers/<name_redacted>.log
PettitWesley commented 3 years ago

@byrneo So the fix seems to have possibly worked? The http buffer warnings are a separate thing which we already submitted a fix for. Not sure about the single crash; we have seen some stability problems with the recent Fluent Bit releases. But let's focus on the main issue here. Please let me know if it resurfaces. If it doesn't, then I will submit my patch upstream, and hopefully this ticket can be resolved.

byrneo commented 3 years ago

@PettitWesley - ok, after another 12 hours. Still no errors from 2/3 instances. On the instance that threw the error yesterday, just one more occurrence overnight:

[2021/09/14 05:35:36] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
[2021/09/14 05:35:36] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
[2021/09/14 05:35:36] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send events
[2021/09/14 05:35:36] [ warn] [engine] failed to flush chunk '1-1631597730.696905125.flb', retry in 6 seconds: task_id=2, input=tail.6 > output=cloudwatch_logs.2 (out_id=2)