fluent / fluent-bit-kubernetes-logging

Fluent Bit Kubernetes Daemonset

CPU and Memory Requests and Limits? #14

Open · StevenACoffman opened this issue 6 years ago

StevenACoffman commented 6 years ago

What would be a good set of CPU and memory requests and limits? For comparison, this is filebeat:

        resources:
          requests:
            cpu: 2m
            memory: 10Mi
          limits:
            cpu: 10m
            memory: 20Mi

I know that the documentation talks about Memory limits being dependent on the buffer amount.

So, if we impose a limit of 10MB for the input plugins and consider the worst-case scenario of the output plugin consuming 20MB extra, as a minimum we need (30MB x 1.2) = 36MB.

Given a Mem_Buf_Limit of 5MB, would we need 13 MB? I have very little insight into what an appropriate CPU request and limit would be.
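
For context, here is a sketch of where that cap is set, assuming the usual tail input for container logs (the ConfigMap name, paths and parser are illustrative, not this repo's exact manifest):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fluent-bit-config   # illustrative name
      namespace: logging
    data:
      fluent-bit.conf: |
        [INPUT]
            Name              tail
            Path              /var/log/containers/*.log
            Parser            docker
            Tag               kube.*
            DB                /var/log/flb_kube.db
            # Per-input cap on the in-memory buffer; this is the number
            # that feeds the estimation formula above.
            Mem_Buf_Limit     5MB
            Skip_Long_Lines   On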

edsiper commented 6 years ago

@StevenACoffman

I think the real memory requirements will depend on the number of filters and output plugins defined; using the approach described earlier should work.

For CPU we need to do some tests.
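
As a rough worked example of that approach (illustrative, not a recommendation): a single input capped at Mem_Buf_Limit 5MB plugged into the documented estimate gives (5MB + 20MB) x 1.2 = 30MB, so a container-spec fragment might look like this, with the CPU numbers being pure guesses until those tests happen:

        resources:
          requests:
            cpu: 5m          # guess; no measurements yet
            memory: 15Mi
          limits:
            cpu: 100m        # guess; needs testing under load
            memory: 30Mi     # (5MB + 20MB) x 1.2 per the docs estimate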

solsson commented 6 years ago

#18 increased the memory limits but did not try to address spikes. https://github.com/fluent/fluent-bit-kubernetes-logging/pull/19 tries to restrict the producer buffers, which, if I interpret https://github.com/fluent/fluent-bit-kubernetes-logging/pull/16#issuecomment-359487685 correctly, should allow a memory limit as discussed in http://fluentbit.io/documentation/0.12/configuration/memory_usage.html#estimating. Sadly I have no good test environment with unprocessed logs, so I'll just have to keep this running for a couple of days. It probably won't validate startup behaviour.
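
I haven't verified exactly what #19 changes, but for reference, fluent-bit's kafka output passes rdkafka.-prefixed keys straight to librdkafka, so capping the producer queue could look roughly like this (broker, topic, the ConfigMap name and the 32MB figure are made-up values; this is a sketch of the idea, not necessarily what the PR does):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fluent-bit-output-kafka   # illustrative name
      namespace: logging
    data:
      output-kafka.conf: |
        [OUTPUT]
            Name     kafka
            Match    *
            Brokers  kafka:9092
            Topics   fluent-bit
            # librdkafka passthrough: cap the producer queue at ~32MB
            # instead of librdkafka's much larger default.
            rdkafka.queue.buffering.max.kbytes  32768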

StevenACoffman commented 6 years ago

@solsson If you delete the fluent-bit daemonset (and pods), then delete the file /var/lib/docker/containers/flb_kube.db on the host filesystem (it maps to /var/log/flb_kube.db inside the container), and re-apply the daemonset, the new fluent-bit daemonset will reprocess all the host's logs, so you can retest spikes.
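
What makes that reset work is that the tail database sits on a hostPath volume, so it survives pod deletion. Here is a sketch of the relevant DaemonSet wiring, assuming a layout along the lines of this repo's manifests (image tag, namespace and labels are illustrative, and the exact on-host DB path depends on your mounts and config):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluent-bit
      namespace: logging
    spec:
      selector:
        matchLabels:
          app: fluent-bit
      template:
        metadata:
          labels:
            app: fluent-bit
        spec:
          containers:
          - name: fluent-bit
            image: fluent/fluent-bit:0.12.13   # illustrative tag
            volumeMounts:
            # The tail DB (flb_kube.db) lands on one of these hostPath
            # mounts (which one depends on the DB path in the config),
            # so it persists on the node; deleting it there lets a
            # re-created daemonset reprocess everything.
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          - name: varlibdockercontainers
            hostPath:
              path: /var/lib/docker/containers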

solsson commented 6 years ago

I did this on two nodes now. It's regular GKE, and I guess the logs are rotated, because no node has more than 100M of container logs.

Here's the 10 min rate of bytes in for the two pods I tested:

And this is the memory consumption:

[screenshot 2018-02-07 21:50: memory consumption chart]

The results won't mean much without higher log volumes, but the two containers that started processing without an existing flb_kube.db do have the highest memory use. I did see one of them restart initially (no evidence of OOMKilled, though), but it then successfully caught up.

Actually I've never seen the memory use of a fluent-bit pod go down.

edsiper commented 6 years ago

@solsson The graphs confused me a bit: in the memory consumption chart, what does each line represent?

solsson commented 6 years ago

A container, and since there's only one container in a fluent-bit pod, it also represents a pod. They're all from the daemonset, so there's one pod per node. The end of a line means the pod was killed. In the case above I did three rolling upgrades with reverts, hence some pods survived and some were re-created (at which point they get a new name).

I don't think we can draw many conclusions from the graphs above. Higher log volumes would be useful. But memory and CPU limits must always be adapted to the workload, and to me it looks like the current limits from #18 work fine, as the spike above used about 65% of the limit. I think the capping of kafka's buffer in #19 led to less dramatic spikes, but it could also be the CPU cap or the IO situation that does so, because ingestion gets limited.