fluent / helm-charts

Helm Charts for Fluentd and Fluent Bit

[fluentd] Add StatefulSet #12

Closed: nvtkaszpir closed this issue 3 years ago

nvtkaszpir commented 4 years ago

In general it would allow creating log aggregator instances, as described at https://docs.fluentd.org/deployment/high-availability, but in more of an active-active setup.

Use case: We use fluentd as a daemonset, but with a custom config which forwards all processed events into another fluentd instance. That fluentd instance then sends the data to Elasticsearch. It acts more like a buffer for the daemonsets, because that fluentd instance stores data on disk in case of Elasticsearch unavailability or slowness. Other reasons:

Why not use the daemonset to send directly to Elasticsearch? Daemonsets tend to have small buffers on disk (usually a few megabytes, but still), which pauses processing of the input when the buffers are full. So if you have a slow backend service, your whole logging pipeline is throttled, and in some cases logs from short-lived containers are lost. Moreover, when used with preemptible instances in GKE or spot instances in AWS, fluentd may simply not have enough time to retry and flush the buffer before the instance is force-killed. That's why we reconfigure it to send data to another fluentd instance, which also works as a buffer (it runs on long-lived nodes); a sketch of that forwarding config follows below.

Why not a deployment for fluentd? As with the daemonset above, the buffer is kept on disk, and it works okay until the pod is terminated. If the pod is not able to flush the data from disk to the backend service, the buffered data will be lost. So, like the daemonset, a deployment is ok to a certain extent, but it is still lacking, especially in edge situations.
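For illustration, a minimal sketch of that daemonset-side forwarding config, written as it might appear in a values file (the `fileConfigs` key, the aggregator service name and the paths are assumptions, not the chart's actual defaults):

```yaml
# Hypothetical values snippet for the daemonset: forward all processed events
# to an aggregator instead of writing to Elasticsearch directly.
fileConfigs:
  04_outputs.conf: |-
    <match **>
      @type forward
      <server>
        # assumed Service name of the aggregator StatefulSet
        host fluentd-aggregator.logging.svc.cluster.local
        port 24224
      </server>
      <buffer>
        @type file
        path /var/log/fluentd-buffers/forward.buffer
        flush_interval 5s
        retry_forever true
      </buffer>
    </match>
```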

Fix: Use a StatefulSet with a dedicated PVC for the disk buffer. This way the volume names are predictable, and you can scale the system up and down much more easily while keeping the data, even to the extent of totally wiping the pods and recreating them: the disks will be re-attached with the old data. It can also be extended with autoscaling capabilities based on custom metrics, for example disk usage of the buffer pool. The termination grace period would need to be adjusted so that the pod has time to flush data to the backend service; this depends on the use case, but usually 1h is enough.
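A minimal sketch of that fix, using placeholder names, image and sizes rather than the chart's actual values:

```yaml
# Hypothetical StatefulSet for a fluentd aggregator with a per-pod buffer PVC.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: fluentd-aggregator
spec:
  serviceName: fluentd-aggregator
  replicas: 2
  selector:
    matchLabels:
      app: fluentd-aggregator
  template:
    metadata:
      labels:
        app: fluentd-aggregator
    spec:
      # give fluentd time to flush the on-disk buffer to the backend before shutdown
      terminationGracePeriodSeconds: 3600
      containers:
        - name: fluentd
          image: fluent/fluentd # placeholder, pin a concrete tag in practice
          volumeMounts:
            - name: buffer
              mountPath: /var/log/fluentd-buffers
  volumeClaimTemplates:
    - metadata:
        name: buffer
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```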

Something similar is currently done with Elasticsearch clusters (a StatefulSet with the termination grace period set to 1h so that the pod can move its data shards off before shutdown).

naseemkullah commented 4 years ago

Thanks @nvtkaszpir, sorry for the delay in responding. Would you be interested in submitting a PR? The pod template can be leveraged for the statefulset, https://github.com/fluent/helm-charts/blob/master/charts/fluentd/templates/_pod.tpl, to keep the pod spec consistent across workload types.
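For illustration, a statefulset.yaml could wrap that shared pod template roughly like this (the helper names such as fluentd.pod, and the kind/volumeClaimTemplates values, are assumptions rather than the chart's actual interface):

```yaml
# Hypothetical charts/fluentd/templates/statefulset.yaml
{{- if eq .Values.kind "statefulset" }}
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "fluentd.fullname" . }}
  labels:
    {{- include "fluentd.labels" . | nindent 4 }}
spec:
  serviceName: {{ include "fluentd.fullname" . }}
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "fluentd.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "fluentd.selectorLabels" . | nindent 8 }}
    spec:
      # assumed name of the define in _pod.tpl; reusing it keeps the pod spec
      # identical across daemonset, deployment and statefulset
      {{- include "fluentd.pod" . | nindent 6 }}
  volumeClaimTemplates:
    {{- toYaml .Values.volumeClaimTemplates | nindent 4 }}
{{- end }}
```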

nvtkaszpir commented 4 years ago

Maybe next week I'll be able to do it, we'll see.

stevehipwell commented 4 years ago

Any progress on this?

nvtkaszpir commented 4 years ago

I wasn't able to get to it recently for multiple reasons, but it is on my todo list for the upcoming two weeks.

stevehipwell commented 4 years ago

@nvtkaszpir are you working from the current stable chart, another chart (e.g. Bitnami), or from scratch?

nvtkaszpir commented 4 years ago

I haven't decided yet, but it looks like the current chart will be expanded with a statefulset.yaml, because that is the easiest option. Later on I will probably import stuff from helm/charts stable/fluentd. I did not look at Bitnami. Any particular suggestions?

First I want to add the code to use for a PR on GitHub, something like this:

That would be the base for further improvements.

stevehipwell commented 4 years ago

@nvtkaszpir I'd strongly suggest basing your chart changes on stable/fluentd (the statefulset parts). My pain points are currently around the way the stable/fluentd Helm chart is so tightly coupled to its Docker image, plus the fact that there isn't a default Fluentd image anywhere with the correct config to install plugins dynamically (the image in stable/fluentd can, as can Bitnami's).

nvtkaszpir commented 4 years ago

The chart and the image used are another story, because it really depends on what we want to get from fluentd:

For me personally, the stable/fluentd helm chart has some weird logic which decides whether the setup is a daemonset/deployment or a statefulset. In fact I use different charts for different purposes (like the kiwigrid chart for fluentd-elasticsearch with a custom config using the forward protocol to push to the stable/fluentd chart, which finally pushes to elasticsearch).

Either way, those will go into separate tickets.

stevehipwell commented 4 years ago

@nvtkaszpir I'm not sure if it helps but I've created a statefulset Helm chart for Fluentd as part of my chart repo https://github.com/stevehipwell/helm-charts/tree/master/charts/fluentd.

Also, in my opinion, when running as an aggregator (statefulset), the small startup cost of installing a limited set of plugins far outweighs the complexity of managing a custom Docker image. So a chart for aggregation should allow the user to make this decision.
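A sketch of what that decision could look like in a values file, assuming a plain upstream image and a command override (neither key is an existing chart interface):

```yaml
# Hypothetical values snippet: install a small set of plugins at container start,
# then exec fluentd, instead of maintaining a custom Docker image.
command:
  - /bin/sh
  - -c
  - |
    fluent-gem install fluent-plugin-prometheus fluent-plugin-elasticsearch \
      && exec fluentd -c /fluentd/etc/fluent.conf
```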

nvtkaszpir commented 4 years ago

Sorry for the delay (lots of things on my plate). I've been doing some helm chart updates this week, which should address multiple issues already reported.

In short:

PR should be ready this week.

nvtkaszpir commented 4 years ago

Okay, it took a bit longer than expected due to other issues, but I'm aiming for this Thursday with a PR labeled as Work In Progress.

The major slowdowns were:

If you have any suggestions even before seeing the PR, just drop a comment here.

stevehipwell commented 4 years ago

@nvtkaszpir how are you getting on with this?

I've got a couple of questions/suggestions:

nvtkaszpir commented 4 years ago

Sorry for the delay, some urgent private stuff popped up, so no PR yet. Edit: this is turning into multiple issues in one issue... as usual ;)

> Are you planning on creating a separate chart for aggregation via a statefulset, e.g. fluentd-aggregator?

fluent-bit is another story.

After some work, it looks like keeping daemonsets, deployments and statefulsets in the same chart is indeed problematic. Some things can be handled with the _pod template, but most of the differences between sts/deployment/daemonset remain, especially because of the containers and configs used. I've managed to create a specific sts setup, but that means it works with a custom image and custom configs and has custom lifecycle hooks and so on... The general feeling is that customizations should go into values files, especially if some settings are actually very specific to the plugins installed.

This unfortunately leads to more and more specific use cases, so chart-testing should be used to run the tests with specific values files (for example sts + pvc + custom container and so on), but that also means the helm test task really differs depending on the fluentd config.
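For example, chart-testing picks up per-scenario values files from a chart's ci/ directory, so a statefulset scenario could be exercised with something like the following (the file name and value keys are illustrative only):

```yaml
# Hypothetical charts/fluentd/ci/statefulset-values.yaml used by chart-testing (ct)
kind: statefulset
persistence:
  enabled: true
  size: 5Gi
terminationGracePeriodSeconds: 3600
```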

I'm not convinced (at least yet) about creating a separate chart for a specific fluentd use case, though other solutions already exist, such as the kiwigrid helm chart for fluentd-elasticsearch (which is a daemonset and by default sends to an elastic cluster, but that can easily be altered to send to anything you want, for example a separate fluentd sts cluster as I do; the chart itself has some good configs to fetch logs from multiple sources such as containers, systemd, saltstack and so on).

I'll be able to create the PR this week (I finally managed to find some time for this) as a WIP/Draft so you can see it. Note that it will be a huge one and will surely need to be replaced by some smaller ones.

Current progress (not published yet)

> Are you using the Grafana sidecar pattern for dashboards?

Yeah, the idea is to add a configmap with a dashboard to the helm chart, so that the grafana dashboard-import sidecar can import it when the grafana pod starts (init phase).
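A minimal sketch of that pattern; the grafana_dashboard label key matches the common default of the Grafana chart's dashboard sidecar, but it is configurable per installation:

```yaml
# Hypothetical ConfigMap carrying a fluentd dashboard for the Grafana sidecar to pick up.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-dashboard
  labels:
    grafana_dashboard: "1" # label key watched by the dashboard sidecar
data:
  # placeholder body; the real exported dashboard JSON goes here
  fluentd.json: |-
    {"title": "Fluentd", "panels": []}
```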

> What are the issues with worker counts and buffer flushing?

There are actually two issues:

> What issues are you having with the buffering on Docker?

We had an issue with some apps that were producing such massive amounts of logs that the logs were rotated faster than fluentd could follow them (because it did not have enough CPU), which resulted in losing logs. But that's more an issue of the app generating too many logs, and of not adjusting fluentd to cope with that amount of data.

> Have you looked at creating a generic aggregator docker image with Prometheus and other common plugins?

For now I used my own custom image. The default image does not expose metrics at all, which forces you to use custom healthchecks. But this raises another issue: who will maintain such a container? There are some already on the net, but usually they are tailored to a specific use case (such as in other helm repos or in fluentd-kubernetes-daemonset).
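For reference, a sketch of what exposing metrics could look like once fluent-plugin-prometheus is available in the image (the fileConfigs and livenessProbe keys are assumptions; the source blocks follow that plugin's documented config):

```yaml
# Hypothetical values snippet: expose /metrics via fluent-plugin-prometheus and
# use it for a basic liveness check instead of a custom healthcheck script.
fileConfigs:
  00_metrics.conf: |-
    <source>
      @type prometheus
      bind 0.0.0.0
      port 24231
      metrics_path /metrics
    </source>
    <source>
      @type prometheus_monitor
    </source>

livenessProbe:
  httpGet:
    path: /metrics
    port: 24231
```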

nvtkaszpir commented 3 years ago

Long time no see. Unfortunately, due to IP protection from the client, I was unable to publish it. As a workaround I would have to rewrite it significantly to make it compliant with the prior client's requirements.

Since I changed companies, this has dropped on my todo list from important to 'nice to have', because we don't use fluentd at the new place. Either way, it would end up in another repo anyway (the official helm chart).

Thus, I'm closing this issue. If you still want to see it implemented, check the issues at https://github.com/fluent/helm-charts.