fluent / helm-charts

Helm Charts for Fluentd and Fluent Bit

[fluentd] Add StatefulSet #12

Closed: nvtkaszpir closed this issue 3 years ago

nvtkaszpir commented 4 years ago

In general it would allow creating log aggregator instances, as described at https://docs.fluentd.org/deployment/high-availability, but in more of an active-active setup.

Use case: We use fluentd as a daemonset, but with a custom config which forwards all processed events into another fluentd instance. That fluentd instance then sends the data to Elasticsearch. It acts more like a buffer for the daemonsets, because that fluentd instance stores data on disk in case of Elasticsearch unavailability or slowness. Other reasons:

Why not use the daemonset to send directly to Elasticsearch? Daemonsets tend to have small buffers on disk (usually a few megabytes, but still), which pauses processing of the input when the buffers are full. So if you have a slow backend service, your whole logging pipeline is throttled, and in some cases logs from short-lived containers are lost. Moreover, when used with preemptible instances in GKE or spot instances in AWS, fluentd may simply not have enough time to retry and flush the buffer before the instance is force-killed. That's why we reconfigure it to send data to another fluentd instance, which also works as a buffer (it runs on long-lived nodes); a sketch of that forwarding config follows below.

Why not a deployment for fluentd? As with the daemonset above, the buffer is kept on disk, and it works okay until the pod is terminated. If the pod is not able to flush the data from disk to the backend service, the buffered data will be lost. So, like the daemonset, a deployment is ok to a certain extent, but it is still lacking, especially in edge situations.
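For illustration, a minimal sketch of that daemonset-side forwarding config, written as it might appear in a values file (the `fileConfigs` key, the aggregator service name and the paths are assumptions, not the chart's actual defaults):

```yaml
# Hypothetical values snippet for the daemonset: forward all processed events
# to an aggregator instead of writing to Elasticsearch directly.
fileConfigs:
  04_outputs.conf: |-
    <match **>
      @type forward
      <server>
        # assumed Service name of the aggregator StatefulSet
        host fluentd-aggregator.logging.svc.cluster.local
        port 24224
      </server>
      <buffer>
        @type file
        path /var/log/fluentd-buffers/forward.buffer
        flush_interval 5s
        retry_forever true
      </buffer>
    </match>
```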

Fix: Use a StatefulSet with a dedicated PVC for the disk buffer. This way the volume names are predictable, and you can scale the system up and down much more easily while keeping the data, even to the extent of totally wiping the pods and recreating them: the disks will be re-attached with the old data. It can also be extended with autoscaling capabilities based on custom metrics, for example disk usage of the buffer pool. The termination grace period would need to be adjusted so that the pod has time to flush data to the backend service; this depends on the use case, but usually 1h is enough.
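A minimal sketch of that fix, using placeholder names, image and sizes rather than the chart's actual values:

```yaml
# Hypothetical StatefulSet for a fluentd aggregator with a per-pod buffer PVC.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: fluentd-aggregator
spec:
  serviceName: fluentd-aggregator
  replicas: 2
  selector:
    matchLabels:
      app: fluentd-aggregator
  template:
    metadata:
      labels:
        app: fluentd-aggregator
    spec:
      # give fluentd time to flush the on-disk buffer to the backend before shutdown
      terminationGracePeriodSeconds: 3600
      containers:
        - name: fluentd
          image: fluent/fluentd # placeholder, pin a concrete tag in practice
          volumeMounts:
            - name: buffer
              mountPath: /var/log/fluentd-buffers
  volumeClaimTemplates:
    - metadata:
        name: buffer
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```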

Something similar is currently done with Elasticsearch clusters (a StatefulSet with the termination grace period set to 1h so that the pod can move its data shards off before shutdown).

naseemkullah commented 4 years ago

Thanks @nvtkaszpir, sorry for the delay in responding. Would you be interested in submitting a PR? The pod template can be leveraged for the statefulset, https://github.com/fluent/helm-charts/blob/master/charts/fluentd/templates/_pod.tpl, to keep the pod spec consistent across workload types.
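For illustration, a statefulset.yaml could wrap that shared pod template roughly like this (the helper names such as fluentd.pod, and the kind/volumeClaimTemplates values, are assumptions rather than the chart's actual interface):

```yaml
# Hypothetical charts/fluentd/templates/statefulset.yaml
{{- if eq .Values.kind "statefulset" }}
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "fluentd.fullname" . }}
  labels:
    {{- include "fluentd.labels" . | nindent 4 }}
spec:
  serviceName: {{ include "fluentd.fullname" . }}
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "fluentd.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "fluentd.selectorLabels" . | nindent 8 }}
    spec:
      # assumed name of the define in _pod.tpl; reusing it keeps the pod spec
      # identical across daemonset, deployment and statefulset
      {{- include "fluentd.pod" . | nindent 6 }}
  volumeClaimTemplates:
    {{- toYaml .Values.volumeClaimTemplates | nindent 4 }}
{{- end }}
```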

nvtkaszpir commented 4 years ago

Maybe next week I'll be able to do it, we'll see.

stevehipwell commented 4 years ago

Any progress on this?

nvtkaszpir commented 4 years ago

I wasn't able to get to it recently for multiple reasons, but it is on my todo list for the upcoming two weeks.

stevehipwell commented 4 years ago

@nvtkaszpir are you working from the current stable chart, another chart (e.g. Bitnami), or from scratch?

nvtkaszpir commented 4 years ago

I haven't decided yet, but it looks like the current chart will be expanded with a statefulset.yaml, because that is the easiest option. Later on I will probably import stuff from helm/charts stable/fluentd. I did not look at Bitnami. Any particular suggestions?

First I want to add the code to use for a PR on GitHub, something like this:

That would be the base for further improvements.

stevehipwell commented 4 years ago

@nvtkaszpir I'd strongly suggest basing your chart changes on stable/fluentd (the statefulset parts). My pain points are currently around the way the stable/fluentd Helm chart is so tightly coupled to its Docker image, plus the fact that there isn't a default Fluentd image anywhere with the correct config to install plugins dynamically (the image in stable/fluentd can, as can Bitnami's).

nvtkaszpir commented 4 years ago

The chart and the image used are another story, because it really depends on what we want to get from fluentd:

For me personally, the stable/fluentd helm chart has some weird logic which decides whether the setup is a daemonset/deployment or a statefulset. In fact I use different charts for different purposes (like the kiwigrid chart for fluentd-elasticsearch with a custom config using the forward protocol to push to the stable/fluentd chart, which finally pushes to elasticsearch).

Either way, those will go into separate tickets.

stevehipwell commented 4 years ago

@nvtkaszpir I'm not sure if it helps but I've created a statefulset Helm chart for Fluentd as part of my chart repo https://github.com/stevehipwell/helm-charts/tree/master/charts/fluentd.

Also, in my opinion, when running as an aggregator (statefulset), the small startup cost of installing a limited set of plugins far outweighs the complexity of managing a custom Docker image. So a chart for aggregation should allow the user to make this decision.
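A sketch of what that decision could look like in a values file, assuming a plain upstream image and a command override (neither key is an existing chart interface):

```yaml
# Hypothetical values snippet: install a small set of plugins at container start,
# then exec fluentd, instead of maintaining a custom Docker image.
command:
  - /bin/sh
  - -c
  - |
    fluent-gem install fluent-plugin-prometheus fluent-plugin-elasticsearch \
      && exec fluentd -c /fluentd/etc/fluent.conf
```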

nvtkaszpir commented 4 years ago

Sorry for the delay (lots of things on my plate). I've been doing some helm chart updates this week, which should address multiple issues already reported.

In short:

PR should be ready this week.

nvtkaszpir commented 4 years ago

Okay, it took a bit longer than expected due to other issues, but I'm aiming for this Thursday with a PR labeled as Work In Progress.

The major slowdowns were:

If you have any suggestions even before seeing the PR, just drop a comment here.

stevehipwell commented 4 years ago

@nvtkaszpir how are you getting on with this?

I've got a couple of questions/suggestions:

nvtkaszpir commented 4 years ago

Sorry for the delay, some urgent private stuff popped up, so no PR yet. Edit: this is turning into multiple issues in one issue... as usual ;)

> Are you planning on creating a separate chart for aggregation via a statefulset, e.g. fluentd-aggregator?

fluent-bit is another story.

After some work, it looks like keeping daemonsets, deployments and statefulsets in the same chart is indeed problematic. Some things can be handled with the _pod template, but most of the differences between sts/deployment/daemonset remain, especially because of the containers and configs used. I've managed to create a specific sts setup, but that means it works with a custom image and custom configs and has custom lifecycle hooks and so on... The general feeling is that customizations should go into values files, especially if some settings are actually very specific to the plugins installed.

This unfortunately leads to more and more specific use cases, so chart-testing should be used to run the tests with specific values files (for example sts + pvc + custom container and so on), but that also means the helm test task really differs depending on the fluentd config.
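For example, chart-testing picks up per-scenario values files from a chart's ci/ directory, so a statefulset scenario could be exercised with something like the following (the file name and value keys are illustrative only):

```yaml
# Hypothetical charts/fluentd/ci/statefulset-values.yaml used by chart-testing (ct)
kind: statefulset
persistence:
  enabled: true
  size: 5Gi
terminationGracePeriodSeconds: 3600
```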

I'm not convinced (at least yet) about creating a separate chart for a specific fluentd use case, though other solutions already exist, such as the kiwigrid helm chart for fluentd-elasticsearch (which is a daemonset and by default sends to an elastic cluster, but that can easily be altered to send to anything you want, for example a separate fluentd sts cluster as I do; the chart itself has some good configs to fetch logs from multiple sources such as containers, systemd, saltstack and so on).

I'll be able to create the PR this week (I finally managed to find some time for this) as a WIP/Draft so you can see it. Note that it will be a huge one and will surely need to be replaced by some smaller ones.

Current progress (not published yet)

> Are you using the Grafana sidecar pattern for dashboards?

Yeah, the idea is to add a configmap with a dashboard to the helm chart, so that the grafana dashboard-import sidecar can import it when the grafana pod starts (init phase).
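A minimal sketch of that pattern; the grafana_dashboard label key matches the common default of the Grafana chart's dashboard sidecar, but it is configurable per installation:

```yaml
# Hypothetical ConfigMap carrying a fluentd dashboard for the Grafana sidecar to pick up.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-dashboard
  labels:
    grafana_dashboard: "1" # label key watched by the dashboard sidecar
data:
  # placeholder body; the real exported dashboard JSON goes here
  fluentd.json: |-
    {"title": "Fluentd", "panels": []}
```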

> What are the issues with worker counts and buffer flushing?

There are actually two issues:

> What issues are you having with the buffering on Docker?

We had an issue with some apps that were producing such massive amounts of logs that the logs were rotated faster than fluentd could follow them (because it did not have enough CPU), which resulted in losing logs. But that's more an issue of the app generating too many logs, and of not adjusting fluentd to cope with that amount of data.

> Have you looked at creating a generic aggregator docker image with Prometheus and other common plugins?

For now I used my own custom image. The default image does not expose metrics at all, which forces you to use custom healthchecks. But this raises another issue: who will maintain such a container? There are some already on the net, but usually they are tailored to a specific use case (such as in other helm repos or in fluentd-kubernetes-daemonset).
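For reference, a sketch of what exposing metrics could look like once fluent-plugin-prometheus is available in the image (the fileConfigs and livenessProbe keys are assumptions; the source blocks follow that plugin's documented config):

```yaml
# Hypothetical values snippet: expose /metrics via fluent-plugin-prometheus and
# use it for a basic liveness check instead of a custom healthcheck script.
fileConfigs:
  00_metrics.conf: |-
    <source>
      @type prometheus
      bind 0.0.0.0
      port 24231
      metrics_path /metrics
    </source>
    <source>
      @type prometheus_monitor
    </source>

livenessProbe:
  httpGet:
    path: /metrics
    port: 24231
```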

nvtkaszpir commented 3 years ago

Long time no see. Unfortunately, due to IP protection from the client, I was unable to publish it. As a workaround I would have to rewrite it significantly to make it compliant with the prior client's requirements.

Since I changed companies, this has dropped on my todo list from important to 'nice to have', because we don't use fluentd at the new place. Either way, it would end up in another repo anyway (the official helm chart).

Thus, I'm closing this issue. If you still want to see it implemented, check the issues at https://github.com/fluent/helm-charts.