elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Beat readiness probe #3197

Open sebgl opened 4 years ago

sebgl commented 4 years ago

We probably want to introduce a readiness probe for Beats. It's a bit surprising right now to see filebeat "ready" while Elasticsearch is unavailable.

It looks like we could execute a filebeat test output command. To investigate.

david-kow commented 4 years ago

What should "ready" indicate, though? If a Beat can start taking logs/metrics in, I'd consider it ready even if the output itself is not ready. I'd think that's what the output's own readiness (ES, for instance) is for.

anyasabo commented 4 years ago

For filebeat, filebeat test output is at least what the helm chart uses: https://github.com/elastic/helm-charts/blob/master/filebeat/values.yaml#L72

anyasabo commented 4 years ago

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-readiness-probe

If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In this case, the readiness probe might be the same as the liveness probe, but the existence of the readiness probe in the spec means that the Pod will start without receiving any traffic and only start receiving traffic after the probe starts succeeding. If your Container needs to work on loading large data, configuration files, or migrations during startup, specify a readiness probe.

If you want your Container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness that is different from the liveness probe.

The main reason I can think of for defining a readiness probe is if you were using beats to monitor your other beats. In that case I think you would want to know if the beat was up but the output was down (and so it should be ready even if the output is down).

"Is the output responding" seems more of a question of health in the beats status. I'm not sure there's a good way for ECK to retrieve that though. We currently define beat health as:

const (
    // BeatRedHealth means that the health is neither yellow nor green.
    BeatRedHealth BeatHealth = "red"

    // BeatYellowHealth means that:
    // 1) at least one Pod is Ready, and
    // 2) association is not configured, or configured and established
    BeatYellowHealth BeatHealth = "yellow"

    // BeatGreenHealth means that:
    // 1) all Pods are Ready, and
    // 2) association is not configured, or configured and established
    BeatGreenHealth BeatHealth = "green"
)
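
Read literally, those comments map onto a small decision function. The sketch below is one possible reading, not ECK's actual implementation: `calculateHealth` and its parameters are hypothetical, and treating zero ready Pods as red is an assumption drawn from the "at least one Pod is Ready" condition.

```go
package main

import "fmt"

type BeatHealth string

const (
	BeatRedHealth    BeatHealth = "red"
	BeatYellowHealth BeatHealth = "yellow"
	BeatGreenHealth  BeatHealth = "green"
)

// calculateHealth is a hypothetical helper that follows the comments above:
// the association must be either not configured, or configured and established,
// and then the Pod counts decide between green, yellow, and red.
func calculateHealth(assocConfigured, assocEstablished bool, readyPods, desiredPods int) BeatHealth {
	if assocConfigured && !assocEstablished {
		return BeatRedHealth
	}
	switch {
	case readyPods == 0:
		// Assumption: no ready Pod means neither yellow nor green.
		return BeatRedHealth
	case readyPods == desiredPods:
		return BeatGreenHealth
	default:
		return BeatYellowHealth
	}
}

func main() {
	fmt.Println(calculateHealth(true, true, 2, 3))
}
```

Nothing in this computation looks at whether the output is actually responding, which is the gap being discussed.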

david-kow commented 4 years ago

In that case I think you would want to know if the beat was up but the output was down (and so it should be ready even if the output is down).

I'm not sure I get what you mean here. If we have:

ES    <----    Metricbeat    --(monitoring)-->    Filebeat    --(shipping logs for)-->    Pod

Then we can have the following (main) failure cases:

  1. Pod is down - Metricbeat and Filebeat are ready
  2. Filebeat is down - Metricbeat reports that Filebeat is down, but Metricbeat itself is ready
  3. ES is down - Metricbeat can't output, but it's running (and caches the data) so it's ready

For "Is the output responding" I agree it's difficult, I think we would only know from logs that there is an issue.

anyasabo commented 4 years ago

I'm not sure I get what you mean here.

Because I did a poor job of explaining it :D What I meant was that I think we want to leave it as is, for the reasons you described in your comment. If we want to do anything, it would be exposing the output status in the Beats CR, but I'm not sure we can do that simply (maybe the beats state/status endpoint exposes the info?).

pebrc commented 4 years ago

We should probably close this in favour of another issue that will update the status of the Beats resource with some information about the output status.

As an aside, since filebeat test output was mentioned: it returns an error despite a working configuration, because of a DNS check it performs:

[root@gke-pebrc-dev-cluster-default-pool-0ce0f2c1-nl52 filebeat]# filebeat test output
elasticsearch: http://elasticsearch:9200...
  parse url... OK
  connection...
    parse host... OK
    dns lookup... ERROR lookup elasticsearch on 10.73.16.10:53: no such host