elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Elastic operator renewing Elasticsearch internal certificates breaks Stack Monitoring #5448

Open Ricardolaponder opened 2 years ago

Ricardolaponder commented 2 years ago

Bug Report

What did you do?

We deployed Elasticsearch on Kubernetes with ECK. For monitoring, we deployed a separate monitoring cluster and use Stack Monitoring with Beats to ship data from our production cluster to it. This worked fine until ECK renewed the internal certificates that the Elasticsearch cluster uses for internal communication.

What did you expect to see?

Logs and metrics from the production cluster in Stack Monitoring on the monitoring cluster, both before and after the certificate change.

What did you see instead? Under which circumstances?

After the certificate change, only logs from the production cluster arrived; Metricbeat stopped sending metrics.

Environment

Elasticsearch logged this warning: [2022-03-07T12:47:39,404][WARN ][o.e.h.AbstractHttpServerTransport] [elasticsearch-es-master-2] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:38980} io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate

Metricbeat gave this error: 2022-03-07T15:33:38.709Z ERROR module/wrapper.go:259 Error fetching data for metricset elasticsearch.node_stats: error making http request: Get "https://localhost:9200/_nodes/_local/stats": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "elasticsearch-http")

I saw that the new internal CA was mounted in the Metricbeat container, but I had to restart the Metricbeat container to fix the issue. Filebeat's certificate verification mode is set to certificate.

thbkrkr commented 2 years ago

Yes, I see the problem. The new certificates (from both the monitored cluster and the monitoring cluster) are correctly propagated to the Metricbeat container. Metricbeat uses a persistent connection, so as long as that connection stays established, it keeps working, even if the certificate has expired. As soon as the connection is closed, Metricbeat tries to reconnect using the old certificate, ignores the new one, and fails with the PKI error x509: certificate signed by unknown authority.

Temporary workaround: kill the Beat process so that the Beat container restarts and loads the new certificate (kubectl exec $esPod -c metricbeat -- kill 1).
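For illustration only, here is a minimal Python sketch of the behaviour a client would need to avoid the stale-CA problem described above: rebuild the TLS context from the CA file on disk whenever that file has changed, instead of loading it once at startup. This is not how Metricbeat is implemented (Beats is written in Go); the class name ReloadingCA and the CA path in the usage note are hypothetical.

```python
import hashlib
import ssl
from pathlib import Path
from typing import Optional


class ReloadingCA:
    """Hypothetical helper: rebuilds a TLS context when the CA file changes."""

    def __init__(self, ca_path: str) -> None:
        self.ca_path = Path(ca_path)
        self._digest: Optional[str] = None          # digest of the last CA loaded
        self._context: Optional[ssl.SSLContext] = None

    @staticmethod
    def _digest_of(pem: str) -> str:
        return hashlib.sha256(pem.encode("utf-8")).hexdigest()

    def has_changed(self) -> bool:
        """True if the CA file on disk differs from the one last loaded."""
        return self._digest_of(self.ca_path.read_text()) != self._digest

    def context(self) -> ssl.SSLContext:
        """Return a context that trusts the CA file as it is right now."""
        pem = self.ca_path.read_text()
        digest = self._digest_of(pem)
        if self._context is None or digest != self._digest:
            # Build the context first, so a failed load is retried next time.
            self._context = ssl.create_default_context(cadata=pem)
            self._digest = digest
        return self._context
```

A client would call context() before every new connection, for example urllib.request.urlopen(url, context=ca.context()) with ca = ReloadingCA("/path/to/ca.crt"), so that a reconnect after a certificate renewal picks up the newly mounted CA without a process restart.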

milanage commented 10 months ago

@thbkrkr Issue still happens on ECK 2.6.1 + Stack 8.8.1. It looks like https://github.com/elastic/beats/pull/34416 does not really help. What's our plan to fix this issue?

VCCPlindsten commented 5 months ago

This still happens on ECK 2.10 & Stack 8.12.0

KannappanSomu commented 5 months ago

+1