Nodes restart continuously when adding plugin.

Aiven-Open / prometheus-exporter-plugin-for-opensearch

Prometheus exporter plugin for OpenSearch & OpenSearch Mixin

Apache License 2.0


Nodes restart continuously when adding plugin. #248

Closed vipinjn24 closed 4 months ago

vipinjn24 commented 5 months ago

Using OpenSearch 2.11.1

After adding the plugin through the k8s operator, the cluster nodes restart continuously and the cluster never becomes stable. I can confirm that before adding the plugin, the nodes were working fine on the same release.

Please see the attached log from one of the master nodes: opensearch.log

The issue also occurs on 2.11.0. I tried 2.8.0 and it works fine, but these two versions do not.

The cluster runs on bare metal.

lukas-vlcek commented 5 months ago

Edit: I removed the release content from the ticket description.

I do not see anything suspicious in the log. Is the log complete? Do you think you can share more details about how you set up the cluster?

It might be either the plugin or the k8s operator... hard to say right now. In any case it seems that having an integration test with the k8s operator will be useful, something relevant to https://github.com/Aiven-Open/prometheus-exporter-plugin-for-opensearch/issues/240

lukas-vlcek commented 5 months ago

@vipinjn24 One more note: the test suite for this plugin contains tests that do full cluster formation with the plugin installed on both nodes. The cluster consists of two nodes. Admittedly, it does not include the security plugin, but at least a basic smoke test is part of every release. If you can provide more complete logs, that would be great.

vipinjn24 commented 5 months ago

The pods are terminated abruptly at arbitrary points during the initialization phase, not at any specific step. I can't say why, but let me fetch logs from two different runs. I will get back as soon as possible.

vipinjn24 commented 5 months ago

opensearch-node.log opensearch-coordinator.log opensearch-master.log

These are attached files.

Only 2 master nodes bootstrapped after 7 restarts. 1 master node, 2 data nodes, and 1 coordinator node are still restarting, with different logs.

vipinjn24 commented 5 months ago

I don't know why, but this time I deleted the cluster and created it from scratch, and now it works. Strange :(

lukas-vlcek commented 5 months ago

Maybe there was something wrong/corrupted with the data stored on persistent volumes if anything like that was re-attached to the nodes?

vipinjn24 commented 5 months ago

I am 100% sure I deleted it beforehand, but the cluster was still restarting.

vipinjn24 commented 4 months ago

Hmm, now I restarted the Kubernetes cluster and the entire cluster fails to start. I am trying to restart after removing the plugin.

vipinjn24 commented 4 months ago

I did some more digging and found that these restarts are related to the startup probes added to the nodes. Since these probes are hard-coded, I had to raise a PR to update the logic of the operator and the charts. Pull Request

I was able to build this code locally, push the image to my network Docker registry, and update the operator to use it. It is now working fine after raising the failure threshold to somewhat more than the initial value of 10.
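For anyone hitting the same symptom, here is a rough sketch of why the failure threshold matters. A Kubernetes startup probe gives a container roughly `initialDelaySeconds + failureThreshold * periodSeconds` to come up before the kubelet restarts it, so a node that starts more slowly with an extra plugin installed can exceed that budget. Only the threshold of 10 comes from this thread; the function names, the 20-second period, and the 300-second startup time below are assumptions for illustration, not values taken from the operator.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container has to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


def min_failure_threshold(startup_seconds, period_seconds, initial_delay_seconds=0):
    """Smallest failureThreshold whose budget covers the given startup time."""
    remaining = max(0, startup_seconds - initial_delay_seconds)
    return max(1, math.ceil(remaining / period_seconds))


if __name__ == "__main__":
    # failure_threshold=10 is the initial value mentioned in this thread.
    # period_seconds=20 and the 300 s startup time are hypothetical.
    print("budget with threshold 10:", startup_budget_seconds(10, 20), "s")
    print("threshold needed for 300 s startup:", min_failure_threshold(300, 20))
```

If the measured startup time of a node is close to or above the computed budget, raising the threshold (or the period) is the knob to look at, which matches what fixed it for me.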

Hoping to soon get this merged and released officially.

We can mark this issue closed

lukas-vlcek commented 4 months ago

Thanks for the investigation, @vipinjn24, and for the k8s operator PR. Good job!