Open EliiseS opened 4 years ago
When running load tests we discovered that if the operator pod restarts, the operator becomes unavailable for about a minute.
Below we can see that Prometheus was unable to gather metrics between 15:20 and 15:21:
Looking into the operator pod, we can see it has restarted:
Inside the pod we can see it was terminated due to an error and started again at 15:19:53, which matches the gap found above.
k describe pods/azure-databricks-operator-controller-manager-578c8696bd-8jfw7
...
State:          Running
  Started:      Tue, 04 Feb 2020 15:19:53 +0000
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Tue, 04 Feb 2020 14:45:32 +0000
  Finished:     Tue, 04 Feb 2020 15:19:51 +0000
...
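For reference, the timings in the output above can be worked out with a small sketch like the one below. It only uses the three timestamps from the `describe` output; the variable names are made up for illustration. It suggests the container itself was back within a couple of seconds, so the roughly one-minute metrics gap is probably not explained by the container restart time alone.

```python
from datetime import datetime

# Timestamp layout used by `kubectl describe`, e.g. "Tue, 04 Feb 2020 15:19:53 +0000"
FMT = "%a, %d %b %Y %H:%M:%S %z"


def parse(ts: str) -> datetime:
    """Parse a timestamp copied from the describe output."""
    return datetime.strptime(ts, FMT)


# Values copied from the pod description above (names are illustrative only).
previous_started = parse("Tue, 04 Feb 2020 14:45:32 +0000")
previous_finished = parse("Tue, 04 Feb 2020 15:19:51 +0000")
current_started = parse("Tue, 04 Feb 2020 15:19:53 +0000")

# How long the previous container ran before it exited with an error.
previous_uptime = previous_finished - previous_started

# How long it took for the replacement container to start.
restart_delay = current_started - previous_finished

print(f"previous container ran for {previous_uptime}")
print(f"new container started after {restart_delay.total_seconds():.0f}s")
```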