Azure / azure-databricks-operator

Kubernetes Operator for Databricks
MIT License

Operator becomes unavailable if pod restarts #164

Open EliiseS opened 4 years ago

EliiseS commented 4 years ago

When running load tests, we discovered that if the operator pod restarts, the operator becomes unavailable for about a minute.

Below we can see that Prometheus was unable to gather metrics between 15:20 and 15:21: [image]

Looking into the operator pod, we can see it has restarted: [image]

The pod description shows that the container was terminated due to an error and started again at 15:19:53, which matches the gap found above.

kubectl describe pods/azure-databricks-operator-controller-manager-578c8696bd-8jfw7
...
    State:          Running
      Started:      Tue, 04 Feb 2020 15:19:53 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 04 Feb 2020 14:45:32 +0000
      Finished:     Tue, 04 Feb 2020 15:19:51 +0000
...
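
As a rough cross-check of the timeline, the three timestamps in the output above can be compared directly. The snippet below is only a sketch: the helper name `parse_k8s_time` and the variable names are illustrative, not part of the operator or of kubectl, and it assumes the timestamps are in the format printed above.

```python
from datetime import datetime

# Timestamp layout as printed by `describe` in the output above,
# e.g. "Tue, 04 Feb 2020 15:19:53 +0000" (assumes an English locale).
K8S_TIME_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def parse_k8s_time(value: str) -> datetime:
    """Parse one timestamp copied from the describe output (hypothetical helper)."""
    return datetime.strptime(value, K8S_TIME_FORMAT)


# Values copied verbatim from the pod description.
previous_started = parse_k8s_time("Tue, 04 Feb 2020 14:45:32 +0000")
previous_finished = parse_k8s_time("Tue, 04 Feb 2020 15:19:51 +0000")
current_started = parse_k8s_time("Tue, 04 Feb 2020 15:19:53 +0000")

# How long the previous container ran before it exited with an error.
previous_uptime = previous_finished - previous_started

# How long the container itself was down between the two runs.
restart_gap = current_started - previous_finished

print(f"previous container uptime: {previous_uptime}")
print(f"container restart gap:     {restart_gap}")
```

With these values, the previous container ran for 34 minutes 19 seconds and was restarted 2 seconds after it exited. The missing metrics window (15:20 to 15:21) begins just after the new start time of 15:19:53.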