Azure / azure-databricks-operator

Kubernetes Operator for Databricks
MIT License

Delays in reconciliation under load #140

Closed stuartleeks closed 4 years ago

stuartleeks commented 4 years ago

(Pre-emptive shout-out to @EliiseS, @storey247 and @lawrencegripper as this work has been a group effort.) As mentioned in #131, we have been performing some load tests against the operator. Our initial load run shows raised work-queue latency and an increasing work-queue depth.

image

It's worth noting that the histogram buckets for the latency are 0.1s, 1s and 10s, so a value of 10 on the graph in effect means the latency was somewhere between 1s and 10s.

Looking at the metrics for the mock API that we're using for the load tests, the response times look pretty constant:

image

What we can see in the mock API metrics are periods of time where no requests are being made to the API (and these become more pronounced as the test load ramps up).

Looking at this, our hypothesis was that something is causing the reconciliation loops to block.
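A toy model of why a blocked loop would match the symptoms: if work keeps arriving while the worker is stalled, the queue depth climbs, and a worker already running at capacity never drains the backlog afterwards. The Go sketch below is a deliberately simplified, deterministic simulation (one worker, one item per tick); `simulateDepth` and its parameters are hypothetical and do not reflect how the operator's controller is implemented.

```go
package main

import "fmt"

// simulateDepth returns the queue depth after each tick. Each tick,
// arrivalsPerTick items are enqueued and the single worker dequeues one
// item, unless the tick falls in the blocked window [blockedFrom, blockedTo).
func simulateDepth(ticks, arrivalsPerTick, blockedFrom, blockedTo int) []int {
	depth := 0
	depths := make([]int, 0, ticks)
	for t := 0; t < ticks; t++ {
		depth += arrivalsPerTick
		blocked := t >= blockedFrom && t < blockedTo
		if !blocked && depth > 0 {
			depth-- // the worker handles one item per unblocked tick
		}
		depths = append(depths, depth)
	}
	return depths
}

func main() {
	// One arrival per tick, worker blocked during ticks 2 and 3.
	fmt.Println(simulateDepth(6, 1, 2, 4))
}
```

In this model the backlog built up during the stall persists once the worker resumes, which is consistent with a rising work-queue depth alongside gaps in outgoing API requests.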

stuartleeks commented 4 years ago

Closed by #141