berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Use managed prometheus #3984

Open ryanlovett opened 1 year ago

ryanlovett commented 1 year ago

Summary

User Stories

We currently deploy prometheus as part of our hub. This is fairly straightforward; however, it becomes unavailable whenever something happens to the core nodes. For example, one of the nodes hit a very high load today, and even if we had an alert for such a condition, prometheus couldn't have responded because it was taken down by that same high load.

Using the managed service would avoid this.

Acceptance criteria

Important information

Tasks to complete

balajialg commented 1 year ago

@ryanlovett Naive question: looking at the historical data for the past seven days, it seems the load on Friday was not as high as the load during the previous two days. Do you have any hypotheses on why Friday's load in particular had an impact? Am I missing something in this data?

(screenshots: historical load graphs for the past seven days)

ryanlovett commented 1 year ago

@balajialg I would guess it has something to do with the academic cycle since it is the Friday before RRR week. You could drill down and see if one particular hub had much less usage or if there was a drop across all of them.

balajialg commented 1 year ago

@ryanlovett It was a drop across almost all the major hubs. Ref: https://docs.google.com/document/d/1hw3wR_1Dc40pm7OsZYubzrkk6SD4q3vzA6i89TqStKE/edit?usp=sharing. I'm getting more curious about why we had an outage despite traffic being lower than on the previous days across multiple hubs.

yuvipanda commented 1 year ago

So, I created a new nodepool for support, gave it 48G of RAM, and tried to get the list of running users over the last 90 days.

It failed with a timeout, and actual memory usage never went past 8G.

So I went back and looked at the PromQL query itself:

# Sum up all running user pods by namespace
sum(
  # Grab a list of all running pods.
  # The group aggregator always returns "1" for the number of times each
  # unique label appears in the time series. This is desirable for this
  # use case because we're merely identifying running pods by name,
  # not how many times they might be running.
  group(
    kube_pod_status_phase{phase="Running"}
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server", namespace=~".*"}
  ) by (namespace, pod)
) by (namespace)

I noticed that the inner query has a namespace=~".*" matcher that is basically a no-op, since it matches everything, but it definitely slows everything down massively because it's a regex. I removed it, and still no luck. So I removed the entire group() (which is there to keep out pods that might hang on in a 'completed' or 'pending' state - a temporary and uncommon occurrence).
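
For reference, dropping only the regex matcher (the first change described above) would leave the query looking roughly like this - a sketch reconstructed from the description, not copied from the dashboard:

# Sum up all running user pods by namespace, without the namespace regex
sum(
  group(
    kube_pod_status_phase{phase="Running"}
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
  ) by (namespace, pod)
) by (namespace)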

So it looked like this now:

# Sum up all running user pods by namespace
sum(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
) by (namespace)

and now it loads quickly, no problem. It produces 6-month charts too.

(screenshot: 6-month chart of running users by namespace)

Max memory usage of the server pod is 15G, which is well within its previous memory limit of 24G.

So, I don't think this is a resource problem - I think our PromQL queries need to be optimized for this to work.
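
One possible direction for that optimization (a sketch only, not validated against this deployment): keep the Running-phase filter but drop the regex matcher and discard zero-valued phase series before the join, so fewer series enter the group_right() match.

# Sketch: count running singleuser pods per namespace while keeping the
# phase filter. The "== 1" keeps only series whose value is 1, i.e. pods
# actually in the Running phase, before the join.
sum(
  group(
    kube_pod_status_phase{phase="Running"} == 1
  ) by (pod)
  * on (pod) group_right() group(
    kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
  ) by (namespace, pod)
) by (namespace)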

yuvipanda commented 1 year ago

I opened https://github.com/jupyterhub/grafana-dashboards/pull/50 to remove that one extra regex, but other optimization work still needs to happen.

yuvipanda commented 1 year ago

https://cloud.google.com/monitoring/uptime-checks provides blackbox uptime checks that can be used to verify that prometheus is up, and it can send alerts to people through a variety of channels.

I would also suggest checking to see if the prometheus on the new node actually needs that much RAM. I don't think it does, but I'll leave it as is for now.
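
A quick way to sanity-check that would be a query along these lines (a sketch; the "support" namespace and the pod-name pattern are assumptions about this deployment and may need adjusting):

# Peak working-set memory of the prometheus server pod over the past week.
# Namespace and pod-name pattern are assumptions; adjust to the actual release.
max_over_time(
  sum(
    container_memory_working_set_bytes{
      namespace="support",
      pod=~"support-prometheus-server-.*",
      container!="", container!="POD"
    }
  )[7d:]
)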