OCP-on-NERC / xdmod-openshift-scripts

Gather metrics for pods that are scheduled. #10

Open naved001 opened 1 year ago

naved001 commented 1 year ago

https://github.com/OCP-on-NERC/xdmod-openshift-scripts/blob/b534df90573263131a40e299289647562fe0f37b/openshift_metrics/openshift_prometheus_metrics.py#L24

This metric gathers the CPU requests of all pods, regardless of whether they are running.

So, if a pod could not be scheduled, we still end up counting its CPU requests.

I discovered this while trying to gather GPU usage data for the NERC OpenShift cluster. There was a pod that requested a GPU, but it was never scheduled because the cluster does not have an active GPU.

One possible solution is to get an intersection like this: https://github.com/naved001/xdmod-openshift-scripts/blob/d75e06698961a5b9f4db0ac4e86f4e11b30a41a8/openshift_metrics/openshift_prometheus_metrics.py#L26

It worked when I queried GPU metrics, but when I applied this intersection to CPU and memory I got a 422 error code from Prometheus and Thanos. :/

tzumainn commented 1 year ago

Oh, interesting. So it sounds like we should find a similar join that works for CPU/Memory?

naved001 commented 1 year ago

I have settled on using the unless operator. Rather than intersecting the two vectors, it takes their set difference (the complement). From the Prometheus documentation:

vector1 unless vector2 results in a vector consisting of the elements of vector1 for which there are no elements in vector2 with exactly matching label sets. All matching elements in both vectors are dropped.

'kube_pod_resource_request{unit="cores"} unless on(pod, namespace) kube_pod_status_unschedulable'

So, this collects the "cores" requests for pods that are not unschedulable (not unschedulable == schedulable). It uses (pod, namespace) to match between the two vectors, since they do not share all of the same labels.
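To make the matching behaviour concrete, here is a minimal pure-Python sketch of what `unless on(pod, namespace)` does to two instant vectors. It is an illustration only, not how Prometheus implements it. The `unless_on` helper, the sample pods, and their label values are all hypothetical.

```python
def unless_on(left, right, on):
    """Emulate PromQL `left unless on(<on>) right` for instant vectors.

    Each sample is a (labels, value) pair, where labels is a dict.
    A label that is missing from a sample is treated as the empty string.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    # Only the labels listed in `on` take part in the match.
    right_keys = {key(labels) for labels, _ in right}
    # Keep left-hand samples that have no partner on the right-hand side.
    return [(labels, value) for labels, value in left if key(labels) not in right_keys]


# Hypothetical sample data, not taken from a real cluster.
requests = [
    ({"pod": "web-1", "namespace": "demo", "unit": "cores", "node": "n1"}, 0.5),
    ({"pod": "gpu-job", "namespace": "demo", "unit": "cores"}, 2.0),
]
unschedulable = [
    ({"pod": "gpu-job", "namespace": "demo", "uid": "abc"}, 1.0),
]

scheduled = unless_on(requests, unschedulable, on=("pod", "namespace"))
for labels, value in scheduled:
    print(labels["namespace"], labels["pod"], value)
```

Only `web-1` survives: `gpu-job` has a matching (pod, namespace) entry in the unschedulable vector, so its request is dropped even though the other labels on the two samples differ.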

This works better than the old way: I no longer get the 422 error code. The reason for that error was that the old query sometimes resulted in a many-to-many match, which is not allowed.
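A many-to-many match arises when the same (pod, namespace) key appears more than once on both sides of a binary operator. As far as I understand it, arithmetic and comparison operators reject that, while the set operators (and, or, unless) accept it by design. The sketch below is a hypothetical helper for spotting such keys in sample data. It assumes nothing about the exact shape of the old query, and the label values are made up.

```python
from collections import Counter


def duplicate_match_keys(vector, on):
    """Return the match keys that occur more than once in an instant vector."""
    counts = Counter(
        tuple(labels.get(name, "") for name in on) for labels, _ in vector
    )
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical sample data: both sides carry two series for the same pod.
left = [
    ({"pod": "web-1", "namespace": "demo", "resource": "cpu"}, 0.5),
    ({"pod": "web-1", "namespace": "demo", "resource": "memory"}, 1.0),
]
right = [
    ({"pod": "web-1", "namespace": "demo", "container": "app"}, 1.0),
    ({"pod": "web-1", "namespace": "demo", "container": "sidecar"}, 1.0),
]

on = ("pod", "namespace")
left_dupes = duplicate_match_keys(left, on)
right_dupes = duplicate_match_keys(right, on)

# Keys duplicated on both sides would make the match many-to-many.
many_to_many = sorted(set(left_dupes) & set(right_dupes))
print(many_to_many)
```

Running a check like this over the raw results of each side of a query is one way to see which pods would trip the error before sending the combined expression.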