OCP-on-NERC / xdmod-openshift-scripts


Why are we tracking memory limit as opposed to memory request? #9

Open · naved001 opened this issue 1 year ago

naved001 commented 1 year ago

https://github.com/OCP-on-NERC/xdmod-openshift-scripts/blob/b534df90573263131a40e299289647562fe0f37b/openshift_metrics/openshift_prometheus_metrics.py#L26

'kube_pod_resource_request{unit="bytes"}' is what the pod is guaranteed and is used for scheduling purposes.

tzumainn commented 1 year ago

Oh, that kind of makes sense. It sounds like a pod can go over the request though (but not over the limit); does the amount reported in the request metric go up if that happens?

tzumainn commented 1 year ago

Looking at the documentation a bit more, it seems as if a pod can go up to the limit at any time. Given that, I do think the limit metric is the correct one.

Put another way, if a user didn't intend to potentially go past the requested amount up to the limit, why wouldn't they simply set the limit equal to the request?

tzumainn commented 1 year ago

Er, thinking about it even more simply - to me it sounds like 'request' is the minimum and 'limit' is the max, and the pod usage can fluctuate anywhere between the two. Wouldn't we want to keep track of the max in that case?

naved001 commented 1 year ago

Wouldn't we want to keep track of the max in that case?

If there was an easy way to track that then sure.

To be clear, the metric 'kube_pod_resource_limit{unit="bytes"}' tells us the maximum memory a pod can burst up to, while 'kube_pod_resource_request{unit="bytes"}' tells us the minimum guaranteed memory it will be allocated if scheduled. Neither metric tells us whether the pod is currently using the "request" amount of memory or the "limit" amount. The same applies to the CPU request and limit. And so that we don't overcharge a customer, I think it's safer to charge for the requested amount (because that is what's guaranteed if the pod is scheduled).
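For reference, here is a minimal sketch of the two fields behind those metrics in a container spec (resources.requests and resources.limits are standard Kubernetes fields; the container name, image, and values are made up for illustration):

spec:
  containers:
    - name: app                               # hypothetical container
      image: registry.example.com/app:latest  # hypothetical image
      resources:
        requests:
          memory: "4Gi"   # guaranteed minimum; surfaced by kube_pod_resource_request{unit="bytes"}
          cpu: "1"
        limits:
          memory: "8Gi"   # burst ceiling; surfaced by kube_pod_resource_limit{unit="bytes"}
          cpu: "2"

A pod like this can sit anywhere between 4 GiB and 8 GiB at any given moment, which is exactly why neither metric reflects current consumption.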

tzumainn commented 1 year ago

I guess my question would be: why would someone set the limit higher than the request if they didn't want to accommodate the possibility that their application would end up using that extra memory?

joachimweyl commented 1 year ago

@syockel & @waygil my understanding is that we want to charge on the requested amount because that is the level they will be at if we need the rest of the resources; if we are not using the rest, they get some extra. In the beginning, when there is little usage, everyone will probably get to use close to their "limit". As time goes on and we utilize our nodes closer to capacity, users will start to see numbers closer to the requested value.

@tzumainn is there a way to track actual usage? That could be another option: charge directly on usage.

naved001 commented 1 year ago

In the beginning, when there is little usage everyone will probably get to use "limit" level usage.

Even if cluster-wide usage is low, a project cannot exceed its own quota in its namespace, so the pods may still end up running with only the requested resources.
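To illustrate the quota point, here is a sketch of a namespace ResourceQuota that caps both requests and limits for a project (the name, namespace, and numbers are invented for the example):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota         # hypothetical name
  namespace: example-project  # hypothetical namespace
spec:
  hard:
    requests.memory: 32Gi
    limits.memory: 64Gi
    requests.cpu: "8"
    limits.cpu: "16"

With a quota like this in place, a project's pods cannot collectively request or burst beyond these caps, no matter how idle the rest of the cluster is.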

syockel commented 1 year ago

What we need to understand is: when multiple pods/containers share the physical hardware, how much memory is left available for others? If kube_pod_resource_limit is what gets scheduled, is the rest available to be scheduled for another pod? That is what matters. So just run the experiment: launch multiple pods and see whether the remaining memory can be claimed in the "limit" case versus the "requested" case. If a pod has a high limit but only requests a small amount, is the difference free to be scheduled for another pod?

naved001 commented 1 year ago

I did some tests and verified that setting limits.memory (or limits.cpu) is not taken into account when scheduling; only requests.memory and requests.cpu affect scheduling decisions. So setting a really high limits.memory does not prevent other pods from being scheduled.

I successfully created two pods whose combined limits.memory exceeded the host's total memory, and both still ran on the host. In the pod definition I set

spec:
  nodeSelector:
    kubernetes.io/hostname: oct-5-05-compute.ocp-prod.massopen.cloud

which pinned our pods to that host. The host had a total memory of 277 GiB. I then created a couple of pods with resources.limits.memory set to 150 GiB each (so 300 GiB in total) and resources.requests.memory set lower (10 GiB), and both pods were running.

I then swapped the values of limits and requests, i.e. requests.memory was now 150 GiB for each pod. In that case I was only able to run one of the pods; the other was stuck in the Pending state.
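For concreteness, here is roughly what the first test's pod spec looked like, reconstructed from the description above (the pod name, container name, image, and command are placeholders; only the nodeSelector and the memory values come from the comment):

apiVersion: v1
kind: Pod
metadata:
  name: memtest-1  # placeholder name
spec:
  nodeSelector:
    kubernetes.io/hostname: oct-5-05-compute.ocp-prod.massopen.cloud
  containers:
    - name: memtest
      image: registry.example.com/busybox:latest  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          memory: "10Gi"   # small request: two such pods fit on the 277 GiB host
        limits:
          memory: "150Gi"  # combined limits (300 GiB) exceed the host, yet both pods ran

Swapping the request and limit values (requests.memory: 150Gi) reproduces the second test, where only one pod fits and the other stays Pending.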

joachimweyl commented 11 months ago

@tzumainn I believe this is resolved; can you confirm?

tzumainn commented 11 months ago

@naved001 I kinda lost the thread on this one - are you happy with the memory metric you're using now?

naved001 commented 11 months ago

Yes, I am happy with the memory metric I am using, i.e. requests.memory, but that change is not reflected in this repo since it is no longer in use for now.