eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Setting a devfile value for `cpuLimit` should not override the default value for `cpuRequest` #23245

Open cgruver opened 1 week ago

cgruver commented 1 week ago

Describe the bug

As a developer, if I configure my devfile with cpuLimit, I expect that the default value for cpuRequest will be used, since I did not modify it.

However, when my workspace is started, the container's resources.requests.cpu value is set equal to resources.limits.cpu. This behavior can result in under-utilized cluster nodes on which no more Pods can be scheduled.

Che version

7.90

Steps to reproduce

  1. Create a workspace from: https://github.com/cgruver/devspaces-tiny-workspace.git

  2. Modify the devfile.yaml by adding cpuLimit & memoryLimit:

```yaml
schemaVersion: 2.2.0
attributes:
  controller.devfile.io/storage-type: per-workspace
metadata:
  name: tiny-workspace
components:
- name: dev-tools
  container:
    image: quay.io/cgruver0/che/tiny-workspace
    cpuLimit: 1000m
    memoryLimit: 5Gi
```
  3. Restart the workspace from the local devfile.yaml

  4. Observe the Pod of the workspace:

```yaml
  containers:
    - resources:
        limits:
          cpu: 1500m
          memory: 6Gi
        requests:
          cpu: 1500m
          memory: 320Mi
```

Expected behavior

I expect the default value for resources.requests.cpu to be applied when only cpuLimit is set in the devfile.

Runtime

OpenShift

Screenshots

No response

Installation method

OperatorHub

Environment

macOS

Eclipse Che Logs

No response

Additional context

No response

ibuziuk commented 1 week ago

@cgruver I might be missing something, but why don't you specify requests at the devfile level if you want them to be applied?

(screenshot attached, 2024-11-13)

The current behaviour is expected even though it is a nuanced approach with the limit = request strategy
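
A sketch of that suggestion, applied to the reproducer devfile above: cpuRequest and memoryRequest are the standard devfile 2.x container fields for setting requests independently of limits, and the request values here are illustrative, not recommendations.

```yaml
components:
- name: dev-tools
  container:
    image: quay.io/cgruver0/che/tiny-workspace
    cpuRequest: 100m       # explicit request, decoupled from the limit
    cpuLimit: 1000m
    memoryRequest: 320Mi
    memoryLimit: 5Gi
```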

cgruver commented 1 week ago

Yeah, I understand the approach. And I am also a HUGE advocate for developer autonomy. But in this particular case, I have a different opinion derived from hard experience.

In an enterprise with a couple thousand developers writing devfiles, it is very common for them to not grasp the consequences of their configuration.

I have seen too many under-utilized but over-scaled OpenShift clusters because developers set their resources.requests.cpu to an unreasonably high value.

If a platform engineer can set reasonable default values for workspace requests and limits, then more experienced developers can raise limit values as needed, without causing the Kubernetes scheduler to consider a node full when it is not.
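
The scheduler math behind that concern can be sketched with hypothetical numbers (the node size and request values below are illustrative):

```python
# Illustration: the Kubernetes scheduler reserves CPU by *requests*, not by
# actual usage, so inflated requests fill a node on paper long before the
# hardware is busy.
NODE_ALLOCATABLE_CPU_M = 8000  # hypothetical 8-core node, in millicores

def schedulable_workspaces(cpu_request_m: int) -> int:
    """How many workspace pods fit on the node, by request accounting alone."""
    return NODE_ALLOCATABLE_CPU_M // cpu_request_m

# With limit = request = 1500m (the reported behavior):
print(schedulable_workspaces(1500))  # 5 workspaces per node
# With a modest 100m default request and the same 1500m limit:
print(schedulable_workspaces(100))   # 80 workspaces per node
```

Either way the limit caps each workspace at 1500m; only the on-paper reservation changes, which is exactly the under-utilization being described.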

IMO, it's a different paradigm from production workloads which may need a certain guarantee. Developer workloads are much more dynamic, and spend a lot more time idle.

ibuziuk commented 1 week ago

Well, regarding CPU limits in general, I advocate not setting them at all: if you set them, your workloads are throttled by definition.

CPU limits for soft-tenancy pods are probably not going to be helpful unless you're approaching very dense setups (more than 10 pods per core); otherwise, you will waste more CPU on throttling than you save. CPU limits definitely increase tail latencies for most non-predictable workloads (almost all request-driven use cases) in a way that results in a worse overall application environment for most users most of the time (because of how limits are sliced). At lower pod densities, you are almost certainly trading a false sense of security for a worse quality of service.

CPU limits are most useful when dealing with bad actors on your own platform, and even then, there are far more effective ways of dealing with them, such as detection and account blocking. I would advise against limits here, and instead suggest putting more effort into load monitoring, capacity planning, and reactive scheduling. Unless limits are set so high as to bypass all protection for normal use, you're doing your service users harm.
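
The throttling mechanism behind this argument can be sketched numerically. CPU limits are enforced via CFS bandwidth control: a 1000m limit becomes a runtime quota of 100ms per 100ms scheduling period, shared across all threads in the pod. The numbers below are illustrative.

```python
# Sketch of CFS bandwidth control: a cpuLimit of 1000m maps to a quota of
# 100ms of runtime per 100ms period, shared by every thread in the cgroup.
CFS_PERIOD_MS = 100  # default kernel cfs_period_us, in milliseconds

def throttled_ms_per_period(limit_millicores: int, busy_threads: int) -> float:
    """Time (ms) a fully busy multi-threaded pod sits throttled each period."""
    quota_ms = limit_millicores / 1000 * CFS_PERIOD_MS
    burn_time_ms = quota_ms / busy_threads  # quota is exhausted after this long
    return max(0.0, CFS_PERIOD_MS - burn_time_ms)

# A 4-thread build under a 1000m limit burns its quota in 25 ms,
# then stalls for the remaining 75 ms of every 100 ms period:
print(throttled_ms_per_period(1000, 4))  # 75.0
```

This is why a parallel build or test run under a tight limit feels far slower than "1 CPU" suggests, and why limits hurt latency for bursty, request-driven workloads.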

> it is very common for them to not grasp the consequences of their configuration.

Limits and requests for CPU are a subtle, contended topic: no matter what configuration engineers put in place, with or without CPU limits, they need to understand exactly what they are doing.

cgruver commented 1 week ago

I totally agree with you with respect to CPU limits... except for developer CDE workloads. The one reason I would advocate for limits is the noisy-neighbor effect.

My opinion in this matter changed because of experience with multi-threaded developer workloads that inadvertently saturated the CPU of a node, impacting the other workspaces on the node. Unlike tested application deployments, developer CDE workloads are very unpredictable.

It's definitely a nuanced topic.

I've seen it happen enough to believe that there's a reasonable approach to be considered.

This new capability may be sufficient - https://github.com/eclipse-che/che/issues/23176