RhodiumGroup / helm-chart

Helm chart for Rhodium's JupyterHub deployment
https://rhodiumgroup.github.io/helm-chart/

"large" clusters don't guarantee that you'll get a full node and could cause mem/CPU issues #12

Closed · bolliger32 closed this 4 years ago

bolliger32 commented 4 years ago

We bump the "limits" for large clusters, but we don't bump the "requests", so it's possible (likely?) that you'll be placed on a node with another user. If you're trying to use the full memory and the other user is using some of the memory they've been allocated, you're probably going to run into issues. You may not actually be able to use more than half the node's memory, since the other user will also have a request for half of it.

To illustrate this, I recently requested a "large" container and these are some of the diagnostics:

[Screenshot (2020-07-21): pod diagnostics showing that my requested amount is half my limit amount]

[Screenshot (2020-07-21): node view showing @delgadom and me placed on the same node, with a total CPU limit far greater than the node can handle]
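For context, the "large" profile effectively looks something like the sketch below in zero-to-jupyterhub-style values. This is not copied from our values.yaml; the keys (`mem_limit`, `cpu_limit`, etc.) are standard KubeSpawner overrides, but the profile name and numbers are illustrative:

```yaml
# Hypothetical sketch of the current behavior, not the actual values.yaml.
singleuser:
  profileList:
    - display_name: "large"
      kubespawner_override:
        mem_limit: "24G"   # ceiling only: the pod is killed if it goes above this
        cpu_limit: 7
        # mem_guarantee / cpu_guarantee (the k8s "requests") are not bumped,
        # so the scheduler can still pack two "large" users onto one node.
```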

brews commented 4 years ago

This is already closed but for what it's worth:

Unlike CPU, memory can't be throttled back once the system has allocated it (something has to be killed to get the memory back), so defining resource "requests" and "limits" in k8s has different implications depending on whether you're talking about memory or CPU.

In this case, if you want to be promised that amount of memory on your node when your job starts (assuming the node can meet the request), you need to "request" that memory. If you only set a memory "limit", you're just setting the ceiling at which your workload gets killed; it's not a guarantee of resource availability.
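A minimal sketch of the distinction in plain Kubernetes terms (the numbers are placeholders): the request is what the scheduler reserves on a node before placing the pod, the limit is only the enforcement threshold.

```yaml
# Illustrative pod resources, not taken from this chart.
resources:
  requests:
    memory: "24Gi"   # the scheduler reserves this much on the node before placing the pod
    cpu: "7"
  limits:
    memory: "24Gi"   # exceeding this gets the container OOM-killed
    cpu: "7"         # CPU usage beyond this is throttled, not killed
```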

bolliger32 commented 4 years ago

Agreed. I think #13, which adds requests, should get us what we're looking for: a node all to ourselves when we select a "large" container.
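Roughly, that amounts to something like the sketch below (not copied from #13; the guarantee keys are real KubeSpawner settings, but the numbers are placeholders): setting the guarantees close to a full node's allocatable memory/CPU so the scheduler can't co-locate two "large" users.

```yaml
# Hypothetical override for the "large" profile; see #13 for the actual change.
kubespawner_override:
  mem_limit: "24G"
  mem_guarantee: "24G"   # k8s request == limit, so most of a node is reserved
  cpu_limit: 7
  cpu_guarantee: 6       # leave a little CPU headroom for system daemons
```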

Do we want to deploy this change?