Closed SamuelLarkin closed 12 months ago
Work was performed last year to ensure that TrixieMain and Preemptible queues allocate a fraction of the full host memory on compute nodes by default. The setting in effect is:
DefMemPerNode=48195
The ITOps/RPS team decided to use a value of around 1/4 (one quarter) of total host RAM. This is not as small as the 6G requested, but it is a fair compromise between multiple users' jobs, some of which typically require more than 6GB by default and would otherwise see many crashes and memory allocation errors.
This setting should avoid the issue of a single job submission fully occupying a node purely because it lacks a memory (--mem) setting, while allowing enough flexibility to fit up to four jobs at once on a node, depending on the other resources requested (cpu, gpu) in each user's job file.
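As a rough illustration of the arithmetic behind that choice, the sketch below counts how many jobs fit on one node by memory alone. It assumes total node memory is exactly four times DefMemPerNode (the thread only says "around 1/4"), and jobs_by_memory is a hypothetical helper, not part of Slurm. Slurm interprets DefMemPerNode in megabytes, and --mem=6G as 6144 MB.

```shell
#!/usr/bin/env bash
# Value taken from the thread (Slurm expresses DefMemPerNode in MB).
def_mem_per_node=48195

# ASSUMPTION: node memory is exactly 4x the default; the real figure may differ.
assumed_node_mem=$(( def_mem_per_node * 4 ))

# Hypothetical helper: number of jobs that fit on a node by memory alone
# (integer division, so any remainder is discarded).
jobs_by_memory() {
  local node_mem_mb=$1 job_mem_mb=$2
  echo $(( node_mem_mb / job_mem_mb ))
}

# Jobs that rely on the default memory allocation.
jobs_by_memory "$assumed_node_mem" "$def_mem_per_node"

# Jobs that explicitly request --mem=6G (6144 MB).
jobs_by_memory "$assumed_node_mem" 6144
```

Memory is only one constraint: the CPUs and GPUs requested by each job limit how many jobs can actually share a node.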
Please resolve this ticket if it meets the requirement (or comment here otherwise).
I can guess the reason behind the 1/4 of the memory, but I think it should be up to the user to properly specify the correct amount of memory. Having 1/4 of the memory as the default is probably a safe bet to safeguard against users who don't specify their memory requirements.
I would like to change the default behavior on trixie for --mem to be something very small, like 6G, instead of all the node's memory. Having the default set to use all memory implies that we are requesting the node in an exclusive manner, aka --exclusive, which results in an awful waste of resources. Someone can request a single GPU or, even worse, a single CPU, and the end result is that the job will lock the full node plus its 4 GPUs and prevent other jobs from using the remaining resources of that node. If the default was to use far less memory, other jobs could run on that node.
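Whatever the default is, a job can avoid the problem described above by stating its memory requirement explicitly. The following is a minimal sketch of such a job script, not a recommended template: the job name and the final command are hypothetical placeholders, and the partition name is the one mentioned in this thread.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=small-job      # hypothetical name, for illustration only
#SBATCH --partition=TrixieMain    # queue named in this thread
#SBATCH --gres=gpu:1              # request a single GPU
#SBATCH --cpus-per-task=1         # request a single CPU
#SBATCH --mem=6G                  # explicit memory request instead of the default

# Inside a Slurm job, SLURM_MEM_PER_NODE reports the memory granted per node (in MB).
# Outside Slurm the variable is not set, so a fallback is printed instead.
echo "Memory allocated per node: ${SLURM_MEM_PER_NODE:-unset (not running under Slurm)}"
```

With a request like this, the remaining memory, CPUs and GPUs of the node stay available to other jobs.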