ai4d-iasc / trixie

Scripts and documentation about trixie hpc

Default --mem should NOT be full memory #85

Closed SamuelLarkin closed 12 months ago

SamuelLarkin commented 2 years ago

I would like to change the default value of --mem on trixie to something very small, like 6G, instead of all of the node's memory. Defaulting to all of the memory means that every job that omits --mem effectively requests its node exclusively (as if it had passed --exclusive), which is an awful waste of resources. Someone can request a single GPU, or even worse a single CPU, and the job will lock the full node and its 4 GPUs, preventing other jobs from using the node's remaining resources. If the default were far less memory, other jobs could run on that node.

nrcfieldsa commented 12 months ago

Work was performed last year to ensure that TrixieMain and Preemptible queues allocate a fraction of the full host memory on compute nodes by default. The setting in effect is:

DefMemPerNode=48195

The ITOps/RPS team decided to use a value of around 1/4 (one quarter) of total host RAM. This is not as small as the 6G requested, but it is a fair compromise between multiple users' jobs, some of which typically require more than 6GB and would otherwise see many crashes or memory allocation errors.

This setting should avoid the situation where a single job submission fully occupies a node purely because it lacks a memory (--mem) setting, while leaving enough flexibility to fit up to four jobs at once on a node, depending on the other resources (cpu, gpu) requested in each user's job file.
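To make the "up to four jobs" arithmetic concrete, here is a rough sketch that counts how many jobs fit on one node when memory is the only constraint. Slurm expresses DefMemPerNode in megabytes. The node total used below is an assumption (exactly four times the default, since the thread only says "around 1/4"), and the helper name jobs_by_memory is made up for illustration.

```shell
#!/bin/bash
# Default per-job memory from the setting quoted above (Slurm uses MB).
def_mem_mb=48195

# ASSUMPTION: schedulable node memory is exactly 4x the default.
# The real figure on trixie may differ.
node_mem_mb=$((def_mem_mb * 4))

# Hypothetical helper: how many jobs fit on a node by memory alone.
# Usage: jobs_by_memory <node_mem_mb> <per_job_mem_mb>
jobs_by_memory() {
    echo $(( $1 / $2 ))
}

# Jobs that rely on the default memory allocation.
default_fit=$(jobs_by_memory "$node_mem_mb" "$def_mem_mb")

# Jobs that explicitly ask for 6G (6 * 1024 MB), as proposed in this issue.
small_fit=$(jobs_by_memory "$node_mem_mb" $((6 * 1024)))

echo "Jobs per node at the default --mem: ${default_fit}"
echo "Jobs per node at --mem=6G: ${small_fit}"
```

Memory is only one of the limits. With 4 GPUs per node, GPU jobs would still be capped at four per node regardless of how little memory each one requests.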

Please resolve this ticket if it meets the requirement (or comment here otherwise).

SamuelLarkin commented 12 months ago

I can guess the reason behind the 1/4 of the memory, but I think it should be up to the user to specify the correct amount of memory. Having 1/4 of the memory as the default is probably a safe bet to safeguard against users who don't specify their memory requirements.
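For anyone landing on this issue, a job script that states its memory requirement explicitly might look like the sketch below. The partition name comes from this thread. The GPU request syntax, the time limit, and the 6G figure are assumptions for illustration, and trixie may require additional site-specific options (such as an account) that are not shown here.

```shell
#!/bin/bash
#SBATCH --partition=TrixieMain
#SBATCH --gres=gpu:1          # assumption: GPUs are requested via gres on this cluster
#SBATCH --cpus-per-task=1
#SBATCH --mem=6G              # explicit request instead of relying on DefMemPerNode
#SBATCH --time=00:10:00       # illustrative time limit

# When --mem is given, Slurm exports SLURM_MEM_PER_NODE (in MB) to the job.
# Outside of Slurm the variable is not set, so fall back to a placeholder.
mem_mb="${SLURM_MEM_PER_NODE:-unset}"
echo "Memory requested per node (MB): ${mem_mb}"
```

Submitting this with sbatch should leave the rest of the node's memory available to other jobs, subject to the CPU and GPU requests of those jobs.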