HARPgroup / cbp_wsm


slurm fails with sbatch: error: Memory specification can not be satisfied #82

Closed: rburghol closed this issue 1 year ago

rburghol commented 1 year ago

This occurs because of a change in `OneCommandWSM.csh` that added the switch `--mem-per-cpu=1000`. Since the default configuration registers our node with `RealMemory=1`, Slurm believes the node has only 1 MB of memory (`--mem-per-cpu` is given in MB), so the job will not run unless we either omit `--mem-per-cpu` or set it to `--mem-per-cpu=1` (`--mem-per-cpu=0` also works). I tried values of 500, 100, and 10, and they all failed. A minimal check appears after the output below.

Command:

```
sbatch --mem-per-cpu=1000 --nice=2000 --output=/opt/model/p6/vadeq/tmp/rob-slurm/vadeq_2021_2022-09-01-17-05-32.out --error=/opt/model/p6/vadeq/tmp/rob-slurm/vadeq_2021_2022-09-01-17-05-32.out --job-name=vadeq_2021 --dependency=singleton --nodes=1 --ntasks=1 --mail-type=FAIL --mail-user=rburghol@vt.edu bhatt_one_command_wsm.csh vadeq_2021 /opt/model/p6/vadeq/tmp/rob-slurm/vadeq_2021_2022-09-01-17-05-32.out /opt/model/p6/vadeq/tmp/rob-logs/vadeq_2021_2022-09-01-17-05-32.log
```

Output:

```
sbatch: error: NodeNames=deq2 Sockets=0 is invalid, reset to 1
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
```
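To isolate the memory switch from the rest of the submission, a minimal sketch (the `--wrap="hostname"` payload is just a placeholder job, not part of the model run) reproduces the same pass/fail pattern:

```
# Fails with "Memory specification can not be satisfied",
# as do 500, 100, and 10:
sbatch --mem-per-cpu=1000 --wrap="hostname"

# These submit successfully:
sbatch --mem-per-cpu=1 --wrap="hostname"
sbatch --mem-per-cpu=0 --wrap="hostname"
sbatch --wrap="hostname"
```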

Node info (`scontrol show nodes`):

```
scontrol: error: NodeNames=deq2 Sockets=0 is invalid, reset to 1
NodeName=deq2 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=deq2 NodeHostName=deq2 Version=19.05.5
   OS=Linux 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=1266 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-08-06T18:33:34 SlurmdStartTime=2022-08-06T18:33:56
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
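The `scontrol` output points at the underlying misconfiguration: the node is registered with `RealMemory=1` (1 MB) and `Sockets=0`, so any request above 1 MB per CPU is unsatisfiable. A longer-term fix is to declare the real hardware in `slurm.conf` rather than shrinking the request; the values below are placeholders, and `slurmd -C` on the node prints the detected ones to copy in:

```
# On deq2, print the hardware layout slurmd detects:
slurmd -C
# e.g. NodeName=deq2 CPUs=16 Boards=1 SocketsPerBoard=1 \
#      CoresPerSocket=8 ThreadsPerCore=2 RealMemory=16000

# In slurm.conf, update the deq2 definition with those values
# (RealMemory is in MB; 16000 here is an assumed example):
NodeName=deq2 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=16000 State=UNKNOWN

# Node definition changes generally require restarting the daemons,
# not just scontrol reconfigure:
systemctl restart slurmctld    # on the controller
systemctl restart slurmd       # on deq2
```

After that, `scontrol show nodes` should report the node's real memory, and `--mem-per-cpu=1000` should be satisfiable without editing `OneCommandWSM.csh`.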