https://bugzilla.csc.warwick.ac.uk/bugzilla/show_bug.cgi?id=13092
--- Comment #1 from Matthew Ismail <ccsyab@warwick.ac.uk> ---
Hi Rudo,
There were a couple of things going on here. We had lost a default memory setting, which meant anything submitted without a memory request (a mistake in the first place) was grabbing too much RAM.
For your jobs, you have:
#SBATCH --ntask=1
but "ntask" isn't a valid resource request: it should be "ntasks". sbatch warns you about this (but accepts the job), which you might not have noticed if the job submission was automated. You're also asking for 16 CPUs per task, which needs all the cores within a node (i.e. simply counting free cores isn't enough when working out whether your job should be running).
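A minimal sketch of the corrected header, assuming your nodes have 16 cores (the --cpus-per-task value comes from your 16-CPU requirement; the program name is a placeholder):

```shell
#!/bin/bash
# "ntasks" (plural), not "ntask" -- one task, with all 16 CPUs
# on the node attached to it.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Placeholder for your actual executable; it then has 16 cores
# on a single node available to it.
srun ./my_program
```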
You're also requesting multiple cores, but with a total rather than per-core memory request:
#SBATCH --mem=2000
If you wanted 2 GB (2000 MB) per core, then:
#SBATCH --mem-per-cpu=2000
is what you need.
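Putting the two fixes together, a sketch of the corrected script header (values illustrative; adjust to what your program actually needs):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
# 2000 MB per CPU: with 16 CPUs this reserves roughly 32 GB on the
# node, rather than the 2000 MB total that --mem=2000 would give.
#SBATCH --mem-per-cpu=2000
```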
Anyway, the system default memory setting has been restored for new job submissions, so I'll resolve this ticket now. If you think you need to re-open it, please make sure you've first corrected your own batch scripts and that the issue still affects new job submissions once currently running jobs have come off (i.e. wait a day or so).
Thanks, Matt