kryczko closed 3 years ago
The maximum node size on partition TrixieMain is the largest number of compute nodes that can be allocated to a single job submitted to the SLURM queue.
A recent change now enforces a limit of 15 nodes per job. This is approximately half of the nodes available in this queue, which ensures that the Preemptible queue can run jobs even when long-running jobs occupy the TrixieMain queue.
If a particular job requires more than 15 nodes, please schedule it in the Preemptible queue or contact Danny (dot) Damours (at) nrc (dot) ca to discuss.
You may need to change your code so that it checkpoints (if it does not already).
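For anyone unsure what checkpointing involves, here is a minimal Python sketch, not a Trixie-specific recipe. It assumes the job's progress fits in a small JSON-serializable dictionary; the file name `checkpoint.json`, the helper names, and the `stop_after` parameter (which only simulates an interruption) are all hypothetical. A real training or inference job would save model and optimizer state with its framework's own tools instead.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in a single step on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(total_steps, path=CHECKPOINT_PATH, checkpoint_every=10, stop_after=None):
    """Do toy work (summing step indices), resuming from the last checkpoint."""
    state = load_checkpoint(path)
    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            # Simulated interruption: exit without saving, as a killed job would.
            return state
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        steps_done += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final state once all steps are complete
    return state


if __name__ == "__main__":
    print(run(100))
```

If the job is interrupted and resubmitted, the second run picks up from the most recent saved step, so at most `checkpoint_every` steps of work are repeated.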
Right, okay. I normally don't train across that many nodes, because models tend to train poorly with larger batch sizes (even with adaptive learning rates for layers). I do run inference though, and if my system is large enough I could use more than 15 nodes, but I would not require that much time.
What is the maximum node size when running jobs on TrixieMain and why is it being limited?