Hello Gregory,
Thanks for reporting this. This is actually due to the PBS select line, specifically because you are only requesting 5 CPUs. Under these circumstances, PBS creates a Linux cgroup with only 5 CPUs, all on the first socket. The mpibind script tries to bind processes across both sockets to give your job full memory bandwidth; however, core numbers greater than 4 won't exist in the PBS cgroup, hence this failure. To get your case running immediately, try rerunning with 128 CPUs and 5 MPI ranks in the select line, e.g. with something similar to:
#PBS -l select=1:ncpus=128:mpiprocs=5:ompthreads=1:mem=230GB
In general, regardless of how many CPUs you intend to use, you should always request all 128 on a Derecho node so that you have access to full memory performance.
On the mpibind side, I'll add some code to catch this type of request and exit gracefully with a more meaningful error message.
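For context, a guard of that sort can be quite small. The sketch below is not the actual mpibind source; it assumes bash and a hypothetical round-robin binding plan of five ranks across two 64-core sockets, and relies on the fact that `taskset -c` exits nonzero when the target core is outside the calling process's allowed set:

```bash
#!/bin/bash
# Hypothetical sketch of the proposed mpibind guard, not the real script.
# Binding plan: 5 ranks round-robined across two 64-core sockets (assumed).
plan="0 64 1 65 2"

for core in $plan; do
    # taskset fails if $core is outside this job's cgroup cpuset,
    # which is exactly what a 5-CPU select line produces.
    if ! taskset -c "$core" true 2>/dev/null; then
        echo "mpibind: core $core is not in this job's PBS cgroup." >&2
        echo "mpibind: request the whole node, e.g." >&2
        echo "  #PBS -l select=1:ncpus=128:mpiprocs=5:ompthreads=1" >&2
        exit 1
    fi
done
```

With only 5 CPUs in the cgroup, a check like this would trip on core 64 and print the corrected select line up front instead of failing mid-bind.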
Thanks again for the report.
Thanks for the detailed explanation @roryck.
@jedwards4b, should I add this as an issue to ccs_config_cesm for an update to config_batch.xml?
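For reference, the change might amount to something like the entry below. This is only a sketch based on the general shape of CIME batch_system entries; the exact Derecho entry in ccs_config_cesm may differ, and hardcoding ncpus=128 is an assumption about the intended fix:

```xml
<!-- Hypothetical sketch, not the actual ccs_config_cesm contents:
     always request all 128 cores per Derecho node so the PBS cgroup
     spans both sockets, independent of the MPI task count. -->
<batch_system MACH="derecho" type="pbs">
  <directives>
    <directive>-l select={{ num_nodes }}:ncpus=128:mpiprocs={{ tasks_per_node }}:ompthreads={{ thread_count }}</directive>
  </directives>
</batch_system>
```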
Resolved with a modified PBS select line.
This issue was discovered after ctsm updated the ccs_config_cesm version to ccs_config_cesm0.0.92 (https://github.com/ESCOMP/CTSM/pull/2416). Since then, ctsm test cases using the 5x5_amazon resolution are failing to run with the error shown in the attached cesm.log and mpibind.log.