LLNL / ATS

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.
BSD 3-Clause "New" or "Revised" License

RZNevada -- concurrent job runs fail #58

Closed dawson6 closed 2 years ago

dawson6 commented 3 years ago

Version 7.0.5 of ATS uses Slurm options to run concurrent jobs. This works on alastor, genie, etc.

On rznevada this fails. While ATS can run jobs one after another (using the --sequential command line option), when two or more jobs are started concurrently, the jobs fail with

srun --exclusive --mpibind=off --distribution=block --nodes=1-2 --cpus-per-task=1 --ntasks=2

0: Fri Jul 23 10:59:54 2021: [PE_0]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use
0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1
1: Fri Jul 23 10:59:54 2021: [PE_1]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use
1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_inet_listen_socket_setup:socket setup failed
1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1
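For triage, the bind failures in output like the above can be pulled out mechanically. The following is only a sketch: the function names `find_bind_failures` and `ranks_by_port` are hypothetical helpers, not part of ATS, and the regular expression assumes the message layout seen in this one log.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the log in this issue:
#   "[PE_<rank>]:...: bind failed port <port> ..."
# "[^\[]*?" keeps a match from running into the next "[PE_n]" entry,
# so this also works if several messages sit on one physical line.
BIND_FAILURE = re.compile(r"\[PE_(\d+)\]:[^\[]*?bind failed port (\d+)")


def find_bind_failures(log_text):
    """Return a list of (rank, port) pairs for every 'bind failed' message."""
    return [(int(rank), int(port)) for rank, port in BIND_FAILURE.findall(log_text)]


def ranks_by_port(failures):
    """Group the failing ranks by the port they could not bind."""
    grouped = defaultdict(list)
    for rank, port in failures:
        grouped[port].append(rank)
    return dict(grouped)
```

Applied to the log in this issue, both ranks report the same port (1371), which matches the "Address already in use" text.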

dawson6 commented 3 years ago

OK, I was able to reproduce this outside of ATS with my 'mpi' test app. I allocated 2 nodes (62 CPUs per node) and hit the same issue when I ran:

srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out job1 &
srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out job2 &
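The same reproduction could be scripted from Python, which may be convenient for re-checking after a system update. This is a rough sketch, not how ATS launches jobs: `srun_command`, `run_concurrently`, and `hit_port_conflict` are hypothetical names, the srun flags are copied from the commands in this comment, and `./a.out` stands for the test app.

```python
import subprocess


def srun_command(tag, ntasks=32):
    """Build the srun command line from this issue for one job (tag = job1, job2, ...)."""
    return [
        "srun", "--exclusive", "--mpibind=off", "--nodes=1-2",
        f"--ntasks={ntasks}", "--cpus-per-task=1", "./a.out", tag,
    ]


def run_concurrently(commands):
    """Start every command before waiting on any, like `cmd1 & cmd2 &` in a shell.

    Returns one (returncode, combined stdout/stderr) pair per command.
    """
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        for cmd in commands
    ]
    results = []
    for proc in procs:
        output, _ = proc.communicate()
        results.append((proc.returncode, output))
    return results


def hit_port_conflict(output):
    """True if the output contains the bind error reported in this issue."""
    return "Address already in use" in output
```

Inside a two-node allocation, something like `run_concurrently([srun_command("job1"), srun_command("job2")])` followed by `hit_port_conflict` on each output would show whether the failure still occurs.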

dawson6 commented 3 years ago

If I leave off the --exclusive option, this does run, but the jobs queue up and effectively run sequentially.