Since most problems happen in the communication between the head node and the compute fleet, I should mention that we use 3 targeted capacity reservations for p4ds. They are all in us-east-1d, but since they are separate, I wonder whether combining them could be a cause of network congestion. Maybe we need to use separate clusters, one for each capacity reservation?
Hi @rvencu
I don't think the problem is due to the use of different subnets; it is probably related to the instance type you are using for the head node.
Can you share your config? I'd like to see which instance type you're using for the head node and which shared storage you have configured.
I'd suggest starting with the Best Practices; the selection of the instance type really depends on the applications/jobs you're submitting.
Do you have EFA enabled for the compute instances? This is important to minimize latencies between instances. I'd suggest using a c5n.18xlarge for the head node (you can compare Network bandwidth here).
You can also check for dropped packets by executing ethtool -S eth0 | grep exceeded on the head node.
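For reference, a minimal way to keep an eye on those counters over time (assuming the primary interface is eth0, as on most head nodes) could be a small loop like this:
while true; do
  date
  ethtool -S eth0 | grep allowance_exceeded
  sleep 10
done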
Then, if your application performs continuous I/O in the shared folders, I'd suggest using FSx storage rather than EBS, because EBS volumes are shared through NFS from the head node.
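As a quick way to confirm which filesystem a given directory actually lives on before launching jobs from it, something like the following can help (a sketch; the expected types are based on the usual ParallelCluster layout, where /home is NFS-exported from the head node and /fsx is Lustre):
df -hT /home /fsx
# on compute nodes /home typically shows up as nfs4, while /fsx shows up as lustre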
Hi. Config here https://github.com/rvencu/1click-hpc/blob/main/parallelcluster/config.us-east-1.sample.yaml
We are using FSx for Lustre, but it is true that some people might launch their workloads from /home, which is the EBS volume
Headnode and compute are all in us-east-1d
We experience the problems in the compute-od-gpu partition.
ethtool -S eth0 | grep exceeded
bw_in_allowance_exceeded: 110555987
bw_out_allowance_exceeded: 488718524
pps_allowance_exceeded: 0
conntrack_allowance_exceeded: 0
linklocal_allowance_exceeded: 76303
Thanks for the configuration.
For compute-to-compute communication I see you're already using a Placement Group and you have EFA enabled, so this should be enough to get good performance.
However, the ethtool output shows that you're having networking issues and dropped packets on the head node, which means the head node is the bottleneck of your cluster.
Your instance type m6i.16xlarge has 25 Gbps of networking bandwidth. I'd suggest moving to an instance type with more bandwidth (e.g. m6i.32xlarge with 50 Gbps, or c5n.18xlarge with 100 Gbps).
I think this change, together with avoiding /home as the place to start the workloads from, will be enough to solve these issues.
Please note you can find some useful metrics about the Head Node usage in the CloudWatch Dashboard created within the cluster.
Enrico
Thank you for the suggestions. Right now I cannot deploy a new cluster because of the missing pip dependency at the installation of ParallelCluster 3.1.4, but as soon as this is sorted out I will deploy a new version with these improvements.
For the pip dependency issue I'd strongly suggest using a Python virtual environment for the installation; this way all the dependencies required by ParallelCluster stay isolated in that virtual environment and there are no conflicts with the Python packages already installed on the instances.
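A minimal sketch of that approach (the version pin is only an example):
python3 -m venv ~/pcluster-ve
source ~/pcluster-ve/bin/activate
pip install --upgrade pip
pip install "aws-parallelcluster==3.1.4"
pcluster version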
Anyway, we can keep that discussion open on the other thread.
I'm going to close this issue. Feel free to open a new one or add more comments if more info is needed.
Enrico
Bumped the head node to the max; hopefully there will be no more congestion. Will monitor that.
However, openmpi still has issues with compute-to-compute connections when I use more than 36 nodes or so, as in:
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: compute-od-gpu-dy-p4d-24xlarge-2
Remote host: compute-od-gpu-dy-p4d-24xlarge-25
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
The head node upgrade did not help with this.
intelmpi, however, seems ok
After switching to the 100 Gbps card and running the nccl-test script from /fsx, we still get this:
ethtool -S eth0 | grep exceeded
bw_in_allowance_exceeded: 0
bw_out_allowance_exceeded: 27219
pps_allowance_exceeded: 0
conntrack_allowance_exceeded: 0
linklocal_allowance_exceeded: 0
Hi @rvencu
if you're using FSx let me share some other FSx optimization best practices.
The bw_out_allowance_exceeded describes the number of packets queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html
Then, you said you're running NCCL tests; did you follow the official instructions?
@rvencu
Regarding your p4d usage, I am not sure that EFA is being used properly for the NCCL test.
When running the NCCL test, Open MPI is used only as a launcher, i.e. the actual network traffic does not go through Open MPI; it goes through NCCL. If configured properly, NCCL will use the aws-ofi-nccl plugin, which uses libfabric's EFA provider for data transfer. If EFA is not used, NCCL will fall back to its sockets provider.
Can you share your command to run nccl-test and its output? (preferably with -x NCCL_DEBUG=info added to the mpirun command, for openmpi)
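For reference, a quick way to check whether EFA is visible and actually selected could look like this (a sketch: fi_info is assumed to be on the PATH from /opt/amazon/efa/bin, the output file name is just an example, and the exact NCCL log wording varies between versions):
# on a compute node: confirm libfabric can see the EFA devices
fi_info -p efa
# in the job output produced with NCCL_DEBUG=info, check whether the
# OFI/EFA path was selected rather than the socket fallback
grep -iE 'NET/OFI|NET/Socket|Selected Provider' nccl-tests_<jobid>.out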
Thanks. I am not so sure FSx is involved, because we do not even get to the point where the workload script is invoked; usually things break at the launcher stage, with the head node being unable to launch a large number of compute jobs. What I mean by that is:
I do not understand why the openmpi and srun approaches fail where intelmpi succeeds. I noticed intelmpi uses some kind of passwordless ssh to connect to the nodes. Maybe the former ones try some file sharing which congests the network. Also, making ~/.cache a symlink to an actual folder on /fsx seems to help.
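For reference, the ~/.cache relocation mentioned above can be done with something along these lines (a sketch; the /fsx/$USER layout is just an assumption):
mkdir -p /fsx/$USER/.cache
mv ~/.cache ~/.cache.bak 2>/dev/null   # keep any existing contents aside
ln -s /fsx/$USER/.cache ~/.cache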
> @rvencu
> Regarding your p4d usage, I am not sure that EFA is being used properly for the NCCL test.
> When running the NCCL test, Open MPI is used only as a launcher, i.e. the actual network traffic does not go through Open MPI; it goes through NCCL. If configured properly, NCCL will use the aws-ofi-nccl plugin, which uses libfabric's EFA provider for data transfer. If EFA is not used, NCCL will fall back to its sockets provider.
> Can you share your command to run nccl-test and its output? (preferably with -x NCCL_DEBUG=info added to the mpirun command, for openmpi)
we observed libfabric EFA loaded properly on nccl tests run with 4 nodes only. we get some 40GB/s average busbw
as an experiment I would try to use intelmpi instead of openmpi for nccl tests. but I could not make it work. I noticed nccl-tests were built with MPI support, not sure if they took something from openmpi as a dependency
this is how I install all nccl stack including tests https://github.com/rvencu/1click-hpc/blob/main/modules/45.install.nccl.compute.sh
> we observed libfabric EFA loaded properly on nccl tests run with 4 nodes only. we get some 40GB/s average busbw
I see. Thanks!
So the problem is that Open MPI failed to launch the job. When you use Open MPI's mpirun, were you specifying a hostfile or relying on slurm?
we rely on slurm indeed
we are investigating the slurm db connections, which seem to be in trouble. the rds instance we used is minimal and apparently the DB is not able to answer many queries at once when we try to launch massive workloads
> we are investigating the slurm db connections, which seem to be in trouble. the rds instance we used is minimal and apparently the DB is not able to answer many queries at once when we try to launch massive workloads
Thank you! I think that using openmpi's mpirun with the hostfile option would work around the issue.
we made a bigger db instance and this eliminated the db errors from the log, but nothing more than that. still having the same issues
can you recommend how to use the hostfile option in a parallelcluster context? we do not have compute nodes active, they get spun up every time and hostnames and IPs are unknown at launch time
we get these kinds of errors now
[2022-07-11T19:54:58.512] error: slurm_set_addr: Unable to resolve "compute-od-gpu-dy-p4d-24xlarge-54"
[2022-07-11T19:54:58.512] error: fwd_tree_thread: can't find address for host compute-od-gpu-dy-p4d-24xlarge-54, check slurm.conf
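A hedged first check for errors like these is whether the node name resolves at all from the head node and what address Slurm has recorded for it (the node name below is just the one from the log):
getent hosts compute-od-gpu-dy-p4d-24xlarge-54
scontrol show node compute-od-gpu-dy-p4d-24xlarge-54 | grep -E 'NodeAddr|NodeHostName|State'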
We started a more in-depth debugging process and got some partial success by tuning some timeouts in slurm.conf; ifconfig eth0 txqueuelen 4096 also seemed to help. We did not get to see how far we can scale now, but at least we broke the previous threshold.
Thanks @rvencu for sharing all your improvements with the community.
According to OpenMPI documentation:
Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Slurm directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Source: https://www.open-mpi.org/faq/?category=slurm
Anyway, if you want to give it a try, you can generate the hostfile with a submission script like:
#!/bin/bash
#SBATCH ...
module load openmpi
scontrol show hostnames $SLURM_NODELIST > hostlist
mpirun --hostfile hostlist ...
note that $SLURM_NODELIST
is only available within the environment of the batch submission script, when the nodes are already allocated to the job:
https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
So the main change in slurm.conf was this:
MessageTimeout=180
To summarize all 3 changes that improved our installation:
ifconfig eth0 txqueuelen 4096
we will benchmark how far we can scale up at a later time, when more p4d nodes are available
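Regarding the ifconfig eth0 txqueuelen 4096 change above, a sketch of applying it at boot on every node (for example from a post-install script; eth0 and the value are assumptions) could be:
# raise the transmit queue length of the primary interface
sudo ip link set dev eth0 txqueuelen 4096
# verify the new value
ip link show dev eth0 | grep -o 'qlen [0-9]*'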
we noticed a great reduction of head node congestion, but now the larger the scale, the more compute-to-compute connections fail. I believe that in the phase where the nodes become ready to launch the job, they are losing IP connectivity to one another just to say: hey, I am alive. I do not believe EFA is even triggered at this stage
also, slurm messaging and openmpi messaging are more impacted than intelmpi messaging
we tried some timing debugging and we see the rate at which nodes become available is slow, too slow to feel normal
@rvencu could you share how you are submitting the job (mpirun vs srun vs sbatch) and possibly the submission script? I'm asking to understand if there are other parameters we can suggest to tune in your environment (e.g. specifying some OMPI_* env variable).
When you're talking about the database, do you mean you have Slurm accounting enabled? Can you share the details of the issues related to the database when you were using a small db size?
Could you also check if there are NFS-related errors in /var/log/messages or by running a tool like nfsstat?
It might be useful to check VPC Flow Logs to verify REJECTED messages.
Then, it's good to double check that you're following all the best practices in the Slurm High Throughput official doc.
Enrico
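As a side note, the values currently in effect for some of the settings that the high-throughput guide discusses can be read back from the controller, for example:
scontrol show config | grep -E 'MessageTimeout|TreeWidth|MaxJobCount|SchedulerParameters'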
thanks. here are my nccl tests that fail if there are too many nodes
#!/bin/bash
#SBATCH --partition=compute-od-gpu
#SBATCH --job-name=nccl-tests
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH --output=%x_%j.out
module load openmpi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nccl/build/lib:/opt/aws-ofi-nccl-install/lib
export NCCL_PROTO=simple
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/aws-ofi-nccl/lib
export PATH=$PATH:/opt/amazon/efa/bin:/opt/amazon/openmpi/bin
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
export NCCL_DEBUG=info
export OMPI_MCA_mtl_base_verbose=1
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_TREE_THRESHOLD=0
mpirun -n 160 -N 8 --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker1 --bind-to none \
/opt/nccl-tests/build/all_reduce_perf -b 128M -e 8G -f 2 -g 1 -c 1 -n 20
this uses openmpi mpirun, obviously. with intelmpi we just load the intelmpi module and source the env as instructed; everything else stays the same
we submit with sbatch <above file>
Yes, we have MySQL on an RDS instance and want to use slurm accounting, but right now nothing is configured except slurmdbd running as a service. with a small DB we encountered errors consistent with the database not being able to cope with the number of requests. though we had some bug in the slurmdbd service definition that also made the process die, so I would not try to debug this issue just yet.
cat /var/log/messages | grep NFS
returns an empty response
tonight we get 200 more p4d instances into a new ODCR. We will restart scaling tests since we have this capacity. I just activated the VPC flow logs for the entire VPC but filtered to REJECTED
We read the high throughput official doc from slurm but of course we will double check
Will try to capture and bring more logs from upcoming tests
The following is a gist making an example of failing python script using pytorch ddp https://gist.github.com/rom1504/8da2e461e7416537c03170138e21f995
we had a controller meltdown last night without traces in the log file. restarting the slurmctld service has restored the controller.
the reason I am mentioning this here is that I saw this line in the log when I restarted the service:
error: MessageTimeout is too high for effective fault-tolerance
perhaps we will need to revisit the setting we modified above and solve the root cause, not the symptom...
we have rejected connections on the flow logs like this
2 8.42865E+11 eni-07a4f3018b5329a07 172.31.18.55 172.31.41.163 3306 46614 6 1 40 1658386089 1658386121 REJECT OK
2 8.42865E+11 eni-07a4f3018b5329a07 172.31.18.55 172.31.41.163 3306 46640 6 1 40 1658386089 1658386121 REJECT OK
2 8.42865E+11 eni-07a4f3018b5329a07 172.31.18.55 172.31.41.163 3306 46612 6 1 40 1658386089 1658386121 REJECT OK
2 8.42865E+11 eni-07a4f3018b5329a07 172.31.18.55 172.31.41.163 3306 46892 6 1 40 1658386150 1658386181 REJECT OK
2 8.42865E+11 eni-0944da571db9d820b 172.31.32.220 172.31.45.154 3306 47276 6 1 40 1658386087 1658386119 REJECT OK
2 8.42865E+11 eni-0944da571db9d820b 172.31.32.220 172.31.45.154 3306 47164 6 1 40 1658386087 1658386119 REJECT OK
2 8.42865E+11 eni-0944da571db9d820b 172.31.32.220 172.31.45.154 3306 47162 6 1 40 1658386087 1658386119 REJECT OK
2 8.42865E+11 eni-0944da571db9d820b 172.31.32.220 172.31.45.154 3306 48688 6 2 80 1658386147 1658386179 REJECT OK
I filtered source and destination to be within our VPC CIDR range. I am not sure if these are related to failed jobs; I think I need to log the node IPs and then search the logs again
I tracked down 2 nodes that could not communicate and there were no rejected attempts in the flow logs
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: compute-od-gpu-dy-p4d-24xlarge-6
Remote host: compute-od-gpu-dy-p4d-24xlarge-33
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
^C
[ec2-user@ip-172-31-45-154 shared]$ ping compute-od-gpu-dy-p4d-24xlarge-6
PING compute-od-gpu-dy-p4d-24xlarge-6.hpc-1click-production4.pcluster (172.31.233.150) 56(84) bytes of data.
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=1 ttl=255 time=0.216 ms
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=2 ttl=255 time=0.180 ms
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=3 ttl=255 time=0.173 ms
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=4 ttl=255 time=0.184 ms
^C
--- compute-od-gpu-dy-p4d-24xlarge-6.hpc-1click-production4.pcluster ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3052ms
rtt min/avg/max/mdev = 0.173/0.188/0.216/0.019 ms
[ec2-user@ip-172-31-45-154 shared]$ ping compute-od-gpu-dy-p4d-24xlarge-33
PING compute-od-gpu-dy-p4d-24xlarge-33.hpc-1click-production4.pcluster (172.31.238.0) 56(84) bytes of data.
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=1 ttl=255 time=0.215 ms
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=2 ttl=255 time=0.186 ms
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=3 ttl=255 time=0.184 ms
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=4 ttl=255 time=0.178 ms
172.31.235.211
172.31.233.150
172.31.232.204
172.31.231.18
172.31.230.184
172.31.238.0
172.31.233.189
172.31.232.55
I searched for all sources and destinations like 172.31.23?.??? and there were no records
this leads me to think that what is really happening is that node 6 came alive earlier, found the list of all upcoming other nodes and tried to connect to them; node 33 came up too late, after the connection timeout
we need to understand why the nodes come up so slowly one after another. might it be the post-install script taking too long (though I do not see why the network should not work during that time)? or is simply spinning up many p4d nodes at once forming some kind of slow queue in AWS virtualization? or is it an artefact of the ODCR mechanism...
Spun up a cluster with 160 p4d nodes permanently up and retried nccl-tests. Same issue: with more than 36 nodes the tests fail the same way. So the time to bring up the nodes and run the post-install scripts does not matter.
Hi @rvencu ,
I just want to get some clarity on your latest test:
> Spun up a cluster with 160 p4d nodes permanently up and retried nccl-tests.
Did all of the compute nodes successfully start? Was this without the post install script?
I spun up 160 nodes (down from the 174 presented as available, to avoid last-minute insufficient capacity errors). They are with post-install scripts and all. I guess the only script that is not run is the slurm prolog
I added some custom things in that one, but the problem manifests the same as before. the custom things were done to tame cuda in the containers
https://github.com/rvencu/1click-hpc/blob/main/scripts/prolog.sh
At least this is my current understanding: when launched, the compute nodes execute the post-install scripts automatically, while the prolog is executed before a job is started
Actually, the only thing that seems to run better (which does not mean perfectly) out of all our tests is:
@rvencu -- did you try specifying the hostfile directly with openmpi (see: https://github.com/aws/aws-parallelcluster/issues/4179#issuecomment-1181565998)? Curious if that changes the behavior at all?
this is the result with the explicit hostfile; slightly different, but essentially the same (maybe it is a head node to compute problem)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
well, going back to square 1 and trying super basic things on 42 nodes
debug.sh
START_TIME=$JOBSTART
echo "$(($(date +%s) - $START_TIME)) : $(hostname)"
test.sh
#!/bin/bash
#SBATCH --partition=compute-od-gpu
#SBATCH --job-name=test
#SBATCH --nodes=42
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH --output=%x_%j.out
module load openmpi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nccl/build/lib:/opt/aws-ofi-nccl-install/lib
export NCCL_PROTO=simple
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/aws-ofi-nccl/lib
export PATH=$PATH:/opt/amazon/efa/bin:/opt/amazon/openmpi/bin
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
export NCCL_DEBUG=info
export OMPI_MCA_mtl_base_verbose=1
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_TREE_THRESHOLD=0
export JOBSTART=$(date +%s)
srun sh debug.sh
result:
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
srun: error: slurm_receive_msgs: [[ip-172-31-224-213.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-237-223.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-222.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-160.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-227-160.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-193.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-238-216.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-196.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-226-192.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-230-255.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-238-197.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-224-203.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-234-202.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-236-235.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-38.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-82.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-230-207.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-231-204.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-227-193.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-232-219.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-233-200.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-31.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-208.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-84.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-226-249.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-224-83.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-237.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-234-84.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-39.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-225-245.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-3: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-5: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-6: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-7: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-8: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-9: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-11: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-17: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-13: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-14: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-15: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-16: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-18: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-19: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-20: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-42: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-21: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-24: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-27: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-26: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-28: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-29: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-40: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-31: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-32: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-33: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-38: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-35: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-39: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-41: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
srun: error: Timed out waiting for job step to complete
it seems that even 36 nodes barely work
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
srun: error: slurm_receive_msgs: [[ip-172-31-226-213.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-222.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-237-223.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-226-192.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-236-235.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-34.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-4: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-6: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-5: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-13: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-19: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-23: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-od-gpu-st-p4d-24xlarge-6: tasks 40-47: Killed
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
Hi @rvencu -- this is valuable new information, are the new tests done with a post-install script as well?
Out of curiosity what is the OS you are using?
Can you also paste the full cluster config and link/paste post-install script if you're using one?
It would be great to see the other side of this bisection effort (e.g. a case where the nodes boot successfully). If you haven't already, can you remove the prolog and post-install to see if the nodes are able to start without any additional customizations?
Before attempting the above, here is an incomplete result of manual experiments with various setups: native openmpi vs openmpi with srun vs native intelmpi vs intelmpi with srun
This explains why the native intelmpi run (I included -bootstrap slurm though) can scale up much more than the rest
I did this with all post-install scripts enabled, indeed. The nodes were already running from the AWS point of view, so this is time purely taken by slurm/mpi to set up the tasks
The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.
> The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.
We clearly need to learn how to design the cascading scalability and attempt that... Do you have any documentation suggestions?
> > The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.
> We clearly need to learn how to design the cascading scalability and attempt that... Do you have any documentation suggestions?
It's kind of difficult to find a clear document on this. (1) I'd start with https://www.open-mpi.org/faq/?category=large-clusters. (2) Try to use Open MPI's built-in scheduler integration for launching tasks (the --mca parameter).
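A hedged example of point (2) applied to the nccl-tests run above, forcing Open MPI's Slurm launcher component so the daemons are started through srun rather than via ssh/tree spawn (under a Slurm allocation this component is often selected automatically; all other parameters are unchanged from the earlier script):
mpirun --mca plm slurm -n 160 -N 8 --mca pml ^cm --mca btl tcp,self \
       --mca btl_tcp_if_exclude lo,docker1 --bind-to none \
       /opt/nccl-tests/build/all_reduce_perf -b 128M -e 8G -f 2 -g 1 -c 1 -n 20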
@rvencu -- I am curious also if it is related to this issue: https://github.com/open-mpi/ompi/issues/4578 and perhaps adding -mca plm_rsh_no_tree_spawn 1
to mpirun would help?
> It would be great to see the other side of this bisection effort (e.g. a case where the nodes boot successfully). If you haven't already, can you remove the prolog and post-install to see if the nodes are able to start without any additional customizations?
I am preparing to do so. In the meantime, slurm support suggested this:
The prolog.sh attached is run on every step which could slow startup. If you want your job to start and not wait for the prolog, set this in slurm.conf:
> PrologFlags=NoHold
However, looking at the prolog.sh, it appears to be a job script, not a per-step script. It is also doing RPCs to the controller which is explicitly ill-advised in a prolog/epilog script.
I suggest swapping:
> Prolog=/opt/slurm/etc/prolog.sh
to
> PrologSlurmctld=/opt/slurm/etc/prolog.sh
in the slurm.conf and restarting all daemons.
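For reference, a rough sketch of applying that swap on the head node (paths assume the default ParallelCluster Slurm install under /opt/slurm):
sudo sed -i 's|^Prolog=/opt/slurm/etc/prolog.sh|PrologSlurmctld=/opt/slurm/etc/prolog.sh|' /opt/slurm/etc/slurm.conf
sudo systemctl restart slurmctld
# then restart slurmd on the compute nodes as well, as suggested above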
Will update here when I have results
As a side note, running so many test jobs with a large number of nodes and watching squeue, I noticed that releasing the nodes is also a long process; the nodes are released slowly, with similar timing to what we got at start
And there is no epilog script in the configuration. Thought this is worth mentioning
Yes -- I also note that the prolog script makes calls to the AWS API, which risks throttling at higher scale, which is why I wanted to dig into how much of an effect that might be causing versus potentially other issues.
ok, getting the prolog out of the config makes the launch timing fall to 1 second for 40 nodes, so yes, this seems to be the root cause
I will now introduce the things that seem safe and time them again to detect where the bottleneck is
@rvencu -- excellent!
Are you able to scale to all 200 nodes without the prolog?
We do not have 200 nodes, but I ran nccl tests with the 120 nodes I found available and they were successful.
The problem is in the prolog: the instance-tagging API is slow, so that needs to be moved out and run asynchronously.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 12.3392
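For what it's worth, one way to keep a slow tagging call out of the prolog's critical path is to fire it in the background and return immediately; a sketch (the tag key/value and the IMDSv1 instance-id lookup are illustrative, not taken from the actual prolog.sh):
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# fire-and-forget: do not block job launch on the EC2 tagging API
nohup aws ec2 create-tags --resources "$INSTANCE_ID" \
      --tags "Key=SlurmJobId,Value=${SLURM_JOB_ID}" >/dev/null 2>&1 &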
We started an HPC cluster and placed the HeadNode into a public subnet for easy access from the internet while the compute fleet sits on a private subnet.
While we had a limited number of p4d instances available, things were running normally, but we recently got a new capacity reservation and the node count went up to 194, allowing us to try large jobs with over 100 nodes / 800 GPUs with EFA enabled, these instances having 4x 100 Gbps network interfaces each (so > 40,000 Gbps aggregate per job).
We noticed jobs failure due to network errors both in headnode-computenode operations as well as compute-to-compute comms.
openmpi especially seems sensitive to this; it fails to launch batch jobs for more than 35 nodes. Intelmpi seems more resilient, as we were able to run 80-node jobs, but nothing above that. We also had an event where the slurm master lost connection to the entire fleet, sending all compute nodes down and rescheduling all jobs.
I want to get some network recommendations for this scenario. What I am thinking now is to move the head node into the same private subnet as the compute fleet (and provide SSH connectivity via tunneling).
Is there anything that we could do at the level of VPC/subnet configurations to ease up the congestion?