aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Pcluster 3.1.4 - network congestion for large scale jobs #4179

Closed rvencu closed 2 years ago

rvencu commented 2 years ago

We started an HPC cluster and placed the HeadNode into a public subnet for easy access from the internet while the compute fleet sits on a private subnet.

While we had only a limited number of p4d instances available things were running normally, but we recently got a new capacity reservation and the node count went up to 194, allowing us to try large jobs with over 100 nodes / 800 GPUs with EFA enabled. These instances have four 100 Gbps network interfaces each (so > 40000 Gbps per job).

We noticed job failures due to network errors, both in head node to compute node operations and in compute-to-compute communication.

Open MPI in particular seems sensitive to this; it fails to launch batch jobs on more than 35 nodes. Intel MPI seems more resilient, and we were able to run 80-node jobs but nothing above that. We also had an event where the Slurm controller lost connection to the entire fleet, taking all compute nodes down and rescheduling all jobs.

I want to get some network recommendations for this scenario. What I am thinking now is to move the head node into the same private subnet as the compute fleet (and provide SSH connectivity via tunneling).

Is there anything that we could do at the level of VPC/subnet configurations to ease up the congestion?

rvencu commented 2 years ago

Since most problems happen in head node to compute fleet communication, I should mention that we use 3 targeted capacity reservations for the p4ds. They are all in us-east-1d, but since they are separate I wonder whether combining them could be a cause of network congestion. Maybe we need to use separate clusters, one for each capacity reservation?

enrico-usai commented 2 years ago

Hi @rvencu

I don't think it's a problem due to the use of different subnets; it is probably related to the instance type you are using for the head node.

Can you share your config? I'd like to see which instance type you're using for the Head Node and the shared storage you have configured.

I'd suggest starting by reading Best Practices; the selection of the instance type really depends on the application/jobs you're submitting.

Do you have EFA enabled for the compute instances? This is important to minimize latencies between instances. I'd suggest using a c5n.18xlarge for the head node (you can compare Network bandwidth here).

You can also verify whether there are dropped packets by executing ethtool -S eth0 | grep exceeded on the head node.

Then, if your application performs continuous I/O in the shared folders, I'd suggest using FSx storage rather than EBS, because EBS volumes are shared via NFS from the head node.
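For example, a quick way to confirm where a job actually reads and writes (a minimal sketch, not ParallelCluster-specific):

# check whether the current working directory sits on the FSx (Lustre) mount or on the NFS-shared /home
df -hT "$PWD" | tail -n 1
# a filesystem type of "lustre" means FSx for Lustre; "nfs4" means the EBS-backed /home exported from the head node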

rvencu commented 2 years ago

Hi. Config here https://github.com/rvencu/1click-hpc/blob/main/parallelcluster/config.us-east-1.sample.yaml

We are using FSx for Lustre, but it is true that some people might launch their workloads from /home, which is on the EBS volume.

Headnode and compute are all in us-east-1d

We experience the problems in the compute-od-gpu partition.

rvencu commented 2 years ago

ethtool -S eth0 | grep exceeded
     bw_in_allowance_exceeded: 110555987
     bw_out_allowance_exceeded: 488718524
     pps_allowance_exceeded: 0
     conntrack_allowance_exceeded: 0
     linklocal_allowance_exceeded: 76303

enrico-usai commented 2 years ago

Thanks for the configuration.

For compute-to-compute communication I see you're already using a Placement Group and you have EFA enabled, so this should be enough to get good performance.

Anyway, the ethtool output shows you're having networking issues and dropped packets on the head node, which means this is the bottleneck of your cluster.

Your instance type m6i.16xlarge has 25 Gbps of network bandwidth. I'd suggest moving to another instance type with more bandwidth (e.g. m6i.32xlarge with 50 Gbps or c5n.18xlarge with 100 Gbps).

I think this change, together with avoiding /home to start the workloads, will be enough to solve these issues. Please note you can find some useful metrics about Head Node usage in the CloudWatch Dashboard created within the cluster.

Enrico

rvencu commented 2 years ago

Thank you for the suggestions. Right now I cannot deploy a new cluster because of the missing pip dependency at installation of ParallelCluster 3.1.4, but as soon as this is sorted out I will deploy a new version with these improvements.

enrico-usai commented 2 years ago

For the pip dependency issue I'd strongly suggest using a Python virtual environment for the installation; this way all the dependencies required by ParallelCluster stay isolated in that virtual environment and there are no conflicts with the Python packages already installed on the instances.
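A minimal sketch of that approach, pinned to the version mentioned above:

python3 -m venv ~/pcluster-venv
source ~/pcluster-venv/bin/activate
pip install --upgrade pip
pip install "aws-parallelcluster==3.1.4"   # dependencies stay isolated inside the venv
pcluster version                           # sanity check that the CLI works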

Anyway, we can keep the discussion about that on the other thread.

I'm going to close this issue. Feel free to open a new one or add more comments if more info is needed.

Enrico

rvencu commented 2 years ago

Bumped the head node to the max; hopefully there will be no more congestion. Will monitor that.

However, Open MPI still has issues in compute-to-compute connections when I use more than 36 nodes or so, as in:

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    compute-od-gpu-dy-p4d-24xlarge-2
  Remote host:   compute-od-gpu-dy-p4d-24xlarge-25
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------

The head node upgrade did not help with this.

Intel MPI, however, seems OK.

rvencu commented 2 years ago

Switched to the 100 Gbps card and, running the nccl-tests script from /fsx, we still get this:

ethtool -S eth0 | grep exceeded
     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 27219
     pps_allowance_exceeded: 0
     conntrack_allowance_exceeded: 0
     linklocal_allowance_exceeded: 0
enrico-usai commented 2 years ago

Hi @rvencu

if you're using FSx let me share some other FSx optimization best practices.

The bw_out_allowance_exceeded describes the number of packets queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html
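A simple way to keep an eye on those counters while a job is running (a sketch built on the same ethtool command used earlier in this thread):

# sample the ENA allowance counters on the head node every 10 seconds;
# a steadily increasing bw_out_allowance_exceeded means outbound traffic is being shaped
watch -n 10 'ethtool -S eth0 | grep allowance_exceeded'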

Then, you said you're running NCCL tests: did you follow the official instructions?

wzamazon commented 2 years ago

@rvencu

Regarding your p4d usage, I am not sure that EFA is being used properly for NCCL test.

When running the NCCL test, Open MPI is used only as a launcher, i.e. the actual network traffic does not go through Open MPI; it goes through NCCL. If configured properly, NCCL will use the aws-ofi-nccl plugin, which uses libfabric's EFA provider for data transfer. If EFA is not used, NCCL falls back to its socket transport.

Can you share your command to run nccl-tests and its output? (Preferably with -x NCCL_DEBUG=info added to the mpirun command, for Open MPI.)
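As a hedged sketch of how that check can look (the binary path is taken from the submission script shared later in this thread, and the grep patterns assume the usual NCCL/aws-ofi-nccl log prefixes):

# run a small nccl-tests job with NCCL debug logging and keep the output
mpirun -n 32 -N 8 -x NCCL_DEBUG=info -x FI_PROVIDER=efa \
    /opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 2>&1 | tee nccl_debug.log

# "NET/OFI" lines indicate the aws-ofi-nccl/EFA path; "NET/Socket" indicates the TCP fallback
grep -E "NET/(OFI|Socket)" nccl_debug.log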

rvencu commented 2 years ago

Thanks. I am not so sure FSx is involved, because we do not even get to the point where the workload script is invoked. Usually things break at the launcher stage, with the head node unable to launch a large number of compute tasks. What I mean by that is:

  1. the nodes are properly launched by Slurm and the post-install scripts are executed OK on all of them
  2. the launcher (be it Open MPI or srun) wants to instantiate the running scripts on each node using the rank and other parameters needed for distributed training; this is the step where things get out of hand and fail. Using Intel MPI instead, we do not observe this behaviour.
  3. for a low number of nodes, say 20, we get past this stage and the actual script (Python) is launched and executed

I do not understand why the Open MPI and srun approaches fail where Intel MPI succeeds. I noticed Intel MPI uses some kind of passwordless SSH to connect to the nodes; maybe the former try some file sharing that congests the network. Also, making ~/.cache a symlink to an actual folder on /fsx seems to help.
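For completeness, the ~/.cache relocation mentioned above can be done with something like this (a sketch; the /fsx directory layout is an assumption):

# move the per-user cache off the NFS-shared home and onto the Lustre mount
mkdir -p /fsx/$USER/.cache
rm -rf ~/.cache
ln -s /fsx/$USER/.cache ~/.cache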

rvencu commented 2 years ago

> @rvencu
>
> Regarding your p4d usage, I am not sure that EFA is being used properly for the NCCL test.
>
> When running the NCCL test, Open MPI is used only as a launcher, i.e. the actual network traffic does not go through Open MPI; it goes through NCCL. If configured properly, NCCL will use the aws-ofi-nccl plugin, which uses libfabric's EFA provider for data transfer. If EFA is not used, NCCL falls back to its socket transport.
>
> Can you share your command to run nccl-tests and its output? (Preferably with -x NCCL_DEBUG=info added to the mpirun command, for Open MPI.)

We observed the libfabric EFA provider loading properly, but only on nccl-tests runs with 4 nodes; we get around 40 GB/s average busbw.

As an experiment I would try using Intel MPI instead of Open MPI for the nccl-tests, but I could not make it work. I noticed nccl-tests were built with MPI support; I am not sure whether they took something from Open MPI as a dependency.

This is how I install the whole NCCL stack, including the tests: https://github.com/rvencu/1click-hpc/blob/main/modules/45.install.nccl.compute.sh

wzamazon commented 2 years ago

> We observed the libfabric EFA provider loading properly, but only on nccl-tests runs with 4 nodes; we get around 40 GB/s average busbw.

I see. Thanks!

So the problem is that Open MPI failed to launch the job. When you use Open MPI's mpirun, were you specifying a hostfile or relying on Slurm?

rvencu commented 2 years ago

we rely on slurm indeed

rvencu commented 2 years ago

We are investigating the Slurm DB connections, which seem to be in trouble. The RDS instance we used is minimal, and apparently the DB is not able to answer many queries at once when we try to launch massive workloads.

wzamazon commented 2 years ago

> We are investigating the Slurm DB connections, which seem to be in trouble. The RDS instance we used is minimal, and apparently the DB is not able to answer many queries at once when we try to launch massive workloads.

Thank you! I think that using Open MPI's mpirun with the hostfile option would work around the issue.

rvencu commented 2 years ago

We made a bigger DB instance and this eliminated the DB errors from the log, but not much more than that; we are still having the same issues.

Can you recommend how to use the hostfile option in the ParallelCluster context? We do not have compute nodes permanently active; they get spun up every time, and hostnames and IPs are unknown at launch time.

rvencu commented 2 years ago

we get these kind of errors now

[2022-07-11T19:54:58.512] error: slurm_set_addr: Unable to resolve "compute-od-gpu-dy-p4d-24xlarge-54"
[2022-07-11T19:54:58.512] error: fwd_tree_thread: can't find address for host compute-od-gpu-dy-p4d-24xlarge-54, check slurm.conf
rvencu commented 2 years ago

We started a more in-depth debugging process and got some partial success by tuning some timeouts in slurm.conf; ifconfig eth0 txqueuelen 4096 also seemed to help. We have not yet seen how far we can scale now, but at least we broke the previous threshold.

enrico-usai commented 2 years ago

Thanks @rvencu for sharing all your improvements with the community.

According to OpenMPI documentation:

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Slurm directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Source: https://www.open-mpi.org/faq/?category=slurm

Anyway, if you want to give it a try, you can generate the hostfile with a submission script like:

#!/bin/bash
#SBATCH ...

module load openmpi
scontrol show hostnames $SLURM_NODELIST > hostlist
mpirun --hostfile hostlist ...

note that $SLURM_NODELIST is only available within the environment of the batch submission script, when the nodes are already allocated to the job: https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES

rvencu commented 2 years ago

So the main change in slurm.conf was MessageTimeout=180.
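For reference, a sketch of applying that change together with the txqueuelen tweak from the earlier comment (the slurm.conf path is the ParallelCluster default, and restarting slurmctld via systemd is an assumption):

# on the head node: set the Slurm message timeout (append the key if it is not already there)
grep -q '^MessageTimeout' /opt/slurm/etc/slurm.conf \
  || echo 'MessageTimeout=180' | sudo tee -a /opt/slurm/etc/slurm.conf
sudo systemctl restart slurmctld

# on head and compute nodes: enlarge the eth0 transmit queue
sudo ifconfig eth0 txqueuelen 4096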

To summarize all 3 changes that improved our installation:

We will benchmark how far we can scale at a later time, when more p4d nodes are available.

rvencu commented 2 years ago

We noticed a great reduction in head node congestion, but now the larger the scale, the more compute-to-compute connections fail. I believe that in the phase where the nodes become ready to launch the job they are losing IP connectivity to one another just to say "hey, I am alive". I do not believe EFA is even involved at this stage.

Also, Slurm messaging and Open MPI messaging are more impacted than Intel MPI messaging.

we tried some timing debugging and we see the rate at which nodes become available is slow, too slow to feel normal

enrico-usai commented 2 years ago

@rvencu could you share how you are submitting the job (mpirun vs srun vs sbatch) and possibly the submission script? I'm asking to understand whether there are other parameters we can suggest to tune your environment (e.g. specifying some OMPI_* env variable).

When you're talking about the Database, do you mean you have Slurm accounting enabled? Can you share the details of the issues related to the database when you were using a small db size?

Could you also check if there are NFS-related errors in /var/log/messages or by running a tool like nfsstat?
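For example, a minimal sketch of such a check:

# on a compute node: client-side RPC stats; a climbing "retrans" count suggests the
# head node's NFS export of /home is not keeping up
nfsstat -rc
# on either side: scan the system log for NFS errors
sudo grep -i nfs /var/log/messages | tail -n 20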

It might be useful to check VPC Flow Logs to verify REJECTED messages, see:

Then, it's good to double-check that you're following all the best practices in the official Slurm High Throughput doc.

Enrico

rvencu commented 2 years ago

Thanks. Here are my nccl-tests that fail if there are too many nodes:

#!/bin/bash
#SBATCH --partition=compute-od-gpu
#SBATCH --job-name=nccl-tests
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH --output=%x_%j.out
module load openmpi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nccl/build/lib:/opt/aws-ofi-nccl-install/lib
export NCCL_PROTO=simple
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/aws-ofi-nccl/lib
export PATH=$PATH:/opt/amazon/efa/bin:/opt/amazon/openmpi/bin
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
export NCCL_DEBUG=info
export OMPI_MCA_mtl_base_verbose=1
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_TREE_THRESHOLD=0
mpirun -n 160 -N 8  --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker1 --bind-to none \
       /opt/nccl-tests/build/all_reduce_perf -b 128M -e 8G -f 2 -g 1 -c 1 -n 20

This uses Open MPI's mpirun, obviously. With Intel MPI we just load the intelmpi module and source the environment as instructed; everything else stays the same.

We submit with sbatch <above file>.

Yes, we have MySQL on an RDS instance and want to use Slurm accounting, but right now nothing is configured except slurmdbd running as a service. With a small DB we encountered errors consistent with the database not being able to cope with the number of requests, though we had a bug in the slurmdbd service definition that also made the process die, so I would not try to debug this issue just yet.
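A quick health check for that part, as a sketch (assuming slurmdbd runs as a systemd service on the head node):

# is slurmdbd up, and does the accounting database answer?
systemctl status slurmdbd --no-pager
sacctmgr show cluster   # should list the cluster once accounting storage is actually configured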

cat /var/log/messages | grep NFS returns empty response

Tonight we get 200 more p4d instances in a new ODCR. We will restart the scaling tests once we have this capacity. I just activated VPC Flow Logs for the entire VPC, filtered to REJECTED.

We read the official Slurm high-throughput doc, but of course we will double-check.

Will try to capture and bring more logs from upcoming tests

rvencu commented 2 years ago

The following gist gives an example of a failing Python script using PyTorch DDP: https://gist.github.com/rom1504/8da2e461e7416537c03170138e21f995

rvencu commented 2 years ago

We had a controller meltdown last night, without any trace in the log file. Restarting the slurmctld service restored the controller.

The reason I am mentioning this here is that I saw this line in the log when I restarted the service:

error: MessageTimeout is too high for effective fault-tolerance

perhaps we will need to revisit the setting we modified above and solve the root cause, not the symptom...

rvencu commented 2 years ago

We have rejected connections in the flow logs, like this:

2   8.42865E+11 eni-07a4f3018b5329a07   172.31.18.55    172.31.41.163   3306    46614   6   1   40  1658386089  1658386121  REJECT  OK
2   8.42865E+11 eni-07a4f3018b5329a07   172.31.18.55    172.31.41.163   3306    46640   6   1   40  1658386089  1658386121  REJECT  OK
2   8.42865E+11 eni-07a4f3018b5329a07   172.31.18.55    172.31.41.163   3306    46612   6   1   40  1658386089  1658386121  REJECT  OK
2   8.42865E+11 eni-07a4f3018b5329a07   172.31.18.55    172.31.41.163   3306    46892   6   1   40  1658386150  1658386181  REJECT  OK
2   8.42865E+11 eni-0944da571db9d820b   172.31.32.220   172.31.45.154   3306    47276   6   1   40  1658386087  1658386119  REJECT  OK
2   8.42865E+11 eni-0944da571db9d820b   172.31.32.220   172.31.45.154   3306    47164   6   1   40  1658386087  1658386119  REJECT  OK
2   8.42865E+11 eni-0944da571db9d820b   172.31.32.220   172.31.45.154   3306    47162   6   1   40  1658386087  1658386119  REJECT  OK
2   8.42865E+11 eni-0944da571db9d820b   172.31.32.220   172.31.45.154   3306    48688   6   2   80  1658386147  1658386179  REJECT  OK

I filtered source and destination to be within our VPC CIDR range. I am not sure whether these are related to the failed jobs; I think I need to log the node IPs and then search the logs again.

rvencu commented 2 years ago

I tracked down 2 nodes that could not communicate, and there were no rejected attempts in the flow logs:

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    compute-od-gpu-dy-p4d-24xlarge-6
  Remote host:   compute-od-gpu-dy-p4d-24xlarge-33
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
^C
[ec2-user@ip-172-31-45-154 shared]$ ping compute-od-gpu-dy-p4d-24xlarge-6
PING compute-od-gpu-dy-p4d-24xlarge-6.hpc-1click-production4.pcluster (172.31.233.150) 56(84) bytes of data.
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=1 ttl=255 time=0.216 ms
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=2 ttl=255 time=0.180 ms
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=3 ttl=255 time=0.173 ms
64 bytes from ip-172-31-233-150.ec2.internal (172.31.233.150): icmp_seq=4 ttl=255 time=0.184 ms
^C
--- compute-od-gpu-dy-p4d-24xlarge-6.hpc-1click-production4.pcluster ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3052ms
rtt min/avg/max/mdev = 0.173/0.188/0.216/0.019 ms
[ec2-user@ip-172-31-45-154 shared]$ ping compute-od-gpu-dy-p4d-24xlarge-33
PING compute-od-gpu-dy-p4d-24xlarge-33.hpc-1click-production4.pcluster (172.31.238.0) 56(84) bytes of data.
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=1 ttl=255 time=0.215 ms
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=2 ttl=255 time=0.186 ms
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=3 ttl=255 time=0.184 ms
64 bytes from ip-172-31-238-0.ec2.internal (172.31.238.0): icmp_seq=4 ttl=255 time=0.178 ms

172.31.235.211
 172.31.233.150
 172.31.232.204
 172.31.231.18

 172.31.230.184
 172.31.238.0
 172.31.233.189
 172.31.232.55

I searched for all sources and destinations like 172.31.23?.??? and there were no records.

This leads me to think that what is really happening is that node 6 came alive earlier, found the list of all upcoming nodes, and tried to connect to them; node 33 came up too late, after the connection timeout.

We need to understand why the nodes come up so slowly one after another. Might the post-install script be taking too long (though I do not see why the network should not work during that time)? Or does simply spinning up many p4d nodes at once form some kind of slow queue in AWS virtualization? Or is it an artefact of the ODCR mechanism?

rvencu commented 2 years ago

Spun up a cluster with 160 p4d nodes permanently up and retried nccl-tests. Same issue: with more than 36 nodes the tests fail the same way. So the time to bring up the nodes and run the post-install scripts does not matter.

charlesg3 commented 2 years ago

Hi @rvencu ,

I just want to get some clarity on your latest test:

> Spun up a cluster with 160 p4d nodes permanently up and retried nccl-tests.

Did all of the compute nodes start successfully? Was this without the post-install script?

rvencu commented 2 years ago

I spun up 160 nodes (down from the 174 presented as available, to avoid last-minute insufficient-capacity errors). They have the post-install scripts and everything. I guess the only script that is not run is the Slurm prolog.

I added some custom things in that one, but the problem manifests the same as before. The custom things were done to tame CUDA in the containers.

https://github.com/rvencu/1click-hpc/blob/main/scripts/prolog.sh

At least this is my current understanding: when launched, the compute nodes execute the post-install scripts automatically, while the prolog is executed just before a job is started.

rvencu commented 2 years ago

Actually, the only thing that seems to run better (which does not mean perfectly) across all our tests is the following (a sketch follows the list):

  1. load intelmpi and activate it
  2. use mpirun
  3. use a single task per node
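A hedged sketch of that combination (the module name follows the earlier comment; the I_MPI_HYDRA_BOOTSTRAP setting is an assumption equivalent to passing -bootstrap slurm to mpirun):

module load intelmpi
export I_MPI_HYDRA_BOOTSTRAP=slurm        # let Hydra launch tasks through Slurm
mpirun -ppn 1 -n $SLURM_NNODES hostname   # one task per node as a smoke test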
charlesg3 commented 2 years ago

@rvencu -- did you try specifying the hostfile directly with openmpi (see: https://github.com/aws/aws-parallelcluster/issues/4179#issuecomment-1181565998)? Curious if that changes the behavior at all?

rvencu commented 2 years ago

This is the result with the explicit hostfile; slightly different, but essentially the same (maybe it is a head node to compute problem):

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
rvencu commented 2 years ago

Well, going back to square one and trying super basic things on 42 nodes.

debug.sh

# print seconds elapsed since the job started (JOBSTART is exported by the batch script below)
START_TIME=$JOBSTART
echo "$(($(date +%s) - $START_TIME)) : $(hostname)"

test.sh

#!/bin/bash
#SBATCH --partition=compute-od-gpu
#SBATCH --job-name=test
#SBATCH --nodes=42
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH --output=%x_%j.out
module load openmpi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nccl/build/lib:/opt/aws-ofi-nccl-install/lib
export NCCL_PROTO=simple
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/aws-ofi-nccl/lib
export PATH=$PATH:/opt/amazon/efa/bin:/opt/amazon/openmpi/bin
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
export NCCL_DEBUG=info
export OMPI_MCA_mtl_base_verbose=1
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_TREE_THRESHOLD=0
export JOBSTART=$(date +%s)
srun sh debug.sh

result:

0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
68 : compute-od-gpu-st-p4d-24xlarge-36
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
69 : compute-od-gpu-st-p4d-24xlarge-37
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
129 : compute-od-gpu-st-p4d-24xlarge-23
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
134 : compute-od-gpu-st-p4d-24xlarge-2
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-22
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
137 : compute-od-gpu-st-p4d-24xlarge-34
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
145 : compute-od-gpu-st-p4d-24xlarge-30
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
147 : compute-od-gpu-st-p4d-24xlarge-10
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-12
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
161 : compute-od-gpu-st-p4d-24xlarge-4
srun: error: slurm_receive_msgs: [[ip-172-31-224-213.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-237-223.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-222.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-160.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-227-160.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-193.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-238-216.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-196.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-226-192.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-230-255.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-238-197.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-224-203.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-234-202.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-236-235.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-38.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-82.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-230-207.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-231-204.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-227-193.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-232-219.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-233-200.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-31.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-208.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-84.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-226-249.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-224-83.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-237.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-234-84.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-39.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-225-245.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-3: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-5: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-6: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-7: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-8: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-9: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-11: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-17: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-13: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-14: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-15: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-16: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-18: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-19: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-20: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-42: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-21: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-24: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-27: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-26: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-28: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-29: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-40: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-31: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-32: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-33: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-38: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-35: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-39: Socket timed out on send/recv operation
srun: error: Task launch for StepId=20.0 failed on node compute-od-gpu-st-p4d-24xlarge-41: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
192 : compute-od-gpu-st-p4d-24xlarge-17
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
206 : compute-od-gpu-st-p4d-24xlarge-14
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
207 : compute-od-gpu-st-p4d-24xlarge-13
srun: error: Timed out waiting for job step to complete
rvencu commented 2 years ago

it seems that even 36 nodes barely work

0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
0 : compute-od-gpu-st-p4d-24xlarge-1
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
54 : compute-od-gpu-st-p4d-24xlarge-36
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
57 : compute-od-gpu-st-p4d-24xlarge-10
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
61 : compute-od-gpu-st-p4d-24xlarge-9
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
89 : compute-od-gpu-st-p4d-24xlarge-20
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
96 : compute-od-gpu-st-p4d-24xlarge-32
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
112 : compute-od-gpu-st-p4d-24xlarge-25
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
129 : compute-od-gpu-st-p4d-24xlarge-12
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
138 : compute-od-gpu-st-p4d-24xlarge-31
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
141 : compute-od-gpu-st-p4d-24xlarge-2
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
143 : compute-od-gpu-st-p4d-24xlarge-14
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
149 : compute-od-gpu-st-p4d-24xlarge-21
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-28
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
154 : compute-od-gpu-st-p4d-24xlarge-27
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
158 : compute-od-gpu-st-p4d-24xlarge-24
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-29
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
159 : compute-od-gpu-st-p4d-24xlarge-33
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
161 : compute-od-gpu-st-p4d-24xlarge-7
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
170 : compute-od-gpu-st-p4d-24xlarge-3
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
171 : compute-od-gpu-st-p4d-24xlarge-15
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
172 : compute-od-gpu-st-p4d-24xlarge-11
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-16
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
178 : compute-od-gpu-st-p4d-24xlarge-18
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-35
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
181 : compute-od-gpu-st-p4d-24xlarge-34
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
182 : compute-od-gpu-st-p4d-24xlarge-26
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
185 : compute-od-gpu-st-p4d-24xlarge-17
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
186 : compute-od-gpu-st-p4d-24xlarge-8
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
187 : compute-od-gpu-st-p4d-24xlarge-30
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
189 : compute-od-gpu-st-p4d-24xlarge-22
srun: error: slurm_receive_msgs: [[ip-172-31-226-213.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-228-222.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-237-223.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-226-192.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-236-235.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: [[ip-172-31-229-34.ec2.internal]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-4: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-6: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-5: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-13: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-19: Socket timed out on send/recv operation
srun: error: Task launch for StepId=22.0 failed on node compute-od-gpu-st-p4d-24xlarge-23: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-od-gpu-st-p4d-24xlarge-6: tasks 40-47: Killed
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
191 : compute-od-gpu-st-p4d-24xlarge-13
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
193 : compute-od-gpu-st-p4d-24xlarge-5
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
194 : compute-od-gpu-st-p4d-24xlarge-23
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
195 : compute-od-gpu-st-p4d-24xlarge-4
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
197 : compute-od-gpu-st-p4d-24xlarge-19
charlesg3 commented 2 years ago

Hi @rvencu -- this is valuable new information, are the new tests done with a post-install script as well?

Out of curiosity what is the OS you are using?

Can you also paste the full cluster config and link/paste post-install script if you're using one?

charlesg3 commented 2 years ago

It would be great to see the other side of this bisection effort (e.g. a case where the nodes boot successfully). If you haven't already, can you remove the prolog and post-install script to see whether the nodes are able to start without any additional customizations?

rvencu commented 2 years ago

Before attempting the above, here is an incomplete result of manual experimentation with various setups: native Open MPI vs Open MPI with srun vs native Intel MPI vs Intel MPI with srun (results were attached as an image).

This explains why the native Intel MPI run (I included -bootstrap slurm, though) can scale up much more than the rest.

I did this with all post-install scripts enabled, indeed. The nodes were already running from the AWS point of view, so this is time taken purely by Slurm/MPI to set up the tasks.

wzlu-aws commented 2 years ago

The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.

rvencu commented 2 years ago

> The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.

We clearly need to learn how to design for this cascading scalability and attempt that... Do you have any documentation suggestions?

wzlu-aws commented 2 years ago

> The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.
>
> We clearly need to learn how to design for this cascading scalability and attempt that... Do you have any documentation suggestions?

It's kind of difficult to find a clear document on this. (1) I'd start with https://www.open-mpi.org/faq/?category=large-clusters. (2) Try to use Open MPI's built-in scheduler integration for launching tasks (--mca parameter).
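As one concrete reading of point (2), a sketch (plm is the standard Open MPI launcher framework; the application name is a placeholder):

# launch ORTE daemons through Slurm (srun) instead of the rsh/ssh tree spawn
mpirun --mca plm slurm -n 320 -N 8 ./my_app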

charlesg3 commented 2 years ago

@rvencu -- I am curious also if it is related to this issue: https://github.com/open-mpi/ompi/issues/4578 and perhaps adding -mca plm_rsh_no_tree_spawn 1 to mpirun would help?

rvencu commented 2 years ago

> It would be great to see the other side of this bisection effort (e.g. a case where the nodes boot successfully). If you haven't already, can you remove the prolog and post-install script to see whether the nodes are able to start without any additional customizations?

I am preparing to do so. In the meantime, Slurm support suggested this:

The prolog.sh attached is run on every step which could slow startup. If you want your job to start and not wait for the prolog, set this in slurm.conf:
> PrologFlags=NoHold

However, looking at the prolog.sh, it appears to be a job script, not a per-step script. It is also doing RPCs to the controller which is explicitly ill-advised in a prolog/epilog script.

I suggest swapping:
> Prolog=/opt/slurm/etc/prolog.sh

to
> PrologSlurmctld=/opt/slurm/etc/prolog.sh

in the slurm.conf and restarting all daemons.

Will update here when I have results

rvencu commented 2 years ago

As a side note, running so many test jobs with a large number of nodes and watching squeue, I noticed that releasing the nodes is also a long process; the nodes are released slowly, with timing similar to what we saw at start.

And there is no epilog script in the configuration; I thought this was worth mentioning.

charlesg3 commented 2 years ago

Yes -- I also note that the prolog script makes calls to the AWS API, which risks throttling at higher scale, which is why I wanted to dig into how much of an effect that might be causing versus potentially other issues.

rvencu commented 2 years ago

OK, taking the prolog out of the config makes the launch timing fall to 1 second for 40 nodes, so yes, this seems to be the root cause.

I will now reintroduce the things that seem safe and time them again to detect where the bottleneck is.

charlesg3 commented 2 years ago

@rvencu -- excellent!

Are you able to scale to all 200 nodes without the prolog?

rvencu commented 2 years ago

We do not have 200 nodes yet, but I ran nccl-tests with the 120 nodes I found available and it was successful.

The problem is in the prolog: the instance tagging API call is slow, and needs to be moved out and run asynchronously.

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 12.3392
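Returning to the prolog, a hypothetical sketch of moving the tagging off the critical path (the tag key and the IMDSv2 lookup are illustrative, not taken from the actual prolog.sh):

# run the slow EC2 tagging in the background so the prolog returns immediately
(
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
          -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
          http://169.254.169.254/latest/meta-data/instance-id)
  aws ec2 create-tags --resources "$INSTANCE_ID" \
          --tags Key=SlurmJobId,Value="$SLURM_JOB_ID"   # hypothetical tag key
) >/dev/null 2>&1 &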