Okay great -- so it sounds like the root cause has been identified for this particular issue?
As the prolog script is not directly a part of this project, I will mark this issue as closed if that is the root cause.
Thanks a lot.
We set up an HPC cluster and placed the HeadNode in a public subnet for easy access from the internet, while the compute fleet sits in a private subnet.
While we had only a limited number of p4d instances available, things ran normally, but we recently got a new capacity reservation and the node count went up to 194. That allows us to try large jobs with over 100 nodes / 800 GPUs with EFA enabled; these instances have 4 x 100 Gbps network interfaces each (so > 40,000 Gbps aggregate per job).
We noticed job failures due to network errors, both in head node to compute node operations and in compute-to-compute communication.
Open MPI in particular seems sensitive to this: it fails to launch batch jobs on more than 35 nodes. Intel MPI seems more resilient; we were able to run 80-node jobs but nothing above that. We also had an event where the Slurm controller lost connection to the entire fleet, marking all compute nodes down and rescheduling all jobs.
I would like some network recommendations for this scenario. What I am considering now is moving the head node into the same private subnet as the compute fleet (and providing SSH connectivity via tunneling).
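For reference, here is a minimal sketch of how SSH access could still work with the head node in a private subnet, assuming a small jump/bastion host is kept in the public subnet; the host names, addresses, user, and key file below are hypothetical placeholders, not our actual setup:

```
# ~/.ssh/config (client side) -- sketch only, all values are examples

Host bastion-public
    HostName 203.0.113.10          # public IP of the bastion (example address)
    User ec2-user
    IdentityFile ~/.ssh/cluster_key.pem

Host headnode-private
    HostName 10.0.2.15             # private IP of the head node (example address)
    User ec2-user
    IdentityFile ~/.ssh/cluster_key.pem
    ProxyJump bastion-public       # tunnel the connection through the bastion

# After that, a plain `ssh headnode-private` reaches the head node
# without exposing it directly to the internet.
```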
Is there anything we could do at the VPC/subnet configuration level to ease the congestion?