aws / aws-parallelcluster

AWS ParallelCluster is an AWS-supported open source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Pcluster 3.1.4 - network congestion for large scale jobs #4179

Closed rvencu closed 2 years ago

rvencu commented 2 years ago

We started an HPC cluster and placed the HeadNode in a public subnet for easy access from the internet, while the compute fleet sits in a private subnet.

While we had only a limited number of p4d instances available, things ran normally, but we recently received a new capacity reservation and the node count went up to 194, allowing us to try large jobs with over 100 nodes / 800 GPUs with EFA enabled. These instances have four 100 Gbps network interfaces each (so > 40,000 Gbps per job).

We noticed job failures due to network errors, both in head-node-to-compute-node operations and in compute-to-compute communications.

Open MPI in particular seems sensitive to this: it fails to launch batch jobs on more than 35 nodes. Intel MPI seems more resilient; we were able to run 80-node jobs, but nothing above that. We also had an event where the Slurm controller lost connection to the entire fleet, marking all compute nodes down and rescheduling all jobs.

I would like some network recommendations for this scenario. What I am considering now is moving the head node into the same private subnet as the compute fleet (and providing SSH connectivity via tunneling).
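One common way to implement the tunneling mentioned above is an SSH `ProxyJump` through a small bastion host left in the public subnet. A minimal sketch, assuming a hypothetical bastion at `bastion.example.com` and a head node at the private IP `10.0.2.10` (both names and paths are placeholders, not from this cluster):

```
# ~/.ssh/config -- hypothetical hosts; adjust names, IPs, users, and key paths
Host pcluster-bastion
    HostName bastion.example.com
    User ec2-user
    IdentityFile ~/.ssh/pcluster-key.pem

Host pcluster-head
    HostName 10.0.2.10
    User ec2-user
    IdentityFile ~/.ssh/pcluster-key.pem
    # Tunnel through the bastion instead of exposing the head node publicly
    ProxyJump pcluster-bastion
```

With this in place, `ssh pcluster-head` reaches the head node transparently, so only the bastion needs a public IP.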

Is there anything that we could do at the level of VPC/subnet configurations to ease up the congestion?

charlesg3 commented 2 years ago

Okay, great -- so it sounds like the root cause has been identified for this particular issue?

As the prolog script is not directly a part of this project, I will mark this issue as closed if that is the root cause.

rvencu commented 2 years ago

thanks a lot.