aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0
828 stars 312 forks

Multiarch Cluster (Graviton + X86) #4319

Open ChristianKniep opened 2 years ago

ChristianKniep commented 2 years ago

As discussed via email with @demartinofra and Austin, I'd like to create a cluster with both x86 and ARM compute nodes. In my case, the head node is x86.

AFAIU this is currently not possible, since the head node exports /opt/slurm to all compute nodes over NFS.

$ sudo exportfs|grep slurm
/opt/slurm      10.0.0.0/16

Thus, the ARM compute nodes use the x86 binaries under /opt/slurm/bin and will segfault on any Slurm command.
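To confirm that a compute node really is picking up binaries built for the other architecture, one option is to read the `e_machine` field of the ELF header and compare it with what the host reports. The following is only a minimal sketch: the function names `elf_machine` and `binary_matches_host` are made up for illustration, and it only recognises the two architectures relevant here.

```python
import platform
import struct

# e_machine values from the ELF specification (only the two relevant here).
ELF_MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the architecture name encoded in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5 of e_ident): 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18 (after e_ident and e_type).
    (e_machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(e_machine, f"unknown(0x{e_machine:x})")


def binary_matches_host(path: str) -> bool:
    """True if the binary at `path` targets the architecture of this host."""
    with open(path, "rb") as f:
        arch = elf_machine(f.read(20))
    # On Linux, platform.machine() returns "x86_64" or "aarch64".
    return arch == platform.machine()
```

On an ARM compute node, `binary_matches_host("/opt/slurm/bin/sinfo")` would be expected to return `False` if the NFS-mounted directory holds the head node's x86 build. The `sinfo` path is an assumption based on the export shown above.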

hanwen-pcluste commented 2 years ago

Hi Christian,

Unfortunately, ParallelCluster does not support multiArch clusters. Could you describe your use case in detail?

Thank you, Hanwen

cartalla commented 1 year ago

In the EDA world, many applications support both architectures. Arm64 offers a lower cost per unit of performance, but not all applications support it, so in that use case both architectures are required. Supporting both also helps with capacity, since it makes more instances available for large workloads.

yoanisgil commented 1 year ago

I don't know about @ChristianKniep's use case but in ours we have a mix of GPU/CPU queues and we wanted to try the C7g instances for some of our CPU workloads (as they are cheaper and supposed to be faster).

However, since ParallelCluster does not support multiArch clusters, we would also need to migrate our GPU queue, which is currently set up to use a g4dn.2xlarge, to a g5g.4xlarge instance type. That is not viable for us, as most of our workloads are GPU driven and moving from G4dn to G5g represents a roughly 10% increase in costs.