dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

[Feature]: Enable all available network cards on AWS instances #1804

Open un-def opened 1 month ago

un-def commented 1 month ago

Problem

Currently, with the AWS backend, dstack unconditionally requests one network interface, even for instance types that have multiple network cards (e.g., p5.48xlarge has 32 EFA-capable cards). Network performance is crucial for distributed HPC workloads, so a single network interface may become a bottleneck.

Solution

Enable all available interfaces by default.

Workaround

No response

Would you like to help us implement this feature by sending a PR?

No

solovyevt commented 1 month ago

Just random questions/thoughts:

un-def commented 1 month ago

Sorry for the delayed reply.

I assume the preferred approach would be to have EFAs created/deleted alongside the associated node

Yes, ideally the lifetime of these resources should be bound to the lifetime of the parent resource.

AFAIK, only a single EFA can be attached upon EC2 creation

According to this snippet, it's possible to request multiple EFA interfaces via RunInstances, but it needs to be verified.

If that's true, the only limitation seems to be that we cannot use associatePublicIPAddress: true when requesting multiple network interfaces via the RunInstances method.
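For illustration, here is a minimal sketch of what such a request could look like. It only builds the NetworkInterfaces list and does not call AWS. The helper name build_efa_interfaces is hypothetical, and the device index layout (device 0 on card 0 for the primary interface, device 1 on every other card) is an assumption that should be checked against the AWS EFA docs for the instance type in question.

```python
from typing import Any, Dict, List


def build_efa_interfaces(
    network_card_count: int,
    subnet_id: str,
    security_group_ids: List[str],
) -> List[Dict[str, Any]]:
    """Build a NetworkInterfaces list with one EFA interface per network card.

    Hypothetical helper, not part of dstack. AssociatePublicIpAddress is
    deliberately left out because, as discussed above, it cannot be combined
    with multiple interfaces in a single RunInstances call.
    """
    interfaces = []
    for card_index in range(network_card_count):
        interfaces.append(
            {
                "NetworkCardIndex": card_index,
                # Assumption: the primary interface is device 0 on card 0,
                # every additional card uses device 1.
                "DeviceIndex": 0 if card_index == 0 else 1,
                "InterfaceType": "efa",
                # With explicit interfaces, the subnet and security groups
                # go into each interface spec, not the top-level request.
                "SubnetId": subnet_id,
                "Groups": list(security_group_ids),
                # Bind the interface lifetime to the instance lifetime.
                "DeleteOnTermination": True,
            }
        )
    return interfaces
```

The result would be passed as the NetworkInterfaces argument of boto3's ec2_client.run_instances(...). Whether AWS accepts all 32 interfaces in one call still needs to be verified, as noted above.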

can be attached only by stopping the node

Didn't know about this limitation, but you are right, we have to stop the instance first to attach interfaces.

A more viable way to attach multiple EFAs right away could be via EC2 Launch Templates

I'm not sure whether it has the same limitation with associatePublicIPAddress; that needs to be checked. And, as you noted, migrating to templates would require additional rework. If it's possible to work around the associatePublicIPAddress limitation with the current AWSCompute.create_instance() implementation, I think that would be the preferred way, even if it's a bit hacky (i.e., create an instance → stop → attach interfaces → start).
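The "stop → attach interfaces → start" part of that flow could be sketched roughly as follows. This is not the actual dstack implementation: attach_extra_efas is a hypothetical helper, ec2 is assumed to be a boto3 EC2 client (or anything with the same methods), and the device index layout for the additional cards is an assumption.

```python
from typing import Any, List


def attach_extra_efas(
    ec2: Any,
    instance_id: str,
    subnet_id: str,
    security_group_ids: List[str],
    network_card_count: int,
) -> List[str]:
    """Stop the instance, attach one EFA per remaining card, start it again.

    Hypothetical helper. Card 0 is assumed to already hold the primary
    interface (with the public IP) created at launch time.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    eni_ids = []
    for card_index in range(1, network_card_count):
        eni = ec2.create_network_interface(
            SubnetId=subnet_id,
            Groups=security_group_ids,
            InterfaceType="efa",
        )
        eni_id = eni["NetworkInterface"]["NetworkInterfaceId"]
        attachment = ec2.attach_network_interface(
            NetworkInterfaceId=eni_id,
            InstanceId=instance_id,
            DeviceIndex=1,  # assumption: device 1 on each additional card
            NetworkCardIndex=card_index,
        )
        # Delete the interface together with the instance, so that its
        # lifetime is bound to the parent resource.
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni_id,
            Attachment={
                "AttachmentId": attachment["AttachmentId"],
                "DeleteOnTermination": True,
            },
        )
        eni_ids.append(eni_id)

    ec2.start_instances(InstanceIds=[instance_id])
    return eni_ids
```

The sketch has no error handling: if an attach call fails midway, the already created interfaces would have to be cleaned up manually.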

Each EFA requires an IP address

As far as I understand, only the primary interface is required to have a public IP address to make it possible to connect to the instance; all other interfaces are only used for node-to-node connectivity within a private network.

As for private IPs, according to the AWS docs, “The allowed IPv4 CIDR block size for a subnet is between a /28 netmask and /16 netmask” with “[d]efault subnets within a default VPC are assigned /20 netblocks within the VPC CIDR range.”
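To get a feeling for what these subnet sizes mean in practice, here is a rough back-of-the-envelope calculation. It assumes that every interface consumes exactly one private IPv4 address and relies on AWS reserving five addresses in each subnet. The function names and example CIDR blocks are made up for illustration.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of private IPv4 addresses available for interfaces in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_nodes(cidr: str, interfaces_per_node: int) -> int:
    """How many nodes fit if each interface takes one private IP (assumption)."""
    return usable_ips(cidr) // interfaces_per_node


if __name__ == "__main__":
    for cidr in ("10.0.0.0/28", "172.31.0.0/20", "10.0.0.0/16"):
        print(cidr, usable_ips(cidr), max_nodes(cidr, interfaces_per_node=32))
```

Under these assumptions, a default /20 subnet leaves room for 127 nodes with 32 interfaces each, while a /28 subnet cannot hold even a single such node.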

solovyevt commented 2 weeks ago

Confirming what you've mentioned above:

I'll create a PR with a possible approach to this.