Open un-def opened 1 month ago
Just random questions/thoughts:
Sorry for the delayed reply.
I assume the preferred approach would be to have EFAs created/deleted alongside the associated node
Yes, ideally the lifetime of resources should be bound to the lifetime of the parent resource.
AFAIK, only a single EFA can be attached upon EC2 creation
According to this snippet, it's possible to request multiple EFA interfaces via RunInstances, but it needs to be verified.
If it's true, the only limitation seems to be that we cannot use associatePublicIPAddress: true with multiple network interfaces via the RunInstances method.
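To make the RunInstances idea concrete, here is a minimal sketch of the NetworkInterfaces list such a request might carry. The helper name build_efa_interfaces is hypothetical, and the layout (one EFA per network card, device index 0 on card 0 and device index 1 on every other card) is an assumption that still needs to be verified against real instance types.

```python
from typing import Any, Dict, List


def build_efa_interfaces(
    subnet_id: str,
    security_group_ids: List[str],
    efa_count: int,
) -> List[Dict[str, Any]]:
    """Build the NetworkInterfaces list for an EC2 RunInstances request."""
    if efa_count < 1:
        raise ValueError("efa_count must be at least 1")
    interfaces: List[Dict[str, Any]] = []
    for card_index in range(efa_count):
        interfaces.append(
            {
                # Assumption: one EFA per network card; the primary interface
                # is device 0 on card 0, the others use device 1 on their card.
                "NetworkCardIndex": card_index,
                "DeviceIndex": 0 if card_index == 0 else 1,
                "InterfaceType": "efa",
                "SubnetId": subnet_id,
                "Groups": list(security_group_ids),
                # Bind the interface lifetime to the instance lifetime.
                "DeleteOnTermination": True,
            }
        )
    # A public IP is requested only in the single-interface case, since the
    # flag is not accepted together with multiple interfaces.
    if efa_count == 1:
        interfaces[0]["AssociatePublicIpAddress"] = True
    return interfaces
```

The result would be passed as NetworkInterfaces to RunInstances; with more than one interface the public IP has to come from somewhere else.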
can be attached only by stopping the node
Didn't know about this limitation, but you are right, we have to stop the instance first to attach interfaces.
A more viable way to attach multiple EFAs right away could be via EC2 Launch Templates
Not sure whether it has the same associatePublicIPAddress limitation; that needs to be checked, and, as you noted, migrating to templates would require additional rework. If it's possible to work around the associatePublicIPAddress limitation with the current AWSCompute.create_instance() implementation, I think that would be the preferred way, even if a bit hacky (i.e. create an instance → stop → attach interfaces → start).
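The stop → attach → start part of that sequence could look roughly like the sketch below. It is not how AWSCompute.create_instance() works today. The function name attach_extra_efas is hypothetical, ec2 is assumed to be a boto3 EC2 client, and the device/card index layout is an assumption. Whether the instance keeps a usable public IP after the stop/start cycle also still needs to be verified.

```python
from typing import Any, List


def attach_extra_efas(
    ec2: Any,  # assumed to be a boto3 EC2 client
    instance_id: str,
    subnet_id: str,
    security_group_ids: List[str],
    extra_count: int,
) -> List[str]:
    """Stop the instance, attach extra EFAs, start it again; return new ENI IDs."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    eni_ids: List[str] = []
    # Card 0 already holds the primary interface created with the instance.
    for card_index in range(1, extra_count + 1):
        eni = ec2.create_network_interface(
            SubnetId=subnet_id,
            Groups=security_group_ids,
            InterfaceType="efa",
        )
        eni_id = eni["NetworkInterface"]["NetworkInterfaceId"]
        attachment = ec2.attach_network_interface(
            NetworkInterfaceId=eni_id,
            InstanceId=instance_id,
            DeviceIndex=1,  # assumption: device 1 on each additional card
            NetworkCardIndex=card_index,
        )
        # Delete the interface together with the instance, so its lifetime
        # is bound to the parent resource.
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni_id,
            Attachment={
                "AttachmentId": attachment["AttachmentId"],
                "DeleteOnTermination": True,
            },
        )
        eni_ids.append(eni_id)

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return eni_ids
```

Taking the client as a parameter keeps the sketch easy to exercise with a stub before trying it against a real account.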
Each EFA requires an IP address
As far as I understand, only the primary interface needs a public IP address to make it possible to connect to the instance; all other interfaces are used only for node-to-node connectivity within a private network.
As for private IPs, according to the AWS docs, “The allowed IPv4 CIDR block size for a subnet is between a /28 netmask and /16 netmask” with “[d]efault subnets within a default VPC are assigned /20 netblocks within the VPC CIDR range.”
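A quick back-of-the-envelope check of what those subnet sizes mean for private IPs, assuming every interface takes one private IPv4 address and AWS reserves five addresses per subnet. The helper max_nodes and the example CIDR blocks are illustrative only.

```python
import ipaddress

# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def max_nodes(subnet_cidr: str, interfaces_per_node: int) -> int:
    """Upper bound on nodes that fit in a subnet, one private IP per interface."""
    total = ipaddress.ip_network(subnet_cidr).num_addresses
    usable = total - AWS_RESERVED_PER_SUBNET
    return usable // interfaces_per_node


if __name__ == "__main__":
    # Smallest allowed subnet, default-VPC subnet size, largest allowed subnet.
    for cidr in ("10.0.0.0/28", "172.31.0.0/20", "10.0.0.0/16"):
        print(cidr, max_nodes(cidr, interfaces_per_node=32))
```

With 32 interfaces per node this gives 0 nodes for a /28, 127 for a /20, and 2047 for a /16, so the default /20 subnets leave plenty of room while a /28 cannot fit even one such node.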
Confirming what you've mentioned above:
AssociatePublicIpAddress=true. I'll create a PR with a possible approach to this.
Problem
Currently, with the AWS backend, dstack unconditionally requests one network interface, even with instance types that have multiple network cards (e.g., p5.48xlarge has 32 EFA-capable cards). Network performance is crucial for distributed HPC workloads, thus a single network interface may be a bottleneck.
Solution
Enable all available interfaces by default.
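One possible way to find out how many interfaces "all available" means for a given instance type is to read the NetworkInfo block of a DescribeInstanceTypes response. The sketch below only parses one entry of that response. The helper name max_efa_interfaces is hypothetical, and falling back to 1 when EfaInfo is missing is an assumption.

```python
from typing import Any, Dict


def max_efa_interfaces(instance_type_info: Dict[str, Any]) -> int:
    """Return how many EFA interfaces an instance type reports, 0 if none."""
    network_info = instance_type_info.get("NetworkInfo", {})
    if not network_info.get("EfaSupported", False):
        return 0
    efa_info = network_info.get("EfaInfo", {})
    # Assumption: an EFA-capable type without EfaInfo supports a single EFA.
    return efa_info.get("MaximumEfaInterfaces", 1)
```

The argument would be one element of the InstanceTypes list returned by DescribeInstanceTypes, so the result could feed whatever code builds the interface list for the instance.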
Workaround
No response
Would you like to help us implement this feature by sending a PR?
No