dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure and more...
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

AWS EC2 - Automatically chosen subnet does not match parametrized AZ #428

Open cmillani opened 1 month ago

cmillani commented 1 month ago

Describe the issue:

ec2.py picks the first subnet of the specified VPC (or of the default VPC if none is specified) and ignores the availability_zone (AZ) parameter.

Some instance types are not available in every AZ, so it can be necessary to specify an AZ, but the AZ passed in may then conflict with the subnet selected in the step described above.
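To see which AZs offer a given instance type before choosing one, something along these lines may help. It is only a sketch: `zones_offering` and `StubClient` are hypothetical names that are not part of dask-cloudprovider, the stub data is made up, and only the first page of results is read. The one real AWS call assumed here is the EC2 `describe_instance_type_offerings` operation, which a real aiobotocore EC2 client exposes.

```python
import asyncio


async def zones_offering(client, instance_type):
    """Return the sorted AZ names, in the client's region, that offer instance_type.

    Reads a single page of results; a real implementation may need to
    follow NextToken for regions with many offerings.
    """
    response = await client.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return sorted(o["Location"] for o in response["InstanceTypeOfferings"])


class StubClient:
    """Stand-in for an aiobotocore EC2 client so the sketch runs offline.

    The offerings below are invented for illustration only.
    """

    async def describe_instance_type_offerings(self, **kwargs):
        return {
            "InstanceTypeOfferings": [
                {
                    "InstanceType": "m5.large",
                    "LocationType": "availability-zone",
                    "Location": "us-east-1b",
                },
                {
                    "InstanceType": "m5.large",
                    "LocationType": "availability-zone",
                    "Location": "us-east-1a",
                },
            ]
        }


zones = asyncio.run(zones_offering(StubClient(), "m5.large"))
print(zones)
```

With a real client, an AZ taken from this list could then be passed as availability_zone, which is exactly the case where the subnet mismatch described in this issue shows up.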

Minimal Complete Verifiable Example:

At the time of writing, the m5.large instance type is not supported in us-east-1e, which in my case is the AZ of the subnet returned at index 0 when listing the subnets of the default VPC.

To reproduce this reliably, we can force the AZ of the second subnet:

# Run in a Jupyter/IPython session, which allows top-level `async with` and `await`.
from aiobotocore.session import get_session
from botocore.config import Config
from dask_cloudprovider.aws.ec2 import EC2Cluster
from dask_cloudprovider.aws.helper import get_default_vpc

boto_config = Config(retries=dict(max_attempts=10))
region = "us-east-1"
async with get_session().create_client("ec2", region_name=region, config=boto_config) as client:
    vpc = await get_default_vpc(client)
    # Keep only the default VPC's subnets, in the order the API returns them
    subnets = [s for s in (await client.describe_subnets())["Subnets"] if s["VpcId"] == vpc]
    # `dask_cloudprovider.aws.ec2` uses the subnet at [0], so taking the AZ of [1] forces the issue
    az = subnets[1]["AvailabilityZone"]
    EC2Cluster(
        region=region,
        availability_zone=az,
        security=False,  # Simply to avoid requiring the cryptography package
        scheduler_instance_type="m5.large",
        worker_instance_type="m5.large",
    )

This outputs:

2024-06-09 14:46:30,639 - distributed.deploy.spec - WARNING - Cluster closed without starting up

Inspecting the stack trace shows the following error:

[...]
ClientError: An error occurred (InvalidParameterValue) when calling the RunInstances operation: Value (us-east-1a) for parameter availabilityZone is invalid. Subnet '<REDACTED>' is in the availability zone us-east-1e

During handling of the above exception, another exception occurred:
[...]

Anything else we need to know?:

Changing dask_cloudprovider.aws.helper.get_vpc_subnets to accept the availability zone and filter on it should fix the issue. If this makes sense I could open a PR! :)

async def get_vpc_subnets(client, vpc, availability_zone):
    vpcs = (await client.describe_vpcs())["Vpcs"]
    [vpc] = [x for x in vpcs if x["VpcId"] == vpc]
    subnets = (await client.describe_subnets())["Subnets"]
    return [
        subnet["SubnetId"]
        for subnet in subnets
        if subnet["VpcId"] == vpc["VpcId"]
        and subnet["AvailabilityZone"] == availability_zone
    ]
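One possible variant, sketched here as an assumption and not as the change that was eventually made, keeps the AZ filter optional so that callers that do not pass an AZ see the same result as today. The function name `pick_subnets` and the sample data are hypothetical. It works on the plain list of subnet dictionaries that `describe_subnets` returns under "Subnets", so it can be tried without AWS credentials.

```python
def pick_subnets(subnets, vpc_id, availability_zone=None):
    """Return the IDs of the subnets in vpc_id.

    When availability_zone is given, only subnets in that AZ are kept;
    when it is None, every subnet of the VPC is returned.
    """
    return [
        s["SubnetId"]
        for s in subnets
        if s["VpcId"] == vpc_id
        and (availability_zone is None or s["AvailabilityZone"] == availability_zone)
    ]


# Made-up subnets shaped like entries of describe_subnets()["Subnets"]
SUBNETS = [
    {"SubnetId": "subnet-e", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1e"},
    {"SubnetId": "subnet-a", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-x", "VpcId": "vpc-2", "AvailabilityZone": "us-east-1a"},
]

print(pick_subnets(SUBNETS, "vpc-1"))                # all subnets of vpc-1
print(pick_subnets(SUBNETS, "vpc-1", "us-east-1a"))  # only the us-east-1a subnet
```

Taking index 0 of the filtered list then yields a subnet in the requested AZ. An empty list means the VPC has no subnet in that AZ, which the caller would need to report.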

jacobtomlinson commented 1 month ago

A PR to do this would be very welcome!