dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure, and more.
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

ECS import fails with " failed to start" #209

Open eddojansen opened 3 years ago

eddojansen commented 3 years ago

>>> from dask_cloudprovider.aws import ECSCluster
>>> cluster = ECSCluster(
...     cluster_arn="arn:aws:ecs:us-west-2:131360002788:cluster/ejansen-cluster-3",
...     n_workers=2,
...     worker_gpu=1,
...     fargate_scheduler=True
... )
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 727, in __init__
    super().__init__(**kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/spec.py", line 276, in __init__
    self.sync(self._start)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/cluster.py", line 183, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 924, in _start
    await super()._start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/spec.py", line 304, in _start
    self.scheduler = await self.scheduler
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 162, in _
    await self.start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 284, in start
    raise RuntimeError("%s failed to start" % type(self).__name__)
RuntimeError: Scheduler failed to start

What happened: I'm trying to test NVIDIA RAPIDS functionality using Dask following this guide: https://rapids.ai/cloud#AWS-EC2

The first issue when following the guide is that the import needs to be dask_cloudprovider.aws rather than dask_cloudprovider. The second issue is that creating the cluster never completes and fails with the error above.

What you expected to happen: I was expecting the cluster to be created.

Minimal Complete Verifiable Example: Follow the guide https://rapids.ai/cloud#AWS-EC2

>>> from dask_cloudprovider.aws import ECSCluster
>>> cluster = ECSCluster(
...                             cluster_arn="arn:aws:ecs:us-west-2:xxxxxxxxxx:cluster/ejansen-cluster-3",
...                             n_workers=2,
...                             worker_gpu=1,
...                             fargate_scheduler=True
...                          )


Anything else we need to know?:

Environment:

Thanks! Eddo

jacobtomlinson commented 3 years ago

Thanks for raising this @eddojansen.

This happens when the container fails to start. When ECS creates the container it is in a PENDING or PROVISIONING state, then it should move to a RUNNING state. This error is raised when it moves to something else (ERROR for example).
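The transition check described above can be sketched as a simple polling loop (an illustrative simplification, not dask-cloudprovider's actual code; the get_status callable is a stand-in for an ECS describe_tasks call):

```python
import time


def wait_until_running(get_status, poll_interval=1.0, timeout=60.0):
    """Poll a task-status callable until the task reports RUNNING.

    Raises RuntimeError if the task leaves the startup states
    (PROVISIONING/PENDING/ACTIVATING) without reaching RUNNING,
    e.g. when it goes straight to STOPPED.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "RUNNING":
            return status
        if status not in ("PROVISIONING", "PENDING", "ACTIVATING"):
            # Mirrors the library's behaviour of raising "... failed to start"
            raise RuntimeError("task in state %s failed to start" % status)
        time.sleep(poll_interval)
    raise TimeoutError("task did not reach RUNNING within %.0fs" % timeout)
```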

The best course of action here is to look at the task in the AWS dashboard to see what went wrong. It would be interesting if you could share that here.
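If the dashboard is awkward to reach, the same information can be pulled with boto3 (a sketch assuming configured AWS credentials; note that stopped tasks are not returned by default and only remain queryable for a limited time, roughly an hour, after stopping). The stopped_reason_summary helper just reshapes the describe_tasks response:

```python
def stopped_reason_summary(task):
    # Pull out the fields that usually explain a startup failure.
    containers = [
        (c.get("name"), c.get("reason", "-")) for c in task.get("containers", [])
    ]
    return {
        "lastStatus": task.get("lastStatus"),
        "stoppedReason": task.get("stoppedReason"),
        "containers": containers,
    }


def inspect_stopped_tasks(cluster_arn):
    import boto3  # imported lazily so the helper above works without AWS installed

    ecs = boto3.client("ecs")
    # Stopped tasks are hidden by default, so ask for them explicitly.
    arns = ecs.list_tasks(cluster=cluster_arn, desiredStatus="STOPPED")["taskArns"]
    if not arns:
        return []
    tasks = ecs.describe_tasks(cluster=cluster_arn, tasks=arns)["tasks"]
    return [stopped_reason_summary(t) for t in tasks]
```

A stoppedReason such as "CannotPullContainerError" typically points at the networking problems discussed further down this thread.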

We should also improve this error message for sure.

eddojansen commented 3 years ago

@jacobtomlinson

After double-checking that the cluster and instances are up and green, I'm still seeing the same issue. I verified that my AWS API access works by running aws ecs list-clusters:

{
    "clusterArns": [
        "arn:aws:ecs:us-west-2:xxxxxxxxxxx:cluster/ejansen-cluster-4"
    ]
}

The task list in the ECS cluster is empty. Any ideas or suggestions?

Thanks, Eddo

>>> cluster = ECSCluster(
...                             cluster_arn="arn:aws:ecs:us-west-2:xxxxxxxxxxx:cluster/ejansen-cluster-4",
...                             n_workers=2,
...                             worker_gpu=1,
...                             fargate_scheduler=True
...                          )

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 727, in __init__
    super().__init__(**kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/spec.py", line 276, in __init__
    self.sync(self._start)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/cluster.py", line 183, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 924, in _start
    await super()._start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/spec.py", line 304, in _start
    self.scheduler = await self.scheduler
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 162, in _
    await self.start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 284, in start
    raise RuntimeError("%s failed to start" % type(self).__name__)
RuntimeError: Scheduler failed to start

Brontomerus commented 3 years ago

In my experience, the scheduler failing to start comes down to one of two reasons, with the second being the more likely:

  1. The allocation of public IPs is not set and there is an issue with the network setup (i.e. the fargate_use_private_ip=True setting).
  2. The task is being launched in a private subnet with no access to the internet, and is unable to pull the container.

It seems obvious that you'd know if it were the second, but if you are using the default subnets without declaring which to use, there's a chance that it's using a subnet that doesn't have a NAT gateway or correct routing set up for the particular use case.
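One way to rule this out is to pass the networking pieces explicitly instead of letting dask-cloudprovider pick defaults. A sketch with hypothetical subnet and security-group IDs (substitute your own; ECSCluster documents these keyword arguments, but check your installed version's signature):

```python
# Hypothetical IDs -- replace with subnets that are either public with
# public IPs allocated, or private with a NAT gateway route; otherwise
# the Fargate task cannot pull its container image.
cluster_kwargs = {
    "cluster_arn": "arn:aws:ecs:us-west-2:xxxxxxxxxx:cluster/ejansen-cluster-4",
    "n_workers": 2,
    "worker_gpu": 1,
    "fargate_scheduler": True,
    "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
    "security_groups": ["sg-cccc3333"],
    # Set True only when the chosen subnets are private and route via a NAT.
    "fargate_use_private_ip": False,
}

# from dask_cloudprovider.aws import ECSCluster
# cluster = ECSCluster(**cluster_kwargs)
```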