Status: Open. eddojansen opened this issue 3 years ago.
Thanks for raising this @eddojansen.
This happens when the container fails to start. When ECS creates the container it is in a PENDING or PROVISIONING state, then it should move to a RUNNING state. This error is raised when it moves to something else (ERROR for example).
The best course of action here is to look at the task in the AWS dashboard to see what went wrong. It would be interesting if you could share that here.
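For anyone who prefers to check this from Python rather than the dashboard, here is a minimal sketch using boto3. The helper names (`summarize_stopped_tasks`, `fetch_stopped_tasks`) are made up for illustration and are not part of dask-cloudprovider. It assumes boto3 is installed and credentials are configured. Stopped tasks are only retained by ECS for a limited time and may not show up in the default (running) task view, so it helps to run this soon after the failure.

```python
def summarize_stopped_tasks(describe_tasks_response):
    """Extract the fields that usually explain a failed start from a
    describe_tasks response (a plain dict)."""
    summaries = []
    for task in describe_tasks_response.get("tasks", []):
        summaries.append(
            {
                "task_arn": task.get("taskArn"),
                "last_status": task.get("lastStatus"),
                "stopped_reason": task.get("stoppedReason"),
                # Per-container reasons are only present when ECS recorded one.
                "container_reasons": [
                    c["reason"] for c in task.get("containers", []) if c.get("reason")
                ],
            }
        )
    return summaries


def fetch_stopped_tasks(cluster_arn, region_name):
    """Hypothetical helper: list stopped tasks in a cluster and summarize them."""
    import boto3  # imported lazily so the summary helper works without boto3

    ecs = boto3.client("ecs", region_name=region_name)
    task_arns = ecs.list_tasks(cluster=cluster_arn, desiredStatus="STOPPED")["taskArns"]
    if not task_arns:
        return []
    response = ecs.describe_tasks(cluster=cluster_arn, tasks=task_arns)
    return summarize_stopped_tasks(response)
```

Calling `fetch_stopped_tasks(cluster_arn, "us-west-2")` with your own cluster ARN should return one entry per stopped task. The `stopped_reason` value is usually the most useful thing to paste into the issue.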
We should also improve this error message for sure.
@jacobtomlinson
After double-checking that the cluster and instances are up and green, I'm still experiencing the same issue. I verified that my AWS API access works by running aws ecs list-clusters:
{
    "clusterArns": [
        "arn:aws:ecs:us-west-2:xxxxxxxxxxx:cluster/ejansen-cluster-4"
    ]
}
The task list in the ECS cluster is empty. Any ideas or suggestions?
Thanks, Eddo
>>> cluster = ECSCluster(
...     cluster_arn="arn:aws:ecs:us-west-2:xxxxxxxxxxx:cluster/ejansen-cluster-4",
...     n_workers=2,
...     worker_gpu=1,
...     fargate_scheduler=True
... )
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 727, in __init__
    super().__init__(**kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/spec.py", line 276, in __init__
    self.sync(self._start)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/cluster.py", line 183, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 924, in _start
    await super()._start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/distributed/deploy/spec.py", line 304, in _start
    self.scheduler = await self.scheduler
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 162, in _
    await self.start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/dask_cloudprovider/aws/ecs.py", line 284, in start
    raise RuntimeError("%s failed to start" % type(self).__name__)
RuntimeError: Scheduler failed to start
In my experience, the scheduler failing to start comes down to one of two causes, with the second being the more likely of the two:
It seems obvious that you'd know if it were the second, but if you are using the default subnets without declaring which ones to use, there's a chance that it's using a subnet that doesn't have a NAT gateway or the correct routing set up for this use case.
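One way to check the routing point is to look at the route table associated with each candidate subnet and see whether it has a default route through a NAT or internet gateway. The sketch below uses boto3's `describe_route_tables`. The helper names are hypothetical, and it assumes that a `0.0.0.0/0` route targeting a `nat-` or `igw-` ID is what the task needs for outbound access, which may not cover every network setup.

```python
def default_route_targets(route_table):
    """Return the targets of any 0.0.0.0/0 routes in one route table dict."""
    targets = []
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        # NAT gateways appear under NatGatewayId, internet gateways under GatewayId.
        target = route.get("NatGatewayId") or route.get("GatewayId")
        if target:
            targets.append(target)
    return targets


def has_outbound_route(route_table):
    """True if the table has a default route via a NAT or internet gateway."""
    return any(
        target.startswith(("nat-", "igw-"))
        for target in default_route_targets(route_table)
    )


def route_tables_for_subnet(subnet_id, region_name):
    """Hypothetical helper: fetch route tables explicitly associated with a subnet."""
    import boto3  # imported lazily so the pure helpers work without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    # An empty list means the subnet has no explicit association and
    # falls back to the VPC's main route table.
    return response["RouteTables"]
```

If a subnet turns out to have no outbound route, passing a known-good subnet to the cluster explicitly is probably the simplest thing to try next.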
What happened: I'm trying to test NVIDIA RAPIDS functionality using Dask following this guide: https://rapids.ai/cloud#AWS-EC2
The first issue when following the guide is that ECSCluster has to be imported from dask_cloudprovider.aws instead of dask_cloudprovider. The second issue is that the cluster never finishes starting and fails with the above error.
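For the import path difference, a small fallback like the one below can keep a script working across package versions. It is only a sketch: `first_importable` and `load_ecs_cluster` are made-up helper names, and it assumes the class lives in either `dask_cloudprovider.aws` (as in the traceback above) or the top-level `dask_cloudprovider` package (as the guide suggests).

```python
import importlib


def first_importable(module_names, attribute):
    """Return `attribute` from the first module in `module_names` that provides it."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
            return getattr(module, attribute)
        except (ImportError, AttributeError):
            # Module missing, or it does not expose the attribute: try the next one.
            continue
    raise ImportError(
        "could not import %r from any of %r" % (attribute, list(module_names))
    )


def load_ecs_cluster():
    """Try the newer import location first, then the one used in the guide."""
    return first_importable(
        ["dask_cloudprovider.aws", "dask_cloudprovider"], "ECSCluster"
    )
```

With this in place, `ECSCluster = load_ecs_cluster()` replaces the direct import line in the guide.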
What you expected to happen: I was expecting the cluster to be imported.
Minimal Complete Verifiable Example: Follow the guide https://rapids.ai/cloud#AWS-EC2
Anything else we need to know?:
Environment:
Thanks! Eddo