dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure, and more...
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Add support for private-only Fargate clusters #14

Open jacobtomlinson opened 5 years ago

jacobtomlinson commented 5 years ago

Currently, when running in Fargate mode, the scheduler and workers must be issued a public IP address.

This is due to a limitation in AWS where Fargate is unable to pull Docker images without external networking. Having a public IP provides this networking.

An alternative would be to have a NAT gateway in the VPC subnets used by Fargate, which would handle the outbound internet traffic. Some users may already have one if they are using an existing VPC.
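For anyone setting this up by hand, here is a minimal boto3 sketch of the NAT gateway wiring. The subnet and route table IDs are placeholders, and it assumes the VPC already has a public subnet with an internet gateway:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs; substitute your own. The NAT gateway itself must live in
# a *public* subnet, while the route is added to the *private* subnet's route
# table so Fargate tasks there can reach the internet (e.g. to pull Docker
# images) without being issued a public IP.
PUBLIC_SUBNET = "subnet-aaaaaaaa"
PRIVATE_ROUTE_TABLE = "rtb-bbbbbbbb"

eip = ec2.allocate_address(Domain="vpc")
nat = ec2.create_nat_gateway(
    SubnetId=PUBLIC_SUBNET, AllocationId=eip["AllocationId"]
)
nat_id = nat["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Send all outbound traffic from the private subnet through the NAT gateway.
ec2.create_route(
    RouteTableId=PRIVATE_ROUTE_TABLE,
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_id,
)
```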

This needs some thought.

mshvartsman commented 3 years ago

This is how I'm trying to get set up right now: an existing VPC, a custom security group, and the private-IP flag.
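Roughly the invocation I'm using, with placeholder IDs. This assumes the `dask_cloudprovider.aws` import path from recent releases and that `fargate_use_private_ip` is the right flag for skipping the public IP:

```python
from dask_cloudprovider.aws import FargateCluster

# Placeholder IDs; assumes a VPC whose private subnets already route
# outbound traffic through a NAT gateway (see above).
cluster = FargateCluster(
    vpc="vpc-xxxxxxxx",
    subnets=["subnet-xxxxxxxx"],
    security_groups=["sg-xxxxxxxx"],
    fargate_use_private_ip=True,
)
```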

This works for a few minutes, but then the whole cluster goes down. What I'm seeing in the scheduler logs is a flood of messages like `Unexpected worker completed task, likely due to work stealing`, and that's it, no obvious crash messages. Workers exit with signal 6 or 15 and no complete traceback (sometimes a partial traceback includes frames in threading.py, but nothing informative).

If I spin up the same cluster without the security group, private IP, and VPC flags, it works fine. Any advice on where to look next?

jacobtomlinson commented 3 years ago

Thanks for this @mshvartsman.

Some information on the work you are trying to run on the cluster would be helpful. Does this happen for any tasks submitted to the cluster?

Also, does CloudWatch Logs provide any further info in the worker logs?
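If it helps, something along these lines should pull recent events with boto3. The log group name here is a placeholder; use whatever `cloudwatch_logs_group` your cluster was configured with:

```python
import boto3

logs = boto3.client("logs")

# Skim the most recent log events across all streams in the group.
resp = logs.filter_log_events(
    logGroupName="dask-ecs",  # placeholder; check your cluster config
    limit=100,
)
for event in resp["events"]:
    print(event["message"])
```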

mshvartsman commented 3 years ago

Sorry for not responding to this sooner (didn't get notified for some reason). My current bet is on user error here: there seems to be a memory leak in my code that eventually gets jobs killed. The time it took was variable, so I simply hadn't waited long enough for the open (public-IP) cluster to crash, which it still did eventually. Once I track the memory leak down I'll re-check whether the private setup works, in which case I can probably PR some docs on this if you'd like, and we can close the issue.
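For what it's worth, this is roughly how I'm watching for the leak; it assumes `psutil` is available in the worker image:

```python
import psutil
from dask.distributed import Client

client = Client(cluster)  # the FargateCluster from above

def rss_mb():
    # Resident memory of the current worker process, in MB.
    return psutil.Process().memory_info().rss / 1e6

# Runs on every worker and returns {worker_address: MB, ...};
# polling this over time makes a leak easy to spot.
print(client.run(rss_mb))
```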