dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure, and more...
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Add support for private-only Fargate clusters #14

Open jacobtomlinson opened 5 years ago

jacobtomlinson commented 5 years ago

Currently, when running in Fargate mode, the scheduler and workers must be issued a public IP address.

This is due to a limitation in AWS where Fargate is unable to pull Docker images without external networking. Having a public IP provides this networking.

An alternative would be to have a NAT gateway in the VPC subnets used by Fargate, which would handle the outbound internet traffic. Some users may already have one if they are using an existing VPC.
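For anyone setting this up by hand, here is a minimal boto3 sketch of the NAT gateway wiring. The subnet and route table IDs are placeholders, and it assumes the VPC already has a public subnet with an internet gateway:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs; substitute your own. The NAT gateway itself must live in
# a *public* subnet, while the route is added to the *private* subnet's route
# table so Fargate tasks there can reach the internet (e.g. to pull Docker
# images) without being issued a public IP.
PUBLIC_SUBNET = "subnet-aaaaaaaa"
PRIVATE_ROUTE_TABLE = "rtb-bbbbbbbb"

eip = ec2.allocate_address(Domain="vpc")
nat = ec2.create_nat_gateway(
    SubnetId=PUBLIC_SUBNET, AllocationId=eip["AllocationId"]
)
nat_id = nat["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Send all outbound traffic from the private subnet through the NAT gateway.
ec2.create_route(
    RouteTableId=PRIVATE_ROUTE_TABLE,
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_id,
)
```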

This needs some thought.

mshvartsman commented 3 years ago

This is how I'm trying to get set up right now: an existing VPC, a custom security group, and the private-IP flag.
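Roughly the invocation I'm using, with placeholder IDs. This assumes the `dask_cloudprovider.aws` import path from recent releases and that `fargate_use_private_ip` is the right flag for skipping the public IP:

```python
from dask_cloudprovider.aws import FargateCluster

# Placeholder IDs; assumes a VPC whose private subnets already route
# outbound traffic through a NAT gateway (see above).
cluster = FargateCluster(
    vpc="vpc-xxxxxxxx",
    subnets=["subnet-xxxxxxxx"],
    security_groups=["sg-xxxxxxxx"],
    fargate_use_private_ip=True,
)
```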

This works for a few minutes, but then the whole cluster goes down. What I'm seeing in the scheduler logs is a flood of messages like `Unexpected worker completed task, likely due to work stealing`, and that's it, no obvious crash messages. Workers exit with signal 6 or 15 and no complete traceback (sometimes a partial traceback includes frames in threading.py, but nothing informative).

If I spin up the same cluster without the security group, private IP, and VPC flags, it works fine. Any advice on where to look next?

jacobtomlinson commented 3 years ago

Thanks for this @mshvartsman.

Some information on the work you are trying to run on the cluster would be helpful. Does this happen for any tasks submitted to the cluster?

Also, does CloudWatch Logs provide any further info in the worker logs?
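If it helps, something along these lines should pull recent events with boto3. The log group name here is a placeholder; use whatever `cloudwatch_logs_group` your cluster was configured with:

```python
import boto3

logs = boto3.client("logs")

# Skim the most recent log events across all streams in the group.
resp = logs.filter_log_events(
    logGroupName="dask-ecs",  # placeholder; check your cluster config
    limit=100,
)
for event in resp["events"]:
    print(event["message"])
```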

mshvartsman commented 3 years ago

Sorry for not responding to this sooner (didn't get notified for some reason). My current bet is on user error here: there seems to be a memory leak in my code that eventually gets jobs killed. The time it took was variable, so I simply hadn't waited long enough for the open (public-IP) cluster to crash, which it still did eventually. Once I track the memory leak down I'll re-check whether the private setup works, in which case I can probably PR some docs on this if you'd like, and we can close the issue.
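For what it's worth, this is roughly how I'm watching for the leak; it assumes `psutil` is available in the worker image:

```python
import psutil
from dask.distributed import Client

client = Client(cluster)  # the FargateCluster from above

def rss_mb():
    # Resident memory of the current worker process, in MB.
    return psutil.Process().memory_info().rss / 1e6

# Runs on every worker and returns {worker_address: MB, ...};
# polling this over time makes a leak easy to spot.
print(client.run(rss_mb))
```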