dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure and more...
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Add create cluster support for ECSCluster on EC2 #23

Open kylejn27 opened 5 years ago

kylejn27 commented 5 years ago

Wanted to open this up for discussion: how much should ECSCluster take care of for the user when it comes to creating the underlying ECS cluster?

There are generally two ways of going about this:

  • The user creates the ECS cluster themselves (outside this library) and passes it to ECSCluster.
  • ECSCluster creates the ECS cluster on the user's behalf.

For the former, nothing needs to be done as ECSCluster already handles this. The latter requires discussion about how opinionated ECSCluster should be in how it creates the cluster, since there are a few different ways of spinning up resources in AWS (boto, Terraform, CloudFormation, ... to name a few).
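For context, a minimal sketch of what the two modes look like from the user's side. The import path and keyword arguments reflect the ECSCluster/FargateCluster API as I understand it (it may differ across versions), and the ARN is a placeholder:

```python
from dask_cloudprovider import ECSCluster, FargateCluster

# Former: the user has already created an ECS cluster and passes it in.
# The ARN below is a placeholder.
cluster = ECSCluster(
    cluster_arn="arn:aws:ecs:us-east-1:123456789012:cluster/my-existing-cluster"
)

# Latter: the cluster manager creates everything itself. FargateCluster is
# ECSCluster with Fargate mode enabled by default, so compute is billed per
# second and goes away when the tasks exit.
cluster = FargateCluster(n_workers=2)
```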

Ideas that have been presented already:

jacobtomlinson commented 5 years ago

Here are a bunch of thoughts on the design of this module. Ideally I'd like to iterate on this a bit and then add a section to the docs about it. Input would be much appreciated.

High level goals

Here are some general goals that this library sets out to meet:

AWS cluster managers

I feel we have achieved these goals with FargateCluster for AWS. This is a subclass of ECSCluster which sets fargate mode to true by default.

When you create any ECSCluster with no arguments it creates supporting resources including security groups, IAM roles and log groups for you and uses the default VPC and subnets. I've drawn the line here as none of these resources cost money. Then when you run in Fargate mode you pay per second for the compute resources you use and they clean themselves up. You can also specify the supporting resources yourself if you've created them already. Therefore it meets the principles above.
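As a concrete, hedged illustration of the two extremes described above; the keyword names follow the ECSCluster signature as I recall it, and all ARNs/IDs are placeholders:

```python
from dask_cloudprovider import FargateCluster

# Zero-argument case: security groups, IAM roles and log groups are created
# for you, and the default VPC and subnets are used.
cluster = FargateCluster()

# Bring-your-own supporting resources instead of letting the cluster manager
# create them (all ARNs/IDs below are placeholders).
cluster = FargateCluster(
    execution_role_arn="arn:aws:iam::123456789012:role/dask-execution",
    task_role_arn="arn:aws:iam::123456789012:role/dask-task",
    vpc="vpc-0123456789abcdef0",
    subnets=["subnet-0123456789abcdef0"],
    security_groups=["sg-0123456789abcdef0"],
    cloudwatch_logs_group="my-dask-logs",
)
```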

However when running in EC2 mode I cannot see an easy way to automatically clean up instances and therefore you could end up accidentally leaving things running and costing money. This makes me hesitant to create EC2 resources in the first place and currently when running in EC2 mode you are required to provide your own ECS cluster. There are probably ways around this with autoscaling groups that scale to zero.

I'm also conscious that creating resources with boto3 may not be the best practice and using IaC tools like CloudFormation or Terraform would be more appropriate. It's difficult to know where to draw this line.

I'm tempted to say that because we have met all our principles for AWS with FargateCluster we can relax a little for other AWS offerings. If users want a more bespoke cluster with opinionated configurations we should facilitate that but push some work back on to them. For example perhaps we need to include a section in the docs on how to bootstrap a cluster via different means which could then be handed off to ECSCluster.

Alternatively we could create a sensible ECS cluster in its own VPC and subnets. My only concern here (aside from scaling to zero) is that in order to allow people to customize their cluster we may have to provide a lot of additional configuration. It may also mean these clusters could be quite brittle. For example if we allow people to specify their own user data, but require that certain things exist in the user data, it could be very easy for this to be misconfigured.
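To make the user-data concern concrete, here is a rough sketch (names and IDs are placeholders, and none of this is existing library behaviour) of the kind of launch template an EC2-backed ECS cluster needs. If a user overrides the user data and drops the ECS_CLUSTER line, their instances silently register with the default cluster instead:

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# Container instances only join the right ECS cluster if this line ends up
# in /etc/ecs/ecs.config, which is exactly the kind of invariant that is
# easy to break when users supply their own user data.
user_data = "#!/bin/bash\necho ECS_CLUSTER=dask-ecs-cluster >> /etc/ecs/ecs.config\n"

ec2.create_launch_template(
    LaunchTemplateName="dask-ecs-workers",           # placeholder name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",          # ECS-optimised AMI (placeholder)
        "InstanceType": "m5.large",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
```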

kylejn27 commented 5 years ago

I agree with what you've said; the main opinions you've asserted make sense, especially the original "what problem are we solving" part. Those assertions certainly shed some light on why the library is designed the way it is today.

Here are my thoughts:

When you create any ECSCluster with no arguments it creates supporting resources including security groups, IAM roles and log groups for you and uses the default VPC and subnets. I've drawn the line here as none of these resources cost money. Then when you run in Fargate mode you pay per second for the compute resources you use and they clean themselves up. You can also specify the supporting resources yourself if you've created them already. Therefore it meets the principles above.

Completely agree; as it stands the library meets all of the principles you've laid out. Perhaps, as you suggest below, pushing the work back onto the users and leaving it at that is best. This is completely reasonable.


I'm also conscious that creating resources with boto3 may not be the best practice and using IaC tools like CloudFormation or Terraform would be more appropriate. ... Alternatively we could create a sensible ECS cluster in its own VPC and subnets. My only concern here (aside from scaling to zero) is that in order to allow people to customize their cluster we may have to provide a lot of additional configuration.

Maybe the right move is to move the entire library towards provisioning resources with CloudFormation. The library could ship default templates for both EC2 and Fargate ECS deployments and use the boto CloudFormation client only to manage those stacks. With the defaults managed in CloudFormation, that also opens up the possibility of customizing the stack: the library could let the user specify their own CloudFormation template to be used instead of the default. This way the library provides a reasonable default for both Fargate and EC2 deployments, the two methods of operating ECS, allowing greater flexibility while still keeping the default a basic starter cluster.
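A minimal sketch of that flow, assuming the default template ships as a file inside the package (the file path, stack name and capability value are illustrative, not existing dask-cloudprovider behaviour):

```python
import boto3

cfn = boto3.client("cloudformation")

def create_dask_stack(stack_name="dask-ecs",
                      template_path="templates/ecs-ec2.yaml",
                      custom_template_body=None):
    # Use the packaged default template unless the user supplied their own.
    if custom_template_body is None:
        with open(template_path) as f:
            template_body = f.read()
    else:
        template_body = custom_template_body

    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed if the template creates IAM roles
    )
    # Block until the stack is up so the cluster manager can read its outputs.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return cfn.describe_stacks(StackName=stack_name)["Stacks"][0].get("Outputs", [])
```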


For example perhaps we need to include a section in the docs on how to bootstrap a cluster via different means which could then be handed off to ECSCluster.

Also totally reasonable; maybe even a simple starter script would help.


However when running in EC2 mode I cannot see an easy way to automatically clean up instances and therefore you could end up accidentally leaving things running and costing money. This makes me hesitant to create EC2 resources in the first place and currently when running in EC2 mode you are required to provide your own ECS cluster. There are probably ways around this with autoscaling groups that scale to zero.

We should be able to keep a reference to the Auto Scaling group name/ARN (whatever identifies an ASG) and call a delete method on it when the ECSCluster object is destroyed. I think the assumption should be that the user is aware of all resources getting created and must take responsibility for verifying that all resources have spun down accordingly.

Maybe a better idea would be an explicit cluster.close() or cluster.destroy() method that a user can run to destroy all created resources. This may be better than hooks that tear resources down when the ECSCluster object is garbage-collected, since it can be a little ambiguous when those actually run.
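A hedged sketch of what that explicit teardown could look like, assuming the cluster manager recorded the ASG name when it created the group (the class and method names are hypothetical, not part of the existing ECSCluster API):

```python
import boto3

class ManagedEC2Backend:
    """Hypothetical helper that owns the EC2-backed resources it created."""

    def __init__(self, asg_name):
        self._asg_name = asg_name          # recorded at creation time
        self._autoscaling = boto3.client("autoscaling")

    def destroy(self):
        # Explicit teardown: delete the group rather than relying on
        # __del__, which may never run.
        self._autoscaling.delete_auto_scaling_group(
            AutoScalingGroupName=self._asg_name,
            ForceDelete=True,              # terminate any remaining instances
        )
```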

jacobtomlinson commented 5 years ago

Maybe the right move is to move the entire library towards provisioning resources with CloudFormation.

This is interesting and I hadn't put a huge amount of thought into it. A few questions:

  • Would you imagine storing some templates as yaml/json and using something like jinja to populate them?
  • As well as giving the option to override the template altogether?
  • Would it be one template with optional sections or many templates which are concatenated?

Also totally reasonable, maybe even a simple starter script may help.

Yep that's a good idea. Is this something you could help with?

We should be able to keep a reference to the Auto Scaling group name/ARN (whatever identifies an ASG) and call a delete method on it when the ECSCluster object is destroyed.

The trouble with this is that there are loads of scenarios where destruction of the ECSCluster object never happens. This could be a power outage, the OOM killer, a hard reset, etc.

I think the assumption should be that the user is aware of all resources getting created and must take responsibility for verifying that all resources have spun down accordingly.

I'm not sure I want to put this burden on to novice users. Many folks who will be using this library will not be familiar enough with the cloud provider they are using to understand what resources are being created for them.

Maybe a better idea would be an explicit cluster.close() or cluster.destroy() method that a user can run to destroy all created resources.

Yes we should. This is likely covered in #9.

kylejn27 commented 5 years ago

Maybe the right move is to move the entire library towards provisioning resources with CloudFormation.

This is interesting and I hadn't put a huge amount of thought into it. A few questions:

  • Would you imagine storing some templates as yaml/json and using something like jinja to populate them?
  • As well as giving the option to override the template altogether?
  • Would it be one template with optional sections or many templates which are concatenated?

Jinja is probably fine, though maybe using the standard cloudformation parameters would work better.

My initial thought would be to provide the ability to completely override the default template with a custom one for custom/complex clusters. There would be one template for Fargate and one for EC2-based ECS (maybe they share a separate common template for things like security groups, if that's possible). Parameters with defaults would allow the user to specify things like security groups, subnets, etc. to customize the default template a little bit.
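As a rough illustration of the parameters-with-defaults idea, here is a heavily trimmed, hypothetical template fragment expressed as a Python dict (resource and parameter names are made up; a real template would likely use AWS-specific list parameter types):

```python
import json

# Trimmed-down CloudFormation template: the parameter defaults give a working
# starter cluster, while users can override values at stack-creation time.
DEFAULT_ECS_TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "SubnetIds": {"Type": "String", "Default": ""},
        "SecurityGroupIds": {"Type": "String", "Default": ""},
        "ClusterName": {"Type": "String", "Default": "dask-ecs"},
    },
    "Resources": {
        "Cluster": {
            "Type": "AWS::ECS::Cluster",
            "Properties": {"ClusterName": {"Ref": "ClusterName"}},
        },
        # ... ASG, launch template, IAM roles and log group would go here ...
    },
    "Outputs": {
        "ClusterArn": {"Value": {"Fn::GetAtt": ["Cluster", "Arn"]}},
    },
}

template_body = json.dumps(DEFAULT_ECS_TEMPLATE)
```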

We should be able to keep a reference to the Auto Scaling group name/ARN (whatever identifies an ASG) and call a delete method on it when the ECSCluster object is destroyed.

The trouble with this is that there are loads of scenarios where destruction of the ECSCluster object never happens. This could be a power outage, the OOM killer, a hard reset, etc.

Yep, I've run into this with my implementation. Just had to clean up 10 ASGs that didn't get destroyed. I wasn't sure if that was because of how I did it or for other reasons; it seems like this response answers those questions. Explicitly killing the cluster or managing it through CloudFormation, as discussed elsewhere, is probably the best way to handle this.

I think the assumption should be that the user is aware of all resources getting created and must take responsibility for verifying that all resources have spun down accordingly.

I'm not sure I want to put this burden on to novice users. Many folks who will be using this library will not be familiar enough with the cloud provider they are using to understand what resources are being created for them.

Cool. You might want to add to the high-level goals that this is meant to allow a novice user to spin up resources in the cloud without much trouble.

Maybe a better idea would be an explicit cluster.close() or cluster.destroy() method that a user can run to destroy all created resources.

Yes we should. This is likely covered in #9.

cool, I'll write a comment there to split this off into another discussion

jacobtomlinson commented 5 years ago

Jinja is probably fine, though maybe using the standard cloudformation parameters would work better.

Ah yeah ok! I'm not as familiar with CloudFormation as I am with Terraform.

How would things work in a case where some users may want to have a VPC created, some may want the default VPC to be picked up automatically and others may want to use an existing one that they specify? Is that kind of thing possible?

Explicitly killing the cluster or managing it through CloudFormation, as discussed elsewhere, is probably the best way to handle this.

Yes, but ideally the Dask cluster should scale down to zero if it is left idling for too long. This currently happens in FargateCluster, where schedulers and workers time out, which causes the tasks to exit and costs to stop. In an EC2 environment we would also need the ASG to scale to zero.
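For the EC2 case that would roughly mean an extra call like the sketch below once the idle timeout fires (the ASG name is a placeholder; nothing here is existing dask-cloudprovider code):

```python
import boto3

autoscaling = boto3.client("autoscaling")

def scale_asg_to_zero(asg_name):
    # The Dask tasks exiting stops the container workloads, but the EC2
    # instances keep costing money until the group itself is told to run
    # zero instances.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=0,
        DesiredCapacity=0,
    )
```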

Cool. You might want to add to the high-level goals that this is meant to allow a novice user to spin up resources in the cloud without much trouble.

This is what I meant by "This library should provide a 'zero-to-dask' experience on all cloud providers": zero referring both to knowledge and to existing infrastructure.

kylejn27 commented 5 years ago

Ah yeah ok! I'm not as familiar with CloudFormation as I am with Terraform.

How would things work in a case where some users may want to have a VPC created, some may want the default VPC to be picked up automatically and others may want to use an existing one that they specify? Is that kind of thing possible?

I'm not sure about this. Maybe there's a fun way to accomplish it with a combination of Python and CloudFormation templates.
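One thing CloudFormation itself can express is "create a VPC only if the user did not supply one", via a template condition on an optional parameter; picking up the account's default VPC automatically would still need a small boto3 lookup on the Python side, which fits the hybrid idea. A trimmed, hypothetical fragment:

```python
# Fragment of a hypothetical template showing conditional VPC creation.
VPC_FRAGMENT = {
    "Parameters": {
        "ExistingVpcId": {"Type": "String", "Default": ""},
    },
    "Conditions": {
        # True when the user did not supply a VPC, so the template creates one.
        "CreateVpc": {"Fn::Equals": [{"Ref": "ExistingVpcId"}, ""]},
    },
    "Resources": {
        "DaskVpc": {
            "Type": "AWS::EC2::VPC",
            "Condition": "CreateVpc",
            "Properties": {"CidrBlock": "10.0.0.0/16"},
        },
    },
    "Outputs": {
        # Resolve to whichever VPC should actually be used downstream.
        "VpcId": {
            "Value": {"Fn::If": ["CreateVpc", {"Ref": "DaskVpc"}, {"Ref": "ExistingVpcId"}]}
        },
    },
}
```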

Yes, but ideally the Dask cluster should scale down to zero if it is left idling for too long. This currently happens in FargateCluster, where schedulers and workers time out, which causes the tasks to exit and costs to stop. In an EC2 environment we would also need the ASG to scale to zero.

Hmm, so maybe a hybrid of boto and CloudFormation then: boto to manage the tasks and set the ASG to 0, and CloudFormation to manage creation and destruction of resources? Probably some iteration needs to happen here.
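Very roughly, that split of responsibilities could look like the sketch below (all class, stack and group names are hypothetical):

```python
import boto3

class HybridECSBackend:
    """Hypothetical split: CloudFormation owns resource lifecycle,
    the Auto Scaling API handles fast scale up/down of workers."""

    def __init__(self, stack_name, asg_name):
        self.stack_name = stack_name
        self.asg_name = asg_name
        self._cfn = boto3.client("cloudformation")
        self._asg = boto3.client("autoscaling")

    def scale(self, n_instances):
        # Fast path used while the cluster is running.
        self._asg.update_auto_scaling_group(
            AutoScalingGroupName=self.asg_name,
            MinSize=0,
            DesiredCapacity=n_instances,
        )

    def close(self):
        # Slow path: tear down everything the stack created.
        self._cfn.delete_stack(StackName=self.stack_name)
        self._cfn.get_waiter("stack_delete_complete").wait(StackName=self.stack_name)
```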

This is what I meant by "This library should provide a 'zero-to-dask' experience on all cloud providers": zero referring both to knowledge and to existing infrastructure.

Ah of course, that makes sense

jacobtomlinson commented 5 years ago

Hmm, so maybe a hybrid of boto and CloudFormation then: boto to manage the tasks and set the ASG to 0, and CloudFormation to manage creation and destruction of resources? Probably some iteration needs to happen here.

Yeah this sounds like an interesting approach. I'll start exploring how we could do this.

H4dr1en commented 4 years ago

Any update? This looked very promising!

kylejn27 commented 4 years ago

I ended up writing custom Terraform scripts; that worked better with what dask-cloudprovider expects. I'll look into whether I can generalize them and release them to a personal repository.

H4dr1en commented 4 years ago

Ok, so the right way for the moment would be to use the FargateCluster?

kylejn27 commented 4 years ago

Yes. I had some fairly strict requirements that prevented me from using Fargate; if I didn't have those requirements, Fargate would've been my first choice.

jacobtomlinson commented 4 years ago

Ok, so the right way for the moment would be to use the FargateCluster?

You also have the option to create your own cluster as per the documentation. But it is a manual step for now.