dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

Dask on Amazon EMR vs. ECS/Fargate #103

jennakwon06 closed this issue 4 years ago

jennakwon06 commented 4 years ago

Hi all,

I am deciding between two options for deploying Dask on AWS: Amazon EMR (with dask-yarn) and Amazon ECS + EC2 launch type.

Could you shed some light on the advantages of using Amazon EMR (with dask-yarn) over deploying the Dask scheduler and workers on Amazon ECS + EC2?

I understand that Amazon EMR can preconfigure many libraries (Pig, Hive, Spark, etc.) on its clusters, but I won't need those since I will only be using Dask.

Auto scaling of instances is available on both Amazon EMR and Amazon ECS + EC2 launch type.

Amazon EMR is more expensive than plain Amazon EC2, so I am trying to understand why I would use Amazon EMR to run Dask applications.

Thanks all!

Best, Jenna

jcrist commented 4 years ago

Good question! IMO the only reason to use EMR is if:

If you're just trying to run Dask on autoscaling cloud resources, I recommend you look at dask-kubernetes or dask-cloudprovider instead.
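For reference, here is a minimal sketch of how the two routes discussed so far (dask-yarn on EMR versus dask-cloudprovider on ECS/Fargate) might look from Python. The `make_cluster` and `main` helpers are hypothetical names made up for illustration, and `n_workers=4` is an arbitrary setting. The import path `dask_cloudprovider.aws` is assumed from recent releases; older releases exposed `FargateCluster` at the package top level. Imports are done lazily so that only the library for the chosen backend needs to be installed.

```python
def make_cluster(kind, **kwargs):
    """Return a Dask cluster object for the requested backend.

    Hypothetical helper for illustration only. Keyword arguments are
    passed straight through to the underlying cluster class.
    """
    if kind == "yarn":
        # dask-yarn: runs on an existing YARN cluster such as Amazon EMR.
        from dask_yarn import YarnCluster
        return YarnCluster(**kwargs)
    if kind == "fargate":
        # dask-cloudprovider: launches the scheduler and workers as
        # ECS/Fargate tasks. Import path assumed from recent releases.
        from dask_cloudprovider.aws import FargateCluster
        return FargateCluster(**kwargs)
    raise ValueError(f"unknown cluster kind: {kind!r}")


def main():
    # Requires AWS credentials and the relevant library to be installed.
    from dask.distributed import Client

    cluster = make_cluster("fargate", n_workers=4)  # arbitrary example size
    client = Client(cluster)
    print(client)
    client.close()
    cluster.close()
```

Whichever backend creates the cluster, the `Client` is used the same way afterwards, so application code should not need to change if the deployment choice does.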

jennakwon06 commented 4 years ago

Great, thanks!

How do those two solutions compare? On AWS, would it be the same as comparing AWS EKS vs. AWS ECS?

Which one is the most actively maintained / recommended by Dask? I read the following at https://docs.dask.org/en/latest/setup/cloud.html:

"Dask previously maintained libraries for deploying Dask on Amazon’s EC2 and Google GKE. Due to sporadic interest, and churn both within the Dask library and EC2 itself, these were not well maintained. They have since been deprecated in favor of the Kubernetes and Helm solution."

Is this still true? And if it is, is AWS EKS a better option?

jcrist commented 4 years ago

In general, most providers have a Kubernetes service (e.g. EKS) and a raw container service (e.g. ECS). IMO they're roughly the same for Dask-specific use, and the decision would come down to pricing and org policy (some orgs don't allow Kubernetes).

"Dask previously maintained libraries for deploying Dask on Amazon’s EC2 and Google GKE. Due to sporadic interest, and churn both within the Dask library and EC2 itself, these were not well maintained. They have since been deprecated in favor of the Kubernetes and Helm solution."

Backend-specific CLI solutions used to exist (like dask-ec2), but they were hard to maintain and have since lapsed. dask-cloudprovider is a more recent solution in this space and is currently maintained. It only supports AWS currently, but other backends are planned.

In general, either dask-kubernetes or dask-cloudprovider would solve your problem. dask-kubernetes has existed for longer and is likely less buggy, but dask-cloudprovider also works. I don't really have any advice on picking one over the other. For Dask use on most clouds, Kubernetes pricing will be the same as the container service (you're charged mainly for containers, and we don't create any load balancers), so it's mostly a matter of preference.

Does that help?

jennakwon06 commented 4 years ago

Yes that helps tremendously. Thank you so much!

Any insight on scaling limits? We would be working with a few TBs of data. Dask should be able to support that, and the choice of cluster solution shouldn't matter here (as long as it can scale), right?

jcrist commented 4 years ago

Yeah, scaling is really limited only by your application code. Cloud backends should have no trouble giving you that many resources, and our cluster managers will easily support that many workers. People have used dask with very large clusters just fine.
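To make the scaling point concrete, here is a rough sketch of how an upper bound on workers could be estimated and handed to adaptive scaling. Everything in it is an assumption for illustration: the `workers_needed` and `enable_adaptive` helpers are hypothetical, and the "keep about half of each worker's memory for data" rule is a placeholder heuristic, not a Dask recommendation. Dask can also process data larger than cluster memory in chunks, so this is only a ceiling for the fully-in-memory case. The `cluster.adapt(minimum=..., maximum=...)` call is the adaptive-scaling method that Dask cluster managers provide.

```python
import math


def workers_needed(data_bytes, worker_memory_bytes, usable_fraction=0.5):
    """Estimate how many workers would hold `data_bytes` in memory at once.

    `usable_fraction` is an assumed share of each worker's memory that is
    available for data (the rest is left for intermediates and overhead).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable = worker_memory_bytes * usable_fraction
    return math.ceil(data_bytes / usable)


def enable_adaptive(cluster, data_bytes, worker_memory_bytes):
    """Turn on adaptive scaling, capped at the estimated worker count."""
    upper = workers_needed(data_bytes, worker_memory_bytes)
    # The cluster scales between 1 worker and the estimated ceiling
    # depending on the current workload.
    cluster.adapt(minimum=1, maximum=upper)
    return upper
```

For example, `workers_needed(3 * 10**12, 16 * 2**30)` gives the estimate for 3 TB of data on workers with 16 GiB of memory each under the assumed 50% rule.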

jennakwon06 commented 4 years ago

Sounds good. Thank you for all the help!