dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

Start a Dask cluster on an EMR cluster programmatically #121

Open jennakwon06 opened 4 years ago

jennakwon06 commented 4 years ago

Hello,

We want to programmatically spin up an EMR cluster and then start a Dask cluster on it using the YarnCluster constructor.

Currently, we open an SSH tunnel to the master node of the EMR cluster (it is in a private subnet), log on to the master node, and create an .ipynb notebook containing the "YarnCluster(..)" code. We then execute that cell to spin up the Dask cluster.

It would be nice to automate this, e.g. run a few commands that spin up an EMR cluster with a Dask cluster already running on it.

Thanks!

jennakwon06 commented 4 years ago

Or something like setting up the Dask cluster as part of the EMR bootstrap; that would be useful too.

quasiben commented 4 years ago

This seems like a useful feature, though I'm not sure it belongs in dask-yarn. Glancing quickly at boto, it seems there is support for launching EMR. In fact, I found a blog post on it: https://medium.com/@kulasangar/create-an-emr-cluster-and-submit-a-job-using-boto3-c34134ef68a0. Perhaps someone has time to experiment with connecting boto3 and dask-yarn together?
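
As a starting point for that experiment, here is a rough sketch of the boto3 side: it builds the keyword arguments for `run_job_flow` with a bootstrap action attached. The function names, instance type, release label, region, and the bootstrap script path are all assumptions for illustration, and the bootstrap script (which would install dask-yarn on each node) is hypothetical and not shown.

```python
def build_emr_request(name, subnet_id, bootstrap_script, worker_count=2,
                      instance_type="m5.xlarge", release_label="emr-5.29.0"):
    """Build keyword arguments for boto3's EMR ``run_job_flow`` call.

    ``bootstrap_script`` is an S3 URI of a hypothetical shell script that
    installs dask-yarn and its dependencies on every node.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # assumed release, adjust as needed
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": worker_count},
            ],
            "Ec2SubnetId": subnet_id,
            # Keep the cluster alive so a Dask cluster can run on it afterwards.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {"Name": "install-dask-yarn",
             "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []}},
        ],
        # Default EMR role names; an account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_emr_cluster(request, region_name="us-east-1"):
    """Submit the request and return the new cluster (job flow) id."""
    import boto3  # imported lazily so the request builder works without AWS

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Keeping the request builder separate from the AWS call makes it easy to inspect or unit-test the settings before anything is launched.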

jennakwon06 commented 4 years ago

So yes, we are programmatically launching the EMR cluster with the boto EMR API.

But the manual step is this: when the EMR cluster is done launching (takes ~5 minutes), we log on to the master node of the EMR cluster and run a Jupyter notebook with the cell "cluster = YarnCluster(...)".
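
One way this manual step might be automated is to wait for the cluster with a boto3 waiter and then submit an EMR step that runs a script on the master node. This is only a sketch: the helper names and the script location are hypothetical, and the script itself (which would contain the `YarnCluster(...)` call and would need to keep running for the Dask cluster to stay up) is assumed to already exist in S3.

```python
def build_dask_step(script_s3_uri, local_path="/home/hadoop/start_dask.py"):
    """Build an EMR step that copies a script to the master node and runs it.

    The script is hypothetical; it is assumed to create the YarnCluster.
    """
    command = f"aws s3 cp {script_s3_uri} {local_path} && python3 {local_path}"
    return {
        "Name": "start-dask-yarn-cluster",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run an arbitrary command.
            "Jar": "command-runner.jar",
            "Args": ["bash", "-c", command],
        },
    }


def start_dask_when_ready(emr_client, cluster_id, step):
    """Block until the EMR cluster is up, then submit the step.

    ``emr_client`` is expected to be ``boto3.client("emr")``.
    Returns the id of the submitted step.
    """
    emr_client.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

Submitting the step only queues the script; the Dask scheduler becomes reachable some time after the step starts running.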

We then do "Client("ip-node-of-emr-master-node")" to connect to the YarnCluster from somewhere other than the EMR master node, such as a Jupyter notebook on a SageMaker notebook instance.
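
For reference, the remote connection could look roughly like the sketch below. It assumes the scheduler was bound to a known port on the master node; 8786 is used here only because it is Dask's conventional scheduler port, and the helper names and host name are made up for illustration.

```python
def scheduler_address(master_host, port=8786):
    """Return the TCP address of a Dask scheduler on the EMR master node.

    The port is an assumption; use whatever port the scheduler listens on.
    """
    return f"tcp://{master_host}:{port}"


def connect(master_host, port=8786):
    """Connect a Dask client from outside the EMR master node."""
    from dask.distributed import Client  # lazy import: needs dask installed

    return Client(scheduler_address(master_host, port))
```

From the notebook instance this would be used as `client = connect("ip-node-of-emr-master-node")`, provided the security groups allow traffic to that port.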

So the ideal is that, from my SageMaker notebook instance, I can make one call like "spin-up-dask-cluster-on-emr-cluster(dask_cluster_settings, emr_cluster_settings)".
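
To make the request concrete, such a call might have roughly the following shape. Everything here is hypothetical: the two settings classes, the function name (written with underscores so it is valid Python), and the extra `emr_client` argument, which is expected to be `boto3.client("emr")`. It is not an existing dask-yarn API.

```python
from dataclasses import dataclass


@dataclass
class EmrClusterSettings:
    request: dict  # keyword arguments for boto3's run_job_flow


@dataclass
class DaskClusterSettings:
    step: dict                  # EMR step that starts the YarnCluster
    scheduler_port: int = 8786  # assumed fixed scheduler port


def spin_up_dask_cluster_on_emr_cluster(dask_cluster_settings,
                                        emr_cluster_settings,
                                        emr_client):
    """Launch EMR, queue the Dask start-up step, return the scheduler address."""
    # 1. Launch the EMR cluster.
    response = emr_client.run_job_flow(**emr_cluster_settings.request)
    cluster_id = response["JobFlowId"]

    # 2. Wait until the cluster is up (a few minutes in practice).
    emr_client.get_waiter("cluster_running").wait(ClusterId=cluster_id)

    # 3. Queue the step that starts the Dask cluster on the master node.
    emr_client.add_job_flow_steps(
        JobFlowId=cluster_id, Steps=[dask_cluster_settings.step]
    )

    # 4. Look up the master node. In a private subnet this field holds
    #    the private DNS name.
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    master = cluster["MasterPublicDnsName"]
    return f"tcp://{master}:{dask_cluster_settings.scheduler_port}"
```

The returned address could then be passed to `Client(...)` from the notebook instance, keeping in mind that the scheduler may take a little while to come up after the step is queued.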