astronomy-commons / aws-hub

A JupyterHub on Kubernetes gateway to EC2 instances on AWS

Speed up deployment #1

Open · connolly opened 4 years ago

connolly commented 4 years ago

Current deployment from login to Spark cluster is about 10 mins (5 mins to deploy the EC2 VMs and 5 mins to deploy the pod). The breakdown of timing from @stevenstetzler is:

Mainly it's the allocation and creation of EC2 virtual machines. The workflow is:

1. Request the resource in Kubernetes.
2. The Kubernetes scheduler tries to schedule the new pods (Jupyter notebook or Spark executor).
3. If more nodes are needed to accommodate the pods, the cluster autoscaler asks AWS for new nodes.
4. AWS creates N more virtual machines to accommodate the request.
5. Once the virtual machines are up, the pods get placed on them, Docker images get pulled, and the Docker containers start on those machines.
6. In either case (Jupyter or Spark), the pod that asked for the new pods to be created pings them to check that they are ready (in Spark this is when you see the new executors added in the job timeline, when the executor pod pings back "I'm alive").

How long each step takes (see the sketch below for one way to measure this on a real pod):

- (1) and (2) are almost instant, though I imagine they will get slower as the Kubernetes cluster is used more (probably not by much).
- (3) can take some time depending on the load on the cluster autoscaler. Sometimes up to a minute, but usually on the order of tens of seconds.
- (4) is the main bottleneck. Try creating a new EC2 virtual machine and see how fast it is: on the order of minutes.
- (5) depends on how large our images are, the remoteness of the Docker repository (Docker Hub vs. AWS ECR, for example; are the images on site or not), and the network speed of the nodes those containers are sitting on. On the order of tens of seconds to a minute.
- (6) can take a second or a while depending on what scripts run at container startup. Right now, on the order of seconds.
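One rough way to double-check this breakdown on a live pod is to pull its Kubernetes event timeline and look at the gaps between the autoscaler's `TriggeredScaleUp`, the scheduler's `Scheduled`, the kubelet's `Pulling`/`Pulled`, and `Started` events. The sketch below uses the Kubernetes Python client; the namespace and pod name are placeholders, not values from this repo.

```python
# Hypothetical sketch: print the event timeline for one notebook/executor pod,
# so the time spent in scale-up vs. scheduling vs. image pull vs. start is visible.
from datetime import datetime, timezone

from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() if running inside the cluster
v1 = client.CoreV1Api()

namespace = "jupyterhub"       # assumed hub namespace
pod_name = "jupyter-someuser"  # assumed user pod name

events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod_name}",
)

def when(e):
    # Events carry either first_timestamp or event_time depending on their source.
    return e.first_timestamp or e.event_time or datetime.min.replace(tzinfo=timezone.utc)

for e in sorted(events.items, key=when):
    print(when(e).isoformat(), e.reason, e.message)
```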

mjuric commented 4 years ago

(note: copying my comments from the Slack thread)

It's tough to speed this up without preallocating a few machines (i.e., set up the autoscaler to always keep a buffer of ~N free machines, immediately available when the next user(s) connect); one way to do that is sketched below. But that costs money.
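For the record, the usual way to keep such a buffer is the "placeholder pod" trick: run a low-priority Deployment whose pods request roughly a node's worth of resources, so the autoscaler keeps ~N warm nodes around, and real notebook/executor pods preempt the placeholders when they arrive. A rough sketch with the Kubernetes Python client follows; the priority value, resource requests, replica count, namespace, and names are all assumptions, not anything from our current config.

```python
# Hypothetical sketch of warm-node overprovisioning via low-priority placeholder pods.
from kubernetes import client, config

config.load_kube_config()

# A PriorityClass below the default (0) so placeholder pods are preempted first.
scheduling = client.SchedulingV1Api()
scheduling.create_priority_class(
    client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="node-placeholder"),
        value=-1,
        global_default=False,
        description="Reserves warm capacity; preempted by real user/executor pods.",
    )
)

# A Deployment of "pause" pods sized to roughly one node each; the autoscaler
# keeps nodes up to host them, giving new users near-instant capacity.
apps = client.AppsV1Api()
placeholder = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="warm-node-placeholder"),
    spec=client.V1DeploymentSpec(
        replicas=2,  # ~N warm nodes to keep around
        selector=client.V1LabelSelector(match_labels={"app": "placeholder"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "placeholder"}),
            spec=client.V1PodSpec(
                priority_class_name="node-placeholder",
                containers=[
                    client.V1Container(
                        name="pause",
                        image="k8s.gcr.io/pause:3.2",
                        resources=client.V1ResourceRequirements(
                            # Assumed sizing: request most of one worker node.
                            requests={"cpu": "3500m", "memory": "12Gi"}
                        ),
                    )
                ],
            ),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="jupyterhub", body=placeholder)
```

If I remember right, the Zero to JupyterHub helm chart exposes the same idea as user placeholder pods (`scheduling.userPlaceholder` in its values), which would cover the notebook side without managing a Deployment by hand; the Spark executors would still need their own buffer.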

A workaround is to warn the user up front that the cluster will take 10 minutes to spin up. They'll be less annoyed if they're aware of this (and will incorporate the delay into their workflow).

One thing to look into: Fargate -- https://aws.amazon.com/fargate/ -- this is supposed to let you run containers without specifying a server to run them on. I'm not sure whether that means the spin-up is faster. The thing to look at is how it interacts with EKS -- Amazon just announced the tie-in at re:Invent, but I haven't had a chance to read about it yet. It does look potentially promising: