apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Performance analysis on job startup time #113

Open ash211 opened 7 years ago

ash211 commented 7 years ago

Startup time is very important for @justinuang on an internal application. Our target time: 10sec between spark-submit and job running in k8s.

Current analysis:

To get better numbers, I'd want to turn on millisecond-level logging (the default only logs to the second).
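A minimal sketch of that logging change, assuming the stock log4j 1.x console appender from Spark's `conf/log4j.properties.template` (the appender name `console` and the rest of the pattern come from that template; adjust if your configuration differs). Adding `.SSS` to the date pattern should give millisecond timestamps:

```properties
# Assumed starting point: Spark's default console appender pattern,
#   %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Appending .SSS to the date format adds milliseconds to each log line.
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %p %c{1}: %m%n
```

The same pattern would need to apply on both the submitter side and the driver pod for timestamps to be comparable across the two.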

We think the bolded line (the time between the rest server JVM being ready and the submitter submitting the app jar) offers the most room for improvement. Fully eliminating that would get us to a 15sec job startup time.

The next place to pursue improvements after that might be merging the rest server JVM and the driver JVM on the driver pod into a single JVM, which would save the ~2sec JVM startup time.

lins05 commented 7 years ago

> merging rest server JVM and driver JVM on the driver pod into the same JVM

Actually that's what YARN cluster mode does, so +1 for it.

foxish commented 7 years ago

That looks awesome. Thanks @ash211 for running those tests. It also verifies that it runs on AWS without issues here, which is great.

mccheah commented 7 years ago

Does the 8 seconds include both bypassing all of the futures and also getting past the initial ping of the remote server? It would be good to distinguish between the time spent on:

ash211 commented 7 years ago

Re AWS: this was using plain EC2 with a kubeadm-created cluster, not their container service, ECS. But it is a good indication that it at least works somewhat in AWS.

The 8 seconds covered both the watches triggering the futures and the ping. I'm not sure of the breakdown between them, since I was running a slightly-behind version of our branch that didn't have the logging on k8s resource readiness + ping verification. Will re-run with latest and post new stats.