apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

K8s jobs slow due to hard cpu limits #352

Closed: FANNG1 closed this issue 7 years ago

FANNG1 commented 7 years ago

Ran a simple WordCount app, comparing running time on YARN and on k8s. Jobs on k8s are much slower:

  1. Some time is wasted on pod start (10s more).
  2. The map task execution time (especially the first task) is also much longer than on YARN (6s more). Notice that k8s jobs also spend more time in GC. Has anyone else encountered this, and do you know why?
FANNG1 commented 7 years ago

Some configuration:

spark.eventLog.dir hdfs://10.196.132.104:49000/spark/eventlog
spark.history.fs.logDirectory hdfs://10.196.132.104:49000/spark/eventlog
spark.eventLog.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 1
spark.dynamicAllocation.maxExecutors 1
spark.executor.memory 1g
spark.executor.cores 1
spark.driver.memory 1g

spark.shuffle.service.enabled true
spark.kubernetes.shuffle.namespace default
spark.kubernetes.shuffle.dir /data2/spark
spark.local.dir /data2/spark
spark.kubernetes.shuffle.labels app=spark-shuffle-service,spark-version=2.1.0

spark.kubernetes.executor.memoryOverhead 5000

FANNG1 commented 7 years ago

Updated the logs. Task 0 on k8s takes 10s, while on YARN it takes only 4s. The kubelet and the NodeManager are on the same machine and use the same JVM parameters; please skip the log lines added for debugging. (yarn executor log, k8s executor log)

FANNG1 commented 7 years ago

Found the main reason: on k8s the cpu limit is set equal to the cpu request, in this case 1 core, while on YARN an executor can exceed its cpu request. Added a pull request introducing "spark.kubernetes.executor.limit.cores" to specify the executor's cpu limit.
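A minimal sketch of how the proposed key might sit alongside the existing settings, in the same spark-defaults.conf style as the configuration above (the limit value here is purely illustrative; the exact semantics are defined by the PR):

# cpu request for each executor pod (existing behavior)
spark.executor.cores 1
# proposed in the PR: hard cpu limit for each executor pod; leave unset for no limit
spark.kubernetes.executor.limit.cores 2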

ash211 commented 7 years ago

Great debugging, @sandflee! What you're proposing around cpu limits is definitely a plausible explanation for some of the performance discrepancy between YARN and k8s.

Looking at these YARN docs, it appears that YARN doesn't limit a YARN container's cpu at the OS level without cgroups being configured, and because that's not on by default, most YARN installations probably don't have cpu limiting. So I suspect that when a YARN scheduler allocates vcores to an application, the number of cores is really just an unenforced request, which the application may exceed either intentionally or unintentionally.

So in your testing you were probably comparing a strictly limited 1-core executor on k8s against a 1-core-but-actually-unlimited executor on YARN. Indeed, those performance results should be different!

Bottom line is, making this core limit configurable makes sense to me.

I suspect in many of my own deployments I would want to emulate the YARN behavior of allowing the pod to use an unlimited amount of CPU on the kubelet, especially when no other pods are on the kubelet. Do you (or @foxish) know a way to set no cpu limit, or unlimited cpu?

foxish commented 7 years ago

Setting a request and no limit would be the way to make sure that there is only a minimum guarantee and no upper bound on usage. Making that the default makes sense to me. We should have the limit be optional using the option that @sandflee wrote in his PR.
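For illustration, a minimal sketch of the container resources section that implies for an executor pod, assuming the request is still derived from spark.executor.cores (the value shown is illustrative):

resources:
  requests:
    cpu: "1"   # minimum guaranteed share only
  # no limits block: the container may consume any idle cpu on the node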

foxish commented 7 years ago

CPU is considered a "compressible" resource (unlike memory), so there is no harm in making it unbounded. If the system has insufficient cpu, the executor's CPU usage will be throttled (down to its request of 1 CPU), but the executor will not be killed. See also: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#compressible-resource-guarantees

liyinan926 commented 7 years ago

Spark already has spark.cores.max, applicable to standalone and Mesos (static) deployments, which limits the total number of cores across the whole cluster. So the core limit per executor could be derived as spark.cores.max / spark.executor.cores. Should we consider using spark.cores.max instead of coming up with a Kubernetes-specific key, given that we honor spark.executor.cores? This matters particularly if an application that already uses spark.cores.max is migrated from a standalone Spark cluster to run on Kubernetes; it is reasonable to assume that the user of the application expects the same behavior.

ash211 commented 7 years ago

I don't think we can rely on spark.cores.max always being set, since not all applications set it, especially those migrating from YARN, where that setting has no effect.

And the arithmetic spark.cores.max / spark.executor.cores would give the count of executors, rather than the core limit per executor: (total cores across cluster) / (cores per executor) = (executor count). Possibly, as a different default when spark.cores.max is set, we could change the default per-executor cpu limit from unbounded to spark.cores.max, which would apply the whole-cluster cpu limit as the per-executor cpu limit as well.
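For concreteness, with hypothetical numbers: spark.cores.max = 8 and spark.executor.cores = 2 gives 8 / 2 = 4, which is the number of executors, not a per-executor core cap.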

But because cpu is a compressible resource, I think a better default would be unlimited cpu.

liyinan926 commented 7 years ago

My bad, it should be spark.cores.max / executor-instances. I agree that the default should be unlimited, and the PR is doing the right thing. I'm just concerned about introducing yet another config key whose purpose could be served by spark.cores.max. It's certainly true that spark.cores.max is not always set, but with a sensible default of unlimited, that's not really an issue. We just need to document it clearly so people know what to do when they do need to limit the cores per executor.

kimoonkim commented 7 years ago

@liyinan926 Thanks for debugging this executor core limit issue. This is quite interesting.

it should be spark.cores.max / executor-instances.

I am curious how this would work with dynamic allocation, where the number of executors varies as the job runs. Is this suggested only for static allocation? If so, do we need another flag anyway if we want to limit max cores per executor under dynamic allocation?

liyinan926 commented 7 years ago

@kimoonkim That's a good point. Changing the number of executors would cause the per-executor limit to change as well if the value of spark.cores.max stays the same. With Kubernetes that's probably not possible, because the cpu limit is fixed when an executor pod is created.

tangzhankun commented 7 years ago

@sandflee Have you tested the result with your new patch?

If I remember correctly, spec.containers[].resources.requests.cpu is converted to the --cpu-shares flag of the docker run command, while spec.containers[].resources.limits.cpu is converted to --cpu-quota. So I think we should declare --cpu-shares via the k8s client instead of hard-limiting cpu time (--cpu-quota) if we want to utilize otherwise idle cpu cores?
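Roughly, that mapping looks like the following sketch (the numbers and the --cpu-period value are container-runtime details and only indicative; <executor-image> is a placeholder):

# requests.cpu: 1  ->  --cpu-shares=1024 (relative weight; idle cores remain usable)
# limits.cpu: 1    ->  --cpu-quota=100000 with --cpu-period=100000 (hard cap at 1 core)
docker run --cpu-shares=1024 --cpu-quota=100000 --cpu-period=100000 <executor-image>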

FANNG1 commented 7 years ago

Yes, I have tested the patch. Setting a hard limit is a user option, with no limit as the default. Users can still limit cpu usage if needed, for example when co-running an online service and a Spark job.

ash211 commented 7 years ago

Closed by https://github.com/apache-spark-on-k8s/spark/pull/356

ash211 commented 7 years ago

Thanks again for debugging this, @sandflee! Happy to discuss any more performance discrepancies you find vs YARN, or any other improvements, in future issues.

tangzhankun commented 7 years ago

@sandflee I checked the patch. Thanks!