apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens in https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Discuss how to make it easier to debug when executors die because of memory limit #247

Open kimoonkim opened 7 years ago

kimoonkim commented 7 years ago

@foxish

I was running the HDFS-in-K8s experiment using Spark TeraSort jobs. It turned out the default executor memory size, 1 GB per executor, is way too small for this workload. Executor JVMs would just get killed and restarted. I ended up specifying 6 GB per executor.
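For context, the per-executor heap is controlled by the standard spark.executor.memory setting; below is a minimal sketch of a submit command where everything other than that setting (master URL, class, jar, instance count) is a placeholder and may differ from what was actually run:

```
# Hypothetical submit command; the only point here is spark.executor.memory.
$ bin/spark-submit \
    --deploy-mode cluster \
    --master k8s://https://<apiserver-host>:<port> \
    --class <main-class> \
    --conf spark.executor.instances=10 \
    --conf spark.executor.memory=6g \
    <application-jar>
```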

Learning the root cause was a painful process though, because there is no easy way to see why the JVMs get killed. It does not show up in $ kubectl logs of the executor pods. I caught a glimpse of it only in the Kubernetes dashboard UI, and only when I happened to be visiting the pod page at the right time.

I wonder if there is a better way. I hear a lot that Spark uses lots of memory, depending on the application. I fear many people will have to go through this troubleshooting without much help.
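One way to surface the kill reason after the fact, without having to catch the dashboard at the right moment, is kubectl describe, which reports the last terminated state of each container; a quick sketch (the pod name is hypothetical):

```
$ kubectl describe pod <executor-pod-name> | grep -A 5 'Last State'
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
```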

foxish commented 7 years ago

Is it the pod getting OOMKilled by Kubernetes, or is this happening within Spark?

kimoonkim commented 7 years ago

I think it's Kubernetes. (I believe I saw "OOM Kill" in the Kubernetes dashboard UI, but I am not entirely sure because the clue was hard to find.) I was also running $ kubectl logs -f on the executor pod when executors got killed, and I didn't see any log message indicating that Spark was the one doing the killing.
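A scriptable way to confirm which side did the killing, assuming the pod object still exists, is to read the container's last terminated state directly from the API (the pod name is hypothetical):

```
$ kubectl get pod <executor-pod-name> \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled
$ kubectl get pod <executor-pod-name> \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
137
```

Reason OOMKilled with exit code 137 (128 + SIGKILL) points at the kubelet enforcing the container memory limit, whereas heap exhaustion inside the JVM shows up as a java.lang.OutOfMemoryError stack trace in the Spark logs instead.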

kimoonkim commented 7 years ago

@varunkatta I see #244 handling VMEM_EXCEEDED and PMEM_EXCEEDED. Maybe these are the symptoms of executors being OOMKilled by Kubernetes? If so, maybe we just need clearer log messages in the driver?

honkiko commented 7 years ago

Sometimes the executor is killed by a JVM java.lang.OutOfMemoryError, and in that case the message is present in the driver log:

```
17/06/29 01:54:39 WARN KubernetesTaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 172.16.75.92, executor 1): java.lang.OutOfMemoryError: Java heap space
    at scala.reflect.ManifestFactory$$anon$2.newArray(Manifest.scala:177)
    at scala.reflect.ManifestFactory$$anon$2.newArray(Manifest.scala:176)
    at org.apache.spark.util.collection.KVArraySortDataFormat.allocate(SortDataFormat.scala:109)
    at org.apache.spark.util.collection.KVArraySortDataFormat.allocate(SortDataFormat.scala:86)
    at org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
```