kimoonkim opened 7 years ago
Is it the pod getting OOMKilled by Kubernetes, or is this happening within Spark?
I think it's Kubernetes. (I think I saw "OOM Kill" in the Kubernetes Dashboard UI, but I am not quite sure because the clue was hard to find.) I was also running $ kubectl logs -f executor-pod
when executors got killed, and I didn't see any log message indicating it's Spark.
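One way to confirm which side did the killing is to look at the executor pod's last container state. A minimal sketch, assuming the pod is literally named executor-pod (substitute the real pod name):

$ kubectl describe pod executor-pod
    ...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    ...

$ kubectl get pod executor-pod \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled

A Reason of OOMKilled with exit code 137 (128 + SIGKILL) points at the Kubernetes/kernel memory limit, rather than a JVM java.lang.OutOfMemoryError.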
@varunkatta I see #244 handling VMEM_EXCEEDED and PMEM_EXCEEDED. Maybe these are symptoms of being OOMKilled by Kubernetes? Maybe we just need clearer log messages in the driver?
Sometimes it's killed by a JVM java.lang.OutOfMemoryError, and the message is present in the driver log.
17/06/29 01:54:39 WARN KubernetesTaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 172.16.75.92, executor 1): java.lang.OutOfMemoryError: Java heap space
    at scala.reflect.ManifestFactory$$anon$2.newArray(Manifest.scala:177)
    at scala.reflect.ManifestFactory$$anon$2.newArray(Manifest.scala:176)
    at org.apache.spark.util.collection.KVArraySortDataFormat.allocate(SortDataFormat.scala:109)
    at org.apache.spark.util.collection.KVArraySortDataFormat.allocate(SortDataFormat.scala:86)
    at org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
@foxish
I was doing the HDFS-in-K8s experiment using Spark TeraSort jobs. It turned out the default executor memory, 1 GB per executor, is way too small for that workload. Executor JVMs would just get killed and restarted. I ended up specifying 6 GB per executor.
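The bump itself is just a Spark conf passed at submit time. A minimal sketch of what such a command looks like; the master URL, main class, instance count, and jar below are placeholders, not the exact command from this experiment:

$ bin/spark-submit \
    --deploy-mode cluster \
    --master k8s://https://<api-server-host>:<port> \
    --class <terasort-main-class> \
    --conf spark.executor.instances=10 \
    --conf spark.executor.memory=6g \
    <application-jar>

Depending on the branch, the non-heap container overhead may also need a bump (e.g. spark.kubernetes.executor.memoryOverhead), since the pod's memory limit has to cover the heap plus off-heap/native usage.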
Learning the root cause was a painful process, though, because there is no easy way to see why the JVMs get killed. It does not show up in
$ kubectl logs
of executor pods. I saw a glimpse of it only in the Kubernetes dashboard UI, when I happened to visit the pod page at the right time. I wonder if there is a better way. I hear a lot that Spark's memory usage varies widely by application. I fear many people will have to go through this troubleshooting without much help.
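One partial workaround, assuming the container was restarted in place rather than the whole pod being deleted: kubectl can fetch the previous container instance's logs.

$ kubectl logs executor-pod --previous

That only helps when the JVM managed to print the OutOfMemoryError before dying; a kernel-level OOM kill typically leaves nothing in stdout, in which case the Last State / OOMKilled reason from kubectl describe pod is the better clue.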