apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

spark task exit errors #496

Open duyanghao opened 7 years ago

duyanghao commented 7 years ago

Sometimes the driver exits with the following error logs:

2017-09-15 16:11:57 INFO SparkUI: Stopped Spark web UI at http://192.168.9.248:4040
2017-09-15 16:11:57 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
2017-09-15 16:11:57 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(69,WrappedArray())
2017-09-15 16:11:57 INFO KubernetesClusterSchedulerBackend: Closing kubernetes client
2017-09-15 16:11:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5034_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5034_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5036_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5036_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5032_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5032_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5030_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5030_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5028_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5028_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5026_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5026_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5024_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5024_piece0,StorageLevel(1 replicas),0,0))
2017-09-15 16:11:57 INFO BlockManagerInfo: Removed broadcast_5017_piece0 on 192.168.9.248:22770 in memory (size: 2.1 MB, free: 22.7 GB)
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.9.248, 22770, None),broadcast_5017_piece0,StorageLevel(1 replicas),0,0))

...

2017-09-15 16:11:57 ERROR TransportResponseHandler: Still have 25 requests outstanding when connection from /192.168.21.0:46198 is closed
2017-09-15 16:11:57 INFO SparkContext: Successfully stopped SparkContext
2017-09-15 16:11:57 ERROR TransportResponseHandler: Still have 25 requests outstanding when connection from /192.168.21.0:21597 is closed
2017-09-15 16:11:57 ERROR TransportResponseHandler: Still have 25 requests outstanding when connection from /192.168.1.192:32531 is closed
2017-09-15 16:11:57 ERROR TransportResponseHandler: Still have 25 requests outstanding when connection from /192.168.1.192:19064 is closed
2017-09-15 16:11:57 INFO ShutdownHookManager: Shutdown hook called
2017-09-15 16:11:57 INFO ShutdownHookManager: Deleting directory /tmp/spark-fda9ca31-ca28-418e-8efd-ecab171b8f89
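
Note the ordering in the driver log above: the `LiveListenerBus` errors only appear after `Shutting down all executors`, i.e. the listener bus was already stopped when late block-update events were posted. That ordering can be confirmed mechanically; the script below is a hypothetical diagnostic (not Spark code) that scans log lines for the two markers, using a few sample lines taken from the log above:

```python
# Hypothetical diagnostic: count "Dropping event" errors before and
# after executor shutdown begins. Sample lines come from the log above.
LOG = """\
2017-09-15 16:11:57 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(69,WrappedArray())
2017-09-15 16:11:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(...)
"""

shutdown_seen = False
dropped_before_shutdown = 0
dropped_after_shutdown = 0
for line in LOG.splitlines():
    if "Shutting down all executors" in line:
        shutdown_seen = True
    elif "Dropping event" in line:
        if shutdown_seen:
            dropped_after_shutdown += 1
        else:
            dropped_before_shutdown += 1

print(dropped_before_shutdown, dropped_after_shutdown)  # prints "0 2"
```

Run against the full driver log, a non-zero count only in the "after" bucket indicates the dropped events are purely a shutdown-ordering symptom.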

while some executors report the following logs:

2017-09-15 16:11:58 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=9092046032893137009, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
        at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:111)
        at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
        at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:313)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:770)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1256)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
        at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
        at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
        at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
        at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
        at io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:36)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1127)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1061)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:745)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=7029619020740092355, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=5863212271784827014, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=5176472406356179021, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=4748371057720592657, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6487692282055856464, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6977859217025701339, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6982414833681962855, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=7043496566615187418, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-09-15 16:11:58 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=7522995429907086570, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]}} to /192.168.9.248:17446; closing connection
java.nio.channels.ClosedChannelException

...

2017-09-15 16:11:58 INFO CoarseGrainedExecutorBackend: Driver from 192.168.9.248:17446 disconnected during shutdown
2017-09-15 16:11:58 INFO CoarseGrainedExecutorBackend: Driver from 192.168.9.248:17446 disconnected during shutdown
2017-09-15 16:12:06 INFO MemoryStore: MemoryStore cleared
2017-09-15 16:12:06 INFO BlockManager: BlockManager stopped
2017-09-15 16:12:06 INFO ShutdownHookManager: Shutdown hook called
2017-09-15 16:12:06 INFO ShutdownHookManager: Deleting directory /tmp/spark-27a5b8f7-b5db-432d-b883-427cabdacac8
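
The executor-side `Broken pipe` and `ClosedChannelException` errors describe the same situation: the driver closed the RPC connection while the executor still had responses queued to write. A minimal stand-alone illustration of that race follows; it uses plain Python sockets, not Spark's Netty transport, and is only a sketch of the mechanism:

```python
import socket

# Toy model of the executor-side failure: the peer (standing in for
# the driver) closes its end of the connection while the other side
# (the "executor") still has a response to flush.
a, b = socket.socketpair()   # a = "executor", b = "driver"
b.close()                    # driver shuts down and closes the connection

try:
    for _ in range(2):       # a second write fails even if the first was buffered
        a.sendall(b"pending RpcResponse")
    err = None
except OSError as e:         # e.g. BrokenPipeError ("Broken pipe")
    err = e
finally:
    a.close()
```

The exception type varies by platform, but it is always an `OSError` subtype; in the executor logs above it surfaces through Netty as `java.io.IOException: Broken pipe` and then `ClosedChannelException` once the channel is closed.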
foxish commented 7 years ago

@ash211, have you seen this error?

rootsongjc commented 7 years ago

@duyanghao How do you submit the job? Can you attach the commands?

duyanghao commented 7 years ago

@rootsongjc

How do you submit the job? Can you attach the commands?

bin/spark-submit \
--deploy-mode=cluster \
--master=k8s://https://xxx \
--kubernetes-namespace=xxx \
--conf=spark.kubernetes.driver.docker.image=xxx \
--conf=spark.kubernetes.executor.docker.image=xxx \
--conf=spark.kubernetes.initcontainer.docker.image=xxx \
--conf=spark.app.name=driver-hang \
--conf=spark.driver.memory=4096M \
--conf=spark.driver.cores=4 \
--conf=spark.executor.instances=100 \
--conf=spark.executor.memory=4096M \
--conf=spark.executor.cores=4 \
--conf=spark.kubernetes.submission.waitAppCompletion=false \
--conf=spark.ui.showConsoleProgress=false \
--class=xxx \
xxx.jar

Nothing unusual!

rootsongjc commented 7 years ago

This is how I run Spark on Kubernetes: https://github.com/rootsongjc/kubernetes-handbook/blob/master/usecases/running-spark-with-kubernetes-native-scheduler.md. I didn't see your errors.

duyanghao commented 7 years ago

See this page for how I run Spark on Kubernetes: https://github.com/rootsongjc/kubernetes-handbook/blob/master/usecases/running-spark-with-kubernetes-native-scheduler.md

@rootsongjc so what do you mean by this?

Is there anything wrong with the submit command?

rootsongjc commented 7 years ago

Could you describe your Kubernetes cluster environment, especially the networking setup? I can't see anything wrong in your command.

duyanghao commented 7 years ago

@rootsongjc I don't think that's the main point of this error. After all, hundreds of tasks are running in my cluster.

And judging from the logs, I do think it is a logic bug in Spark on K8s.
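
The hypothesized logic bug is a shutdown-ordering hazard: the listener bus is stopped while other components can still post events, so late events are dropped instead of delivered. A toy model of that hazard (plain Python, no Spark APIs; class and method names are illustrative only):

```python
import queue

# Toy "listener bus": refuses events once stopped, like the driver's
# "SparkListenerBus has already stopped! Dropping event ..." errors.
class ToyListenerBus:
    def __init__(self):
        self.stopped = False
        self.events = queue.Queue()
        self.dropped = 0

    def post(self, event):
        if self.stopped:
            self.dropped += 1        # late event is silently dropped
        else:
            self.events.put(event)   # normal delivery

    def stop(self):
        self.stopped = True

bus = ToyListenerBus()
bus.post("BlockUpdated-1")
bus.stop()                 # bus stopped before producers are quiesced
bus.post("BlockUpdated-2") # late event: dropped, as in the driver log
print(bus.events.qsize(), bus.dropped)  # prints "1 1"
```

If the real shutdown sequence stops the bus before block managers and executors finish posting, the dropped-event errors are expected noise during teardown rather than data loss, but the `Still have 25 requests outstanding` errors suggest the RPC connections are also being torn down before in-flight replies drain.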