Troubleshooting a Host That Ran Out of Ports

Background

This morning a colleague came to me saying one of their jobs was having problems and asked me to help look into it. The error log was as follows:

[data]07:44:56 476 INFO (org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint:54) - Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.17.224:33876) with ID 1
[data]07:44:57 08 ERROR (org.apache.spark.scheduler.cluster.YarnClusterScheduler:70) - Lost executor 1 on sh-bs-3-i1-hadoop-17-224: Unable to create executor due to Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
[data]07:44:57 15 INFO (org.apache.spark.scheduler.DAGScheduler:54) - Executor lost: 1 (epoch 0)
[data]07:44:57 48 INFO (org.apache.spark.storage.BlockManagerMasterEndpoint:54) - Trying to remove executor 1 from BlockManagerMaster.
[data]07:44:57 50 INFO (org.apache.spark.storage.BlockManagerMaster:54) - Removed 1 successfully in removeExecutor
[data]07:44:57 51 INFO (org.apache.spark.scheduler.DAGScheduler:54) - Shuffle files lost for executor: 1 (epoch 0)
[data]07:44:57 342 WARN (org.apache.spark.network.server.TransportChannelHandler:78) - Exception in connection from /192.168.17.224:33876
java.io.IOException: Connection reset by peer
As I recall, services like this usually set their port to 0 so the OS picks a random free port. A machine has tens of thousands of ports, so unless they had all been used up this error should not appear. Time to start digging.
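For context, here is a small sketch of my own (not code from the affected job) showing the standard Spark settings behind that error message; the class name is made up and the values shown are the defaults as I remember them.

import org.apache.spark.SparkConf;

// Sketch only: the Spark settings that govern the bind behaviour in the log above.
public class NettyPortSettingsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                // 0 tells NettyBlockTransferService to bind to a random free port
                // (the default, and what "on a random free port" in the log refers to)
                .set("spark.blockManager.port", "0")
                // how many consecutive ports Spark tries before giving up;
                // the default of 16 matches "failed after 16 retries" in the log
                .set("spark.port.maxRetries", "16");
        // spark.driver.bindAddress, which the log message suggests, only helps when
        // the service binds to the wrong interface; it cannot help if the host has
        // genuinely run out of free ports.
        System.out.println(conf.toDebugString());
    }
}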

Checking the host
On the problematic host I ran ss -s and saw that the machine had tens of thousands of ports in use.
I then used netstat to see which process was holding those connections and found its PID.
It turned out to be the TaskManager of a Flink job.
Next I went to YARN, found this application's other TaskManagers, and saw that they had the same connection leak.

Analysis
The netstat output made it obvious that the connections almost all went to the DataNodes' port 50010, so the leak was most likely in the code that accesses HDFS. A code review confirmed it: in several places an FSDataInputStream was opened but never closed, which should be the main cause. The next step was to ask the owning team to fix the code; we will watch how things behave after the fix goes live.
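As an illustration of the problem (the class and path below are hypothetical, not from the actual job), the leaking pattern and the fix with try-with-resources look roughly like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamLeakSketch {

    // Leaking pattern: the stream is opened but never closed, so the
    // connection to the DataNode (port 50010) is held until the process exits.
    static long countBytesLeaky(FileSystem fs, Path path) throws IOException {
        FSDataInputStream in = fs.open(path);
        long total = 0;
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total; // missing in.close()
    }

    // Fixed pattern: try-with-resources closes the stream (and its socket)
    // even when the read throws.
    static long countBytesSafely(FileSystem fs, Path path) throws IOException {
        try (FSDataInputStream in = fs.open(path)) {
            long total = 0;
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path, for illustration only.
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println(countBytesSafely(fs, new Path("/tmp/example.txt")));
    }
}

Closing the stream releases the underlying connection to the DataNode instead of leaving it open for the lifetime of the TaskManager, which is what was slowly eating up the host's ports.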