
Troubleshooting a host that ran out of ports #55


Troubleshooting a host that ran out of ports

Background

This morning a colleague came to me saying their job had a problem and asked me to help troubleshoot. The error log was as follows:

[data]07:44:56 476  INFO (org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint:54) - Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.17.224:33876) with ID 1
[data]07:44:57 08 ERROR (org.apache.spark.scheduler.cluster.YarnClusterScheduler:70) - Lost executor 1 on sh-bs-3-i1-hadoop-17-224: Unable to create executor due to 地址已在使用: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
[data]07:44:57 15  INFO (org.apache.spark.scheduler.DAGScheduler:54) - Executor lost: 1 (epoch 0)
[data]07:44:57 48  INFO (org.apache.spark.storage.BlockManagerMasterEndpoint:54) - Trying to remove executor 1 from BlockManagerMaster.
[data]07:44:57 50  INFO (org.apache.spark.storage.BlockManagerMaster:54) - Removed 1 successfully in removeExecutor
[data]07:44:57 51  INFO (org.apache.spark.scheduler.DAGScheduler:54) - Shuffle files lost for executor: 1 (epoch 0)
[data]07:44:57 342  WARN (org.apache.spark.network.server.TransportChannelHandler:78) - Exception in connection from /192.168.17.224:33876
java.io.IOException: Connection reset by peer

As far as I remember, these common services usually have their port set to 0, letting the machine pick a random free port. A machine has tens of thousands of ports, so unless they were all used up this error should not occur. So I started digging in.
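For reference, "port 0" here means the service asks the kernel for any free ephemeral port at bind time, and the bind only fails once that range is exhausted. A minimal Scala sketch of the mechanism (illustrative only, not the actual Spark code):

import java.net.ServerSocket

object RandomPortDemo {
  def main(args: Array[String]): Unit = {
    // Binding to port 0 lets the OS pick any free ephemeral port.
    val socket = new ServerSocket(0)
    try {
      println(s"kernel assigned port ${socket.getLocalPort}")
    } finally {
      socket.close()
    }
    // If no ephemeral port is left, the constructor throws
    // java.net.BindException: Address already in use -- the same kind of
    // error NettyBlockTransferService hit after 16 retries in the log above.
  }
}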

Checking the host

I logged on to the problematic host and ran ss -s, which showed tens of thousands of ports in use on this machine.

(screenshot of the ss -s output)

Then I ran netstat to see which process owned these connections and found its pid.

(screenshot of the netstat output)

From the pid I then located the process itself:

5-16:57:49 /usr/local/jdk8/bin/java -Xms3840m -Xmx3840m -Dlog.file=/data/hdfs/yarn/logs/application_1518350103530_720283/container_1518350103530_720283_01_000007/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskManager --configDir .

This is the TaskManager of a Flink job.

Next, I went to YARN, found this application's other TaskManagers, and saw that they had the same connection-leak problem.

Analysis

From the netstat screenshot it was easy to see that almost all of these connections went to the DataNode port 50010, so most likely the code that accesses HDFS was leaking connections. A code review confirmed it: there was code that opened an FSDataInputStream but never closed it, which should be the main cause. The next step is to ask the business team to fix their code; we will keep observing the effect once the fix is deployed.
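The fix is simply to make sure every stream opened from HDFS gets closed. A minimal Scala sketch of the leaky pattern and the corrected one, assuming the business code reads a text file via the Hadoop FileSystem API (the path and object names are made up for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsReadExample {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/tmp/example.txt")

    // Leaky pattern: the FSDataInputStream returned by fs.open is never
    // closed, so the TCP connection to the DataNode (port 50010) lingers.
    // val leaked = fs.open(path)
    // Source.fromInputStream(leaked).getLines().foreach(println)

    // Corrected pattern: always close the stream, even if reading throws.
    val in = fs.open(path)
    try {
      Source.fromInputStream(in).getLines().foreach(println)
    } finally {
      in.close()
    }
  }
}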