DataLinkDC / dinky

Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
http://www.dinky.org.cn
Apache License 2.0
2.92k stars 1.07k forks

Task status is unknown after a batch task finishes #3619

Open Aitongong opened 2 days ago

Aitongong commented 2 days ago

What happened

Dinky version: 1.0.3. Flink submission mode: yarn-application. Task type: batch.

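Below is the task status in Dinky. The task executed successfully, but Dinky still shows operators in the running state, and the task status changed to unknown. (screenshot 1802f9c33e4bf93495d1657bd065dbb4) This is the task status on YARN. (screenshot 17dbb3c60b7eef81201fa36cd7debdc5)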

What you expected to happen

[dinky] 2024-06-30 19:00:01 CST INFO org.dinky.service.impl.TaskServiceImpl 181 prepareTask - Start check and config task, task:EDW.MC-MEMBER-FIRST-ADMIN-INFO
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 208 getLocalFlinkDistPath - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 605 deployInternal - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=2048, slotsPerTaskManager=1}
[dinky] 2024-06-30 19:01:01 CST WARN org.apache.flink.core.plugin.PluginConfig 69 getPluginsDir - The plugins directory [plugins] does not exist.
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils 330 capToMinMax - The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 1239 startAppMaster - Submitting application master application_1711554512049_148730
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl 310 submitApplication - Submitted application application_1711554512049_148730
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 1242 startAppMaster - Waiting for the cluster to be allocated
[dinky] 2024-06-30 19:01:01 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 1277 startAppMaster - Deploying cluster, current state ACCEPTED
[dinky] 2024-06-30 19:01:05 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 1270 startAppMaster - YARN application has been deployed successfully.
[dinky] 2024-06-30 19:01:05 CST INFO org.apache.flink.yarn.YarnClusterDescriptor 1843 setClusterEntrypointInfoToConfig - Found Web Interface xxx.xxx.xxx.xxx:39955 of application 'application_1711554512049_148730'.
[dinky] 2024-06-30 19:01:15 CST INFO org.dinky.service.impl.TaskServiceImpl 196 executeJob - execute job finished,status is SUCCESS
[dinky] 2024-06-30 19:01:15 CST INFO org.dinky.service.impl.TaskServiceImpl 330 submitTask - Job Submit success
[dinky] 2024-06-30 19:01:28 CST WARN org.dinky.job.handler.JobRefreshHandler 271 getJobData - Connect vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955 failed,Unexpected end of file from server
[dinky] 2024-06-30 19:01:34 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
[dinky] 2024-06-30 19:01:34 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
[dinky] 2024-06-30 19:01:34 CST WARN org.dinky.job.handler.JobRefreshHandler 271 getJobData - Connect vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955 failed,ConnectException: 拒绝连接 (Connection refused)
[dinky] 2024-06-30 19:01:39 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
[dinky] 2024-06-30 19:01:39 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
[dinky] 2024-06-30 19:01:44 CST WARN org.dinky.job.handler.JobRefreshHandler 271 getJobData - Connect vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955 failed,ConnectException: 拒绝连接 (Connection refused)
[dinky] 2024-06-30 19:01:49 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
[dinky] 2024-06-30 19:01:49 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
...... (the log continued like this for half an hour)
[dinky] 2024-06-30 19:32:08 CST WARN org.dinky.job.handler.JobRefreshHandler 271 getJobData - Connect vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955 failed,ConnectException: 拒绝连接 (Connection refused)
[dinky] 2024-06-30 19:32:48 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
[dinky] 2024-06-30 19:32:48 CST INFO org.dinky.cluster.FlinkCluster 63 executeSocketTest - Flink jobManager 地址排除 -- vm-mysteel-cdh-dw-dn03.mysteeltech.com:39955
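
The log above shows the JobManager address on port 39955 still being probed about half an hour after the first refused connection. As a rough illustration only, and not Dinky's actual code, a poller could count consecutive connection failures and give up on an endpoint once a limit is reached, so that another source can be consulted for the final status. The class name `ConsecutiveFailureTracker` and the threshold of 3 are hypothetical.

```java
/** Hypothetical helper: tracks consecutive connection failures for one endpoint. */
final class ConsecutiveFailureTracker {
    private final int maxFailures;
    private int failures;

    ConsecutiveFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /** Record a successful poll; the failure streak starts over. */
    void recordSuccess() {
        failures = 0;
    }

    /** Record a failed poll and report whether the endpoint should now be given up on. */
    boolean recordFailure() {
        failures++;
        return isExhausted();
    }

    /** True once the number of consecutive failures has reached the limit. */
    boolean isExhausted() {
        return failures >= maxFailures;
    }

    public static void main(String[] args) {
        // Assumed threshold of 3; a real value would depend on the polling interval.
        ConsecutiveFailureTracker tracker = new ConsecutiveFailureTracker(3);
        tracker.recordFailure();
        tracker.recordFailure();
        boolean giveUp = tracker.recordFailure(); // third failure in a row
        System.out.println("stop polling JobManager: " + giveUp);
    }
}
```

A tracker like this only decides when to stop polling; what status to report afterwards still has to come from somewhere else, for example the YARN application report.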

How to reproduce

Submit a batch task in Flink application mode; the unknown status appears intermittently.

Anything else

No response

Version

dev

github-actions[bot] commented 2 days ago

Hello @Aitongong, this issue is about web, so I assign it to @Zzm0809. If you have any questions, you can comment and reply.

2413940852 commented 1 day ago

Please post the YARN logs.

Aitongong commented 1 day ago

The YARN logs are quite long. Which part do you need specifically?

gaoyan1998 commented 1 day ago

The last part is enough.

Aitongong commented 1 hour ago

2024-07-01 19:01:23,672 INFO org.dinky.app.flinksql.Submitter [] - Execution succeeded.
2024-07-01 19:01:23,674 INFO org.dinky.app.flinksql.Submitter [] - 2024-07-01T19:01:23.674 The task is successfully submitted
2024-07-01 19:01:23,675 INFO org.dinky.app.flinksql.Submitter [] - Start Monitor Job
2024-07-01 19:01:23,882 INFO org.apache.flink.runtime.history.FsJobArchivist [] - Job de26715162c67b4b32e04dd9f641c626 has been archived at hdfs:/user/flink/dinky-completed-jobs/de26715162c67b4b32e04dd9f641c626.
2024-07-01 19:01:23,884 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job de26715162c67b4b32e04dd9f641c626 has been registered for cleanup in the JobResultStore after reaching a terminal state.
2024-07-01 19:01:23,886 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job 'EDW.APP-USER-FIRST-ADMIN-INFO' (de26715162c67b4b32e04dd9f641c626).
2024-07-01 19:01:23,889 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2024-07-01 19:01:23,891 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [066679ecf9ee4afd749ed415a8266f3d].
2024-07-01 19:01:23,892 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [5af9a9808e9f822a8b4a19047acdd71d].
2024-07-01 19:01:23,892 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [fad449000c1576149cff4bb984be6e74].
2024-07-01 19:01:23,892 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 63d7c3348ca8f0d60c1e1c5cdb40b399: Stopping JobMaster for job 'EDW.APP-USER-FIRST-ADMIN-INFO' (de26715162c67b4b32e04dd9f641c626).
2024-07-01 19:01:23,894 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@vm-mysteel-cdh-dw-dn06.mysteeltech.com:42590/user/rpc/jobmanager_2 for job de26715162c67b4b32e04dd9f641c626 from the resource manager.
2024-07-01 19:01:28,717 ERROR org.dinky.app.util.FlinkAppUtil [] - send hook failed,retry later,url:http://xxx.xxx.xxx.xxx:8889/api/jobInstance/hookJobDone?taskId=852&jobId=de26715162c67b4b32e04dd9f641c626, taskId:852,jobId:de26715162c67b4b32e04dd9f641c626,Read timed out
2024-07-01 19:01:29,922 INFO org.dinky.app.util.FlinkAppUtil [] - refesh job status finished, status is FINISHED
2024-07-01 19:01:34,924 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application completed SUCCESSFULLY
2024-07-01 19:01:34,925 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting YarnApplicationClusterEntryPoint down with application status SUCCEEDED. Diagnostics null.
2024-07-01 19:01:34,928 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
2024-07-01 19:01:34,954 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Removing cache directory /tmp/flink-web-19b19815-6a81-4d91-97ac-d640920bd174/flink-web-ui
2024-07-01 19:01:34,955 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://xxx.xxx.xxx.xxx:38379 lost leadership
2024-07-01 19:01:34,955 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shut down complete.
2024-07-01 19:01:34,955 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Shut down cluster because application is in SUCCEEDED, diagnostics null.
2024-07-01 19:01:34,957 INFO org.apache.flink.yarn.YarnResourceManagerDriver [] - Unregister application from the YARN Resource Manager with final status SUCCEEDED.
2024-07-01 19:01:34,964 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl [] - Waiting for application to be successfully unregistered.
2024-07-01 19:01:35,089 INFO org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent [] - Closing components.
2024-07-01 19:01:35,089 INFO org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id 00000000-0000-0000-0000-000000000000. Stopping the DispatcherLeaderProcess.
2024-07-01 19:01:35,090 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2024-07-01 19:01:35,090 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher akka.tcp://flink@vm-mysteel-cdh-dw-dn06.mysteeltech.com:42590/user/rpc/dispatcher_0.
2024-07-01 19:01:35,090 INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Stopping resource manager service.
2024-07-01 19:01:35,090 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher akka.tcp://flink@vm-mysteel-cdh-dw-dn06.mysteeltech.com:42590/user/rpc/dispatcher_0.
2024-07-01 19:01:35,091 INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is not running. Ignore revoking leadership.
2024-07-01 19:01:35,093 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped dispatcher akka.tcp://flink@vm-mysteel-cdh-dw-dn06.mysteeltech.com:42590/user/rpc/dispatcher_0.
2024-07-01 19:01:35,122 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Closing the slot manager.
2024-07-01 19:01:35,123 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Suspending the slot manager.
2024-07-01 19:01:35,124 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:43272
2024-07-01 19:01:35,126 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopping Akka RPC service.
2024-07-01 19:01:35,129 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopping Akka RPC service.
2024-07-01 19:01:35,177 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2024-07-01 19:01:35,177 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2024-07-01 19:01:35,178 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2024-07-01 19:01:35,178 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
2024-07-01 19:01:35,203 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
2024-07-01 19:01:35,203 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
2024-07-01 19:01:35,228 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopped Akka RPC service.
2024-07-01 19:01:35,237 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopped Akka RPC service.
2024-07-01 19:01:35,237 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Terminating cluster entrypoint process YarnApplicationClusterEntryPoint with exit code 0.
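
The YARN log above ends with the application unregistering from the ResourceManager with final status SUCCEEDED, after the hook call to Dinky timed out. Since the JobManager REST endpoint is gone once the application cluster shuts down, one possible cross-check is the YARN application report, which carries a `state` and a `finalStatus` field (for example via the ResourceManager REST endpoint `/ws/v1/cluster/apps/{appid}`). The following is a minimal sketch of such a mapping, not Dinky's implementation: the class `YarnStatusFallback`, the method `resolve`, and the choice of returned status names are assumptions made for illustration.

```java
/** Hypothetical helper: derives a job status from YARN application report fields. */
final class YarnStatusFallback {

    /**
     * @param state       YARN application state, e.g. ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
     * @param finalStatus YARN final status, e.g. UNDEFINED, SUCCEEDED, FAILED, KILLED
     * @return an assumed job status name: RUNNING, FINISHED, FAILED, CANCELED or UNKNOWN
     */
    static String resolve(String state, String finalStatus) {
        if (state == null) {
            return "UNKNOWN";
        }
        switch (state) {
            case "FINISHED":
                // FINISHED only says the application ended; finalStatus says how it ended.
                if ("SUCCEEDED".equals(finalStatus)) {
                    return "FINISHED";
                }
                if ("FAILED".equals(finalStatus)) {
                    return "FAILED";
                }
                if ("KILLED".equals(finalStatus)) {
                    return "CANCELED";
                }
                return "UNKNOWN";
            case "FAILED":
                return "FAILED";
            case "KILLED":
                return "CANCELED";
            case "RUNNING":
                return "RUNNING";
            default:
                // NEW, SUBMITTED, ACCEPTED and anything unrecognized.
                return "UNKNOWN";
        }
    }

    public static void main(String[] args) {
        // Values matching the end of the YARN log: application finished with SUCCEEDED.
        System.out.println(resolve("FINISHED", "SUCCEEDED"));
    }
}
```

With a mapping like this, a task whose JobManager can no longer be reached would only stay unknown when YARN itself has no conclusive answer.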