apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Bug][Registry SPI] Registry SPI Bugs #5678

Closed chengshiwen closed 3 years ago

chengshiwen commented 3 years ago

To Reproduce

  1. Use the dev branch (d64e3cb) and build docker image
  2. Run 'docker-compose up -d'
  3. Go to ZooKeeper menu in Monitor
  4. See loading error
  5. Create a workflow and run
  6. See exception in worker logs

Expected behavior

Bug fixed

Screenshots

[image]

worker logs:

[INFO] 2021-06-22 13:58:08.803 org.apache.curator.framework.imps.CuratorFrameworkImpl:[356] - Default schema
[INFO] 2021-06-22 13:58:08.821 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server dolphinscheduler-zookeeper/172.18.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-06-22 13:58:08.850 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to dolphinscheduler-zookeeper/172.18.0.3:2181, initiating session
[INFO] 2021-06-22 13:58:09.100 org.apache.zookeeper.ClientCnxn:[1299] - Session establishment complete on server dolphinscheduler-zookeeper/172.18.0.3:2181, sessionid = 0x10014276ac80000, negotiated timeout = 4000
[INFO] 2021-06-22 13:58:09.126 org.apache.curator.framework.state.ConnectionStateManager:[251] - State change: CONNECTED
[WARN] 2021-06-22 13:58:09.243 org.apache.curator.utils.ZKPaths:[78] - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
[INFO] 2021-06-22 13:58:12.787 org.apache.dolphinscheduler.remote.NettyRemotingServer:[178] - NettyRemotingServer bind success at port : 1234
[INFO] 2021-06-22 13:58:12.974 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[97] - worker node : 172.18.0.5:1234 registry to ZK /nodes/worker/default/172.18.0.5:1234 successfully
[INFO] 2021-06-22 13:58:13.062 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[109] - worker node : 172.18.0.5:1234 heartbeat interval 10 s
[ERROR] 2021-06-22 13:58:13.674 org.apache.dolphinscheduler.server.worker.WorkerServer:[140] - zookeeper get children error
org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.getChildren(ZookeeperRegistry.java:262)
    at org.apache.dolphinscheduler.service.registry.RegistryCenter.getChildrenKeys(RegistryCenter.java:263)
    at org.apache.dolphinscheduler.service.registry.RegistryClient.removeDeadServerByHost(RegistryClient.java:438)
    at org.apache.dolphinscheduler.service.registry.RegistryClient.handleDeadServer(RegistryClient.java:395)
    at org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient.handleDeadServer(WorkerRegistryClient.java:150)
    at org.apache.dolphinscheduler.server.worker.WorkerServer.run(WorkerServer.java:138)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:363)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:307)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:136)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:413)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1761)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:592)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:514)
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:321)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:319)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:199)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:866)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:878)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:550)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:744)
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:391)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:312)
    at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:140)
    at org.apache.dolphinscheduler.server.worker.WorkerServer.main(WorkerServer.java:111)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /dolphinscheduler/dead-servers
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1659)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231)
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.getChildren(ZookeeperRegistry.java:258)
    ... 28 common frames omitted
13:58:14.180 [Worker-Server] ERROR org.springframework.boot.SpringApplication - Application run failed
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'workerServer': Invocation of init method failed; nested exception is java.lang.RuntimeException: org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:139) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:413) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1761) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:592) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:514) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:321) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:319) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:199) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:866) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:878) ~[spring-context-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:550) ~[spring-context-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:744) [spring-boot-2.1.18.RELEASE.jar:2.1.18.RELEASE]
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:391) [spring-boot-2.1.18.RELEASE.jar:2.1.18.RELEASE]
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:312) [spring-boot-2.1.18.RELEASE.jar:2.1.18.RELEASE]
    at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:140) [spring-boot-2.1.18.RELEASE.jar:2.1.18.RELEASE]
    at org.apache.dolphinscheduler.server.worker.WorkerServer.main(WorkerServer.java:111) [dolphinscheduler-server-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
Caused by: java.lang.RuntimeException: org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.apache.dolphinscheduler.server.worker.WorkerServer.run(WorkerServer.java:141) ~[dolphinscheduler-server-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:363) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:307) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:136) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    ... 16 more
Caused by: org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.getChildren(ZookeeperRegistry.java:262) ~[?:?]
    at org.apache.dolphinscheduler.service.registry.RegistryCenter.getChildrenKeys(RegistryCenter.java:263) ~[dolphinscheduler-service-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.service.registry.RegistryClient.removeDeadServerByHost(RegistryClient.java:438) ~[dolphinscheduler-service-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.service.registry.RegistryClient.handleDeadServer(RegistryClient.java:395) ~[dolphinscheduler-service-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient.handleDeadServer(WorkerRegistryClient.java:150) ~[dolphinscheduler-server-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.server.worker.WorkerServer.run(WorkerServer.java:138) ~[dolphinscheduler-server-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:363) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:307) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:136) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    ... 16 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /dolphinscheduler/dead-servers
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:114) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1659) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242) ~[curator-framework-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231) ~[curator-framework-4.3.0.jar:4.3.0]
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67) ~[curator-client-4.3.0.jar:?]
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81) ~[curator-client-4.3.0.jar:?]
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228) ~[curator-framework-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219) ~[curator-framework-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41) ~[curator-framework-4.3.0.jar:4.3.0]
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.getChildren(ZookeeperRegistry.java:258) ~[?:?]
    at org.apache.dolphinscheduler.service.registry.RegistryCenter.getChildrenKeys(RegistryCenter.java:263) ~[dolphinscheduler-service-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.service.registry.RegistryClient.removeDeadServerByHost(RegistryClient.java:438) ~[dolphinscheduler-service-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.service.registry.RegistryClient.handleDeadServer(RegistryClient.java:395) ~[dolphinscheduler-service-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient.handleDeadServer(WorkerRegistryClient.java:150) ~[dolphinscheduler-server-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at org.apache.dolphinscheduler.server.worker.WorkerServer.run(WorkerServer.java:138) ~[dolphinscheduler-server-1.3.6-SNAPSHOT.jar:1.3.6-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:363) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:307) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:136) ~[spring-beans-5.1.19.RELEASE.jar:5.1.19.RELEASE]
    ... 16 more
Exception in thread "Worker-Server" org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'workerServer': Invocation of init method failed; nested exception is java.lang.RuntimeException: org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:139)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:413)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1761)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:592)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:514)
    at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:321)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:319)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:199)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:866)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:878)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:550)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:744)
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:391)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:312)
    at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:140)
    at org.apache.dolphinscheduler.server.worker.WorkerServer.main(WorkerServer.java:111)
Caused by: java.lang.RuntimeException: org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.apache.dolphinscheduler.server.worker.WorkerServer.run(WorkerServer.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:363)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:307)
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:136)
    ... 16 more
Caused by: org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper get children error
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.getChildren(ZookeeperRegistry.java:262)
    at org.apache.dolphinscheduler.service.registry.RegistryCenter.getChildrenKeys(RegistryCenter.java:263)
    at org.apache.dolphinscheduler.service.registry.RegistryClient.removeDeadServerByHost(RegistryClient.java:438)
    at org.apache.dolphinscheduler.service.registry.RegistryClient.handleDeadServer(RegistryClient.java:395)
    at org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient.handleDeadServer(WorkerRegistryClient.java:150)
    at org.apache.dolphinscheduler.server.worker.WorkerServer.run(WorkerServer.java:138)
    ... 23 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /dolphinscheduler/dead-servers
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:114)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1659)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231)
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.getChildren(ZookeeperRegistry.java:258)
    ... 28 more

task execution and netty error log

[INFO] 2021-06-22 14:01:37.702 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[146] - task instance local execute path : /tmp/dolphinscheduler/exec/process/488610701312/488611487744_1/1/1
[INFO] 2021-06-22 14:01:37.718 org.apache.dolphinscheduler.common.utils.FileUtils:[145] - create dir success /tmp/dolphinscheduler/exec/process/488610701312/488611487744_1/1/1
[INFO] 2021-06-22 14:01:37.720  - [taskAppId=TASK-488611487744_1-1-1]:[145] - create dir success /tmp/dolphinscheduler/exec/process/488610701312/488611487744_1/1/1
[INFO] 2021-06-22 14:01:37.805 org.apache.dolphinscheduler.common.utils.OSUtils:[145] - create linux os user : ds
[INFO] 2021-06-22 14:01:37.806 org.apache.dolphinscheduler.common.utils.OSUtils:[145] - execute cmd : sudo useradd -g root
 ds
[INFO] 2021-06-22 14:01:38.015 org.apache.dolphinscheduler.common.utils.OSUtils:[145] - create user ds success
[ERROR] 2021-06-22 14:01:38.102 org.apache.dolphinscheduler.remote.handler.NettyServerHandler:[131] - process msg Command [type=TASK_EXECUTE_REQUEST, opaque=1, bodyLen=1535] error
java.lang.IllegalStateException: org.springframework.context.annotation.AnnotationConfigApplicationContext@5b3f61ff has not been refreshed yet
    at org.springframework.context.support.AbstractApplicationContext.assertBeanFactoryActive(AbstractApplicationContext.java:1093)
    at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1123)
    at org.apache.dolphinscheduler.service.bean.SpringApplicationContext.getBean(SpringApplicationContext.java:44)
    at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.<init>(TaskExecuteThread.java:113)
    at org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor.process(TaskExecuteProcessor.java:180)
    at org.apache.dolphinscheduler.remote.handler.NettyServerHandler.lambda$processReceived$0(NettyServerHandler.java:129)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
14:01:55.951 [NettyServerWorkerThread_1] DEBUG io.netty.util.Recycler - -Dio.netty.recycler.maxCapacityPerThread: 4096
14:01:55.956 [NettyServerWorkerThread_1] DEBUG io.netty.util.Recycler - -Dio.netty.recycler.maxSharedCapacityFactor: 2
14:01:55.956 [NettyServerWorkerThread_1] DEBUG io.netty.util.Recycler - -Dio.netty.recycler.linkCapacity: 16
14:01:55.956 [NettyServerWorkerThread_1] DEBUG io.netty.util.Recycler - -Dio.netty.recycler.ratio: 8
14:01:55.956 [NettyServerWorkerThread_1] DEBUG io.netty.util.Recycler - -Dio.netty.recycler.delayedQueue.ratio: 8
14:01:56.005 [NettyServerWorkerThread_1] DEBUG io.netty.buffer.AbstractByteBuf - -Dio.netty.buffer.checkAccessible: true
14:01:56.005 [NettyServerWorkerThread_1] DEBUG io.netty.buffer.AbstractByteBuf - -Dio.netty.buffer.checkBounds: true
14:01:56.007 [NettyServerWorkerThread_1] DEBUG io.netty.util.ResourceLeakDetectorFactory - Loaded default ResourceLeakDetector: io.netty.util.ResourceLeakDetector@16db704e
14:01:57.088 [pool-2-thread-1] INFO org.apache.dolphinscheduler.server.log.LoggerRequestProcessor - received command : Command [type=ROLL_VIEW_LOG_REQUEST, opaque=1, bodyLen=115]
14:02:14.469 [pool-2-thread-2] INFO org.apache.dolphinscheduler.server.log.LoggerRequestProcessor - received command : Command [type=ROLL_VIEW_LOG_REQUEST, opaque=3, bodyLen=115]
[INFO] 2021-06-22 14:04:39.667 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[118] - received command : TaskExecuteRequestCommand{taskExecutionContext='{"taskInstanceId":2,"taskName":"test","firstSubmitTime":"2021-06-22 14:04:39","startTime":null,"taskType":"SHELL","host":null,"executePath":null,"logPath":null,"taskJson":null,"processId":0,"processDefineCode":488611487744,"processDefineVersion":1,"appIds":null,"processInstanceId":2,"scheduleTime":null,"globalParams":null,"executorId":1,"cmdTypeIfComplement":0,"tenantCode":"ds","queue":"default","projectCode":488610701312,"taskParams":"{\"resourceList\":[],\"localParams\":[],\"rawScript\":\"echo \\\"test\\\"\",\"conditionResult\":\"{\\\"successNode\\\":[\\\"\\\"],\\\"failedNode\\\":[\\\"\\\"]}\",\"dependence\":\"{}\"}","envFile":null,"definedParams":null,"taskAppId":null,"taskTimeoutStrategy":null,"taskTimeout":2147483647,"workerGroup":"default","delayTime":0,"currentExecutionStatus":null,"resources":{},"sqlTaskExecutionContext":{"warningGroupId":0,"connectionParams":null,"udfFuncTenantCodeMap":null},"dataxTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"dependenceTaskExecutionContext":null,"sqoopTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"procedureTaskExecutionContext":{"connectionParams":null}}'}
[INFO] 2021-06-22 14:04:39.684 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[146] - task instance local execute path : /tmp/dolphinscheduler/exec/process/488610701312/488611487744_1/2/2
[INFO] 2021-06-22 14:04:39.686 org.apache.dolphinscheduler.common.utils.FileUtils:[145] - create dir success /tmp/dolphinscheduler/exec/process/488610701312/488611487744_1/2/2
[INFO] 2021-06-22 14:04:39.688  - [taskAppId=TASK-488611487744_1-2-2]:[145] - create dir success /tmp/dolphinscheduler/exec/process/488610701312/488611487744_1/2/2
[ERROR] 2021-06-22 14:04:39.710 org.apache.dolphinscheduler.remote.handler.NettyServerHandler:[131] - process msg Command [type=TASK_EXECUTE_REQUEST, opaque=34, bodyLen=1535] error
java.lang.IllegalStateException: org.springframework.context.annotation.AnnotationConfigApplicationContext@5b3f61ff has not been refreshed yet
    at org.springframework.context.support.AbstractApplicationContext.assertBeanFactoryActive(AbstractApplicationContext.java:1093)
    at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1123)
    at org.apache.dolphinscheduler.service.bean.SpringApplicationContext.getBean(SpringApplicationContext.java:44)
    at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.<init>(TaskExecuteThread.java:113)
    at org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor.process(TaskExecuteProcessor.java:180)
    at org.apache.dolphinscheduler.remote.handler.NettyServerHandler.lambda$processReceived$0(NettyServerHandler.java:129)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Which version of DolphinScheduler: dev (d64e3cb)

CalvinKirs commented 3 years ago

Hi, I want to confirm: was the master not yet started at this point?

chengshiwen commented 3 years ago

The master was started. Both the process instance and the task instance are still running, but their state has not been updated.

CalvinKirs commented 3 years ago

The exception is NoNode for the dead-servers path (/dolphinscheduler/dead-servers). This node is created only during master startup, so if the master starts after the worker, the worker hits this error. We may need to treat these nodes as registry properties, i.e. every server that connects to the registry creates them by default (not just the master).

However, I'm not yet sure that this is the actual cause.
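
The idea of letting every server create the node on startup could be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryRegistry`, `ensureNode` and `DeadServerNodeSketch` are hypothetical names and an in-memory stand-in, not the DolphinScheduler Registry SPI or the ZooKeeper plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DeadServerNodeSketch {

    /** Hypothetical in-memory stand-in for a registry; not the real Registry SPI. */
    static class InMemoryRegistry {
        // Maps a node path to the names of its direct children.
        private final Map<String, List<String>> nodes = new TreeMap<>();

        boolean exists(String path) {
            return nodes.containsKey(path);
        }

        /** Creates the node if it is absent; calling it again changes nothing. */
        void createPersistent(String path) {
            nodes.putIfAbsent(path, new ArrayList<>());
        }

        /** Mimics the failure in the logs: reading children of a missing node throws. */
        List<String> getChildren(String path) {
            List<String> children = nodes.get(path);
            if (children == null) {
                throw new IllegalStateException("NoNode for " + path);
            }
            return children;
        }
    }

    // Path taken from the stack trace in this issue.
    static final String DEAD_SERVERS = "/dolphinscheduler/dead-servers";

    /** Intended to be called by every server type on startup, so start order stops mattering. */
    static void ensureNode(InMemoryRegistry registry, String path) {
        if (!registry.exists(path)) {
            registry.createPersistent(path);
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        // Worker starts first: create the node before reading its children.
        ensureNode(registry, DEAD_SERVERS);
        System.out.println(registry.getChildren(DEAD_SERVERS));
        // Master starts later: the same call finds the node and does nothing.
        ensureNode(registry, DEAD_SERVERS);
    }
}
```

Against a real ZooKeeper, a check-then-create like this can race when two servers start at the same time, so an actual fix would probably also need to tolerate the node already existing.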

As for the monitoring page in the UI, the backend currently returns no data for it; we are discussing on the mailing list whether to remove it.