Open · YANGJINJUE opened this issue 3 years ago
What's your ZookeeperConfiguration?
We set the retry policy org.apache.curator.retry.ExponentialBackoffRetry when initializing the Curator client. If the retry count is exceeded, the instance will not recover.
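For reference, a minimal sketch of how those retry settings are usually supplied in ElasticJob-Lite 3.0.0-alpha; the server list, namespace, and concrete values here are only illustrative:

```java
import org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperConfiguration;
import org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperRegistryCenter;

public final class RegistryCenterFactory {

    public static ZookeeperRegistryCenter create() {
        // Illustrative server list and namespace.
        ZookeeperConfiguration config = new ZookeeperConfiguration("zk1:2181,zk2:2181,zk3:2181", "elasticjob-demo");
        // These values feed Curator's ExponentialBackoffRetry; once maxRetries is
        // exhausted the client stops retrying and the instance will not recover.
        config.setBaseSleepTimeMilliseconds(1000);
        config.setMaxSleepTimeMilliseconds(3000);
        config.setMaxRetries(3);
        config.setSessionTimeoutMilliseconds(6000);
        ZookeeperRegistryCenter regCenter = new ZookeeperRegistryCenter(config);
        regCenter.init();
        return regCenter;
    }
}
```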
The code schedulerFacade.shutdownInstance() may execute when the ZooKeeper cluster session expires.
We also have the same problem: when the ZooKeeper connection is not stable, some of the elastic jobs may shut themselves down.
2021-05-12 11:07:12,276 INFO [userdemo-infra] [main] [org.quartz.core.QuartzScheduler:666] - trace[] Scheduler quartzScheduler_$_NON_CLUSTERED shutting down.
2021-05-12 11:07:12,276 INFO [userdemo-infra] [main] [org.quartz.core.QuartzScheduler:585] - trace[] Scheduler quartzScheduler_$_NON_CLUSTERED paused.
2021-05-12 11:07:12,280 INFO [userdemo-infra] [main] [org.quartz.core.QuartzScheduler:740] - trace[] Scheduler quartzScheduler_$_NON_CLUSTERED shutdown complete.
Regarding "the code schedulerFacade.shutdownInstance() may execute when the ZooKeeper cluster session expires": @YANGJINJUE @zewade @TeslaCN, has this problem been solved? We also have the same problem.
20:32:17.898 [Curator-ConnectionStateManager-0] INFO com.dangdang.ddframe.job.lite.internal.listener.RegistryCenterConnectionStateListener - jobName:myJob start pauseJob
20:32:17.898 [nioEventLoopGroup-4-1] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x10157a410ab0036, packet:: clientPath:null serverPath:null finished:false header:: 2,3 replyHeader:: 2,3571,-101 request:: '/myJob/myJob/instances/10.30.65.185@-@22420,F response::
20:32:17.898 [Curator-ConnectionStateManager-0] INFO com.dangdang.ddframe.job.lite.internal.listener.RegistryCenterConnectionStateListener - jobName:myJob end pauseJob
20:32:17.898 [myJob_QuartzSchedulerThread] DEBUG org.quartz.core.QuartzSchedulerThread - batch acquisition of 0 triggers
20:32:17.898 [Curator-TreeCache-0] INFO com.dangdang.ddframe.job.lite.internal.instance.ShutdownListenerManager - start shutdownInstance jobName:myJob, path: /myJob/instances/10.30.65.185@-@22420, data:
20:32:17.898 [Curator-ConnectionStateManager-0] DEBUG org.apache.curator.framework.recipes.cache.TreeCache - publishEvent: TreeCacheEvent{type=CONNECTION_LOST, data=null}
20:32:17.899 [Curator-ConnectionStateManager-0] INFO com.dangdang.ddframe.job.lite.internal.listener.RegistryCenterConnectionStateListener - jobName:myJob client state changed to LOST
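The ordering in these logs (the instance node removal is handled on the Curator-TreeCache-0 thread while the connection state listener is still processing LOST) can also be observed with plain Curator. A minimal sketch, assuming a local ZooKeeper at localhost:2181 and illustrative paths; this is not ElasticJob code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.TreeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class EphemeralRemovalRaceDemo {

    public static void main(final String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Connection state thread: prints SUSPENDED / LOST / RECONNECTED, analogous to
        // RegistryCenterConnectionStateListener in the logs above.
        client.getConnectionStateListenable().addListener(
                (cf, newState) -> System.out.println("connection state: " + newState));

        // TreeCache thread: when the session expires, the ephemeral node is deleted and
        // the NODE_REMOVED event may be delivered after the client has reconnected.
        TreeCache cache = new TreeCache(client, "/myJob/instances");
        cache.getListenable().addListener(
                (cf, event) -> System.out.println("tree cache event: " + event.getType()));
        cache.start();

        Thread.sleep(Long.MAX_VALUE);
    }
}
```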
Adding breakpoints in org.apache.zookeeper.ClientCnxn.SendThread#sendPing to simulate session expiration and trigger deletion of the ephemeral node shows that this problem can recur. @TeslaCN
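As an alternative to debugger breakpoints, the session can also be expired programmatically. A minimal sketch, assuming the curator-test artifact (org.apache.curator.test.KillSession) is on the classpath:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.test.KillSession;

public final class SessionExpirationSimulator {

    // Forces the current ZooKeeper session to expire so the ephemeral instance node
    // under /<jobName>/instances is deleted, without attaching a debugger.
    public static void expireSession(final CuratorFramework client) throws Exception {
        // Newer Curator versions provide this single-argument overload;
        // older ones also require the connect string as a second argument.
        KillSession.kill(client.getZookeeperClient().getZooKeeper());
    }
}
```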
ElasticJob version: 3.0.0-alpha
Project: ElasticJob-Lite
Expected behavior
The job should recover when the ZooKeeper cluster recovers.
Actual behavior
A job shuts itself down when the ZooKeeper cluster recovers.
Reason analysis (if you can)
I suspect the class ShutdownListenerManager has a bug at line 57:
protected void dataChanged(final String path, final Type eventType, final String data) {
    if (!JobRegistry.getInstance().isShutdown(jobName)
            && !JobRegistry.getInstance().getJobScheduleController(jobName).isPaused()
            && isRemoveInstance(path, eventType)
            && !isReconnectedRegistryCenter()) {
        schedulerFacade.shutdownInstance();
    }
}
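For illustration only (not the project's actual fix): one defensive variant would re-check the registry before shutting down, so that a remove event caused only by a temporarily expired session does not kill the scheduler. The helper instanceStillMissingInRegistry() below is hypothetical:

```java
// Illustrative sketch only; not ElasticJob's actual fix.
// instanceStillMissingInRegistry() is a hypothetical helper that would re-read
// the instance node directly from ZooKeeper after the session is re-established,
// instead of trusting a possibly stale TreeCache NODE_REMOVED event.
protected void dataChanged(final String path, final Type eventType, final String data) {
    if (!JobRegistry.getInstance().isShutdown(jobName)
            && !JobRegistry.getInstance().getJobScheduleController(jobName).isPaused()
            && isRemoveInstance(path, eventType)
            && !isReconnectedRegistryCenter()
            && instanceStillMissingInRegistry(path)) {
        schedulerFacade.shutdownInstance();
    }
}
```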
Steps to reproduce the behavior.
Repeatedly kill the ZooKeeper cluster and let it recover; a kill-and-recover sketch follows.
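A minimal sketch of such a kill-and-recover loop, assuming Curator's curator-test TestingCluster is an acceptable stand-in for a real three-node ensemble and that the job instance's session timeout is shorter than the sleeps below:

```java
import org.apache.curator.test.InstanceSpec;
import org.apache.curator.test.TestingCluster;

public final class ZookeeperClusterChaos {

    public static void main(final String[] args) throws Exception {
        // Start an in-process 3-node ZooKeeper ensemble.
        try (TestingCluster cluster = new TestingCluster(3)) {
            cluster.start();
            System.out.println("Point the job instance at: " + cluster.getConnectString());
            // Repeatedly kill and restart every server so client sessions expire and recover.
            for (int round = 0; round < 5; round++) {
                for (InstanceSpec server : cluster.getInstances()) {
                    cluster.killServer(server);
                }
                Thread.sleep(10_000L); // longer than the job instance's session timeout
                for (InstanceSpec server : cluster.getInstances()) {
                    cluster.restartServer(server);
                }
                Thread.sleep(10_000L); // give the job instance time to reconnect
            }
        }
    }
}
```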
Example code to reproduce this issue (such as a GitHub link).