apache / shardingsphere-elasticjob

Distributed scheduled job
Apache License 2.0

ZooKeeper cluster killed and recovered, the job shuts itself down #1857

Open YANGJINJUE opened 3 years ago

YANGJINJUE commented 3 years ago

Version used: 3.0.0-alpha

project: ElasticJob-Lite

Expected behavior

The job should recover when the ZooKeeper cluster recovers.

Actual behavior

A job shuts itself down when the ZooKeeper cluster recovers.

Reason analyze (If you can)

I suspect the class ShutdownListenerManager has a bug at line 57:

    protected void dataChanged(final String path, final Type eventType, final String data) {
        if (!JobRegistry.getInstance().isShutdown(jobName)
                && !JobRegistry.getInstance().getJobScheduleController(jobName).isPaused()
                && isRemoveInstance(path, eventType)
                && !isReconnectedRegistryCenter()) {
            schedulerFacade.shutdownInstance();
        }
    }
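To make the suspected failure mode concrete, here is a minimal sketch that models the four-part condition quoted above as a pure function. The class `ShutdownGuardSketch` and its boolean parameters are hypothetical stand-ins for the calls in `dataChanged`; this is not ElasticJob code, only one reading of the guard.

```java
/** Hypothetical model of the guard in ShutdownListenerManager#dataChanged; not ElasticJob code. */
final class ShutdownGuardSketch {

    /**
     * Mirrors the quoted condition: shut down only if the job is not already shut down,
     * the scheduler is not paused, this instance's node was removed, and the registry
     * center is not considered reconnected.
     */
    static boolean shouldShutdown(boolean jobShutdown, boolean schedulerPaused,
                                  boolean ownInstanceNodeRemoved, boolean reconnectedRegistryCenter) {
        return !jobShutdown && !schedulerPaused && ownInstanceNodeRemoved && !reconnectedRegistryCenter;
    }

    public static void main(String[] args) {
        // The instance node is removed while the scheduler is running and no reconnect is flagged.
        // The inputs do not say WHY the node was removed, so an expired session looks the same
        // as a deliberate removal.
        System.out.println(shouldShutdown(false, false, true, false)); // true

        // Same removal, but the scheduler was already paused when the event was handled.
        System.out.println(shouldShutdown(false, true, true, false)); // false

        // Same removal, but the reconnect flag was already set when the event was handled.
        System.out.println(shouldShutdown(false, false, true, true)); // false
    }
}
```

Under this reading, whether a node removal caused by session expiration leads to a shutdown depends on whether the pause or the reconnect flag is visible at the moment the removal event is processed.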

Steps to reproduce the behavior.

Kill the ZooKeeper cluster and let it recover, several times in a row.

Example codes for reproduce this issue (such as a github link).

TeslaCN commented 3 years ago

What's your ZookeeperConfiguration? We set the retry policy org.apache.curator.retry.ExponentialBackoffRetry when initializing the Curator client. If the retry limit is exceeded, the instance will not recover.
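As a rough illustration of how short the retry window can be, the sketch below estimates the worst-case total sleep time of an exponential backoff policy. It assumes the per-retry sleep is `baseSleepMs` times a random multiplier of at most `2^(retry+1) - 1`, capped at `maxSleepMs`, which is how Curator's ExponentialBackoffRetry is understood to behave; the formula and the sample values (1000 ms base, 3000 ms cap, 3 retries) are assumptions to check against your own ZookeeperConfiguration, and `BackoffWindowSketch` is a hypothetical helper, not part of Curator or ElasticJob.

```java
/** Hypothetical helper; models an exponential backoff policy, does not call Curator. */
final class BackoffWindowSketch {

    /**
     * Upper bound on the total time spent sleeping between retries, assuming each retry
     * sleeps at most baseSleepMs * (2^(retry+1) - 1), capped at maxSleepMs.
     */
    static long worstCaseSleepMs(int baseSleepMs, int maxSleepMs, int maxRetries) {
        long total = 0;
        for (int retry = 0; retry < maxRetries; retry++) {
            long largestMultiplier = (1L << (retry + 1)) - 1;
            total += Math.min((long) maxSleepMs, baseSleepMs * largestMultiplier);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed sample values: 1000 ms base sleep, 3000 ms cap, 3 retries.
        // Per-retry bounds are 1000, 3000 and 3000 ms.
        System.out.println(worstCaseSleepMs(1000, 3000, 3)); // 7000
    }
}
```

With values like these, an outage longer than a few seconds would exhaust the retries, so the retry settings are worth checking before digging into the listener logic.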

YANGJINJUE commented 3 years ago

> What's your ZookeeperConfiguration? We set the retry policy org.apache.curator.retry.ExponentialBackoffRetry when initializing the Curator client. If the retry limit is exceeded, the instance will not recover.

The code schedulerFacade.shutdownInstance() may execute when the ZooKeeper cluster session expires.

zewade commented 3 years ago

We have the same problem: when the ZooKeeper connection is unstable, some of the elastic jobs may shut themselves down.

2021-05-12 11:07:12,276 INFO [userdemo-infra] [main] [org.quartz.core.QuartzScheduler:666] - trace[] Scheduler quartzScheduler_$_NON_CLUSTERED shutting down.
2021-05-12 11:07:12,276 INFO [userdemo-infra] [main] [org.quartz.core.QuartzScheduler:585] - trace[] Scheduler quartzScheduler_$_NON_CLUSTERED paused.
2021-05-12 11:07:12,280 INFO [userdemo-infra] [main] [org.quartz.core.QuartzScheduler:740] - trace[] Scheduler quartzScheduler_$_NON_CLUSTERED shutdown complete.

15168326318 commented 2 years ago

> the code schedulerFacade.shutdownInstance() may execute when zookeeper cluster session expire

@YANGJINJUE @zewade @TeslaCN Has this problem been solved? We have the same problem.

ExploreHeart commented 9 months ago

I added logs to RegistryCenterConnectionStateListener.java and InstanceShutdownStatusJobListener.java.

A shutdown may be triggered when the ephemeral (temporary) node is deleted because the session expired.

20:32:17.898 [Curator-ConnectionStateManager-0] INFO com.dangdang.ddframe.job.lite.internal.listener.RegistryCenterConnectionStateListener - jobName:myJob start pauseJob
20:32:17.898 [nioEventLoopGroup-4-1] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x10157a410ab0036, packet:: clientPath:null serverPath:null finished:false header:: 2,3 replyHeader:: 2,3571,-101 request:: '/myJob/myJob/instances/10.30.65.185@-@22420,F response::
20:32:17.898 [Curator-ConnectionStateManager-0] INFO com.dangdang.ddframe.job.lite.internal.listener.RegistryCenterConnectionStateListener - jobName:myJob end pauseJob
20:32:17.898 [myJob_QuartzSchedulerThread] DEBUG org.quartz.core.QuartzSchedulerThread - batch acquisition of 0 triggers
20:32:17.898 [Curator-TreeCache-0] INFO com.dangdang.ddframe.job.lite.internal.instance.ShutdownListenerManager - start shutdownInstance jobName:myJob, path: /myJob/instances/10.30.65.185@-@22420, data:
20:32:17.898 [Curator-ConnectionStateManager-0] DEBUG org.apache.curator.framework.recipes.cache.TreeCache - publishEvent: TreeCacheEvent{type=CONNECTION_LOST, data=null}
20:32:17.899 [Curator-ConnectionStateManager-0] INFO com.dangdang.ddframe.job.lite.internal.listener.RegistryCenterConnectionStateListener - jobName:myJob client state changed to LOST

Steps to reproduce the behavior

Add a breakpoint in org.apache.zookeeper.ClientCnxn.SendThread#sendPing to simulate session expiration and trigger deletion of the ephemeral node. The problem does not occur every time, but it can be reproduced this way. @TeslaCN