Intel-bigdata / SSM

Smart Storage Management for Big Data, a comprehensive hot/cold data optimized solution
Apache License 2.0

Restart active server causes itself dead occasionally #2230

Open lipppppp opened 3 years ago

lipppppp commented 3 years ago

After restarting the active server on ssm1, the service started normally. However, the node info page shows the status of ssm1 as dead, and cmdlets cannot run on ssm1. The problem is intermittent: it shows up after repeating the restart many times. When ssm1 stopped, some error messages appeared in the log. [5 screenshots attached]

lipppppp commented 3 years ago

In this case, ssm1 is still shown as dead after restarting the service on ssm1, and the log shows no problems. It only returns to normal after the active server is restarted. [screenshot attached]

PHILO-HE commented 3 years ago

I cannot reproduce this issue. You can try to debug it. I don't think the exception reported during shutdown matters. `HazelcastExecutorService#addMember` adds a newly started SSM server and delivers a message to `CmdletDispatcherHelper` for further handling, which may help with your debugging.
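For anyone following the debugging suggestion, the flow described above (a member-added event that ends in a message to the dispatcher helper) can be sketched roughly as below. This is not SSM's code and it does not use the Hazelcast API: `MembershipTracker`, `DispatcherHelper`, `NodeEvent`, and every method here are hypothetical stand-ins. It only shows why logging each delivered event makes it easy to check whether the "added" message for a restarted node ever arrived.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MembershipSketch {
    /** Hypothetical message type sent when a server joins or leaves. */
    enum NodeEvent { ADDED, REMOVED }

    /** Hypothetical stand-in for the dispatcher helper: tracks which nodes are usable. */
    static class DispatcherHelper {
        private final Map<String, Boolean> alive = new LinkedHashMap<>();
        private final List<String> log = new ArrayList<>();

        void onMessage(NodeEvent event, String nodeId) {
            // The last event seen for a node decides its status.
            alive.put(nodeId, event == NodeEvent.ADDED);
            // Keep a trace so a missing ADDED after a restart is visible.
            log.add(event + ":" + nodeId);
        }

        boolean isAlive(String nodeId) {
            return alive.getOrDefault(nodeId, false);
        }

        List<String> log() {
            return log;
        }
    }

    /** Hypothetical stand-in for the service that watches cluster membership. */
    static class MembershipTracker {
        private final DispatcherHelper helper;

        MembershipTracker(DispatcherHelper helper) {
            this.helper = helper;
        }

        void addMember(String nodeId) {
            helper.onMessage(NodeEvent.ADDED, nodeId);
        }

        void removeMember(String nodeId) {
            helper.onMessage(NodeEvent.REMOVED, nodeId);
        }
    }

    public static void main(String[] args) {
        DispatcherHelper helper = new DispatcherHelper();
        MembershipTracker tracker = new MembershipTracker(helper);

        tracker.addMember("ssm1");    // initial start
        tracker.removeMember("ssm1"); // server stopped
        tracker.addMember("ssm1");    // server restarted

        System.out.println(helper.log());
        System.out.println("ssm1 alive: " + helper.isAlive("ssm1"));
    }
}
```

In this toy model the node is alive only if the last event delivered for it was an add, so a trace that ends in a remove for ssm1 after a restart would point at the add path as the place to put a breakpoint.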

lipppppp commented 3 years ago

OK, I will try to debug this process. I also found that sometimes the state of the standby server is normal, but all tasks time out in that case when there is no agent node in the cluster. [2 screenshots attached]