archiver-appliance / epicsarchiverap

This is an implementation of an archiver for EPICS control systems that aims to archive millions of PVs.

Node fails to start when cluster is used but not for single instance #80

Closed DanielALS closed 5 years ago

DanielALS commented 5 years ago

I'm trying to add another AA instance on a host that will be part of a cluster of AA's. The original two appliances work fine and talk to each other.

I performed the new install using the "single-machine" install script, with a modified appliances.xml which contains all other instances.

Any help troubleshooting the addition of a new cluster instance is appreciated. There might also be an opportunity to include more details in the exception message.

Cheers

The following exception shows up in my arch.log file a few minutes after starting the appliance via the start script.

10055 [Startup executor] INFO config.org.epics.archiverappliance.config.DefaultConfigService - Post startup for MGMT
10407 [Startup executor] INFO config.org.epics.archiverappliance.config.DefaultConfigService - Setting my cluster port base to 16670 and using interface X.X.196.50 # redacted the real IP
312753 [Startup executor] ERROR org.epics.archiverappliance.mgmt.MgmtPostStartup - Exception running post startup on the management app
org.epics.archiverappliance.config.exception.ConfigException: Exception adding member to cluster
    at org.epics.archiverappliance.config.DefaultConfigService.postStartup(DefaultConfigService.java:530)
    at org.epics.archiverappliance.mgmt.MgmtPostStartup.run(MgmtPostStartup.java:44)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Node failed to start!
    at com.hazelcast.instance.HazelcastInstanceImpl.<init>(HazelcastInstanceImpl.java:140)
    at com.hazelcast.instance.HazelcastInstanceFactory.constructHazelcastInstance(HazelcastInstanceFactory.java:196)
    at com.hazelcast.instance.HazelcastInstanceFactory.newHazelcastInstance(HazelcastInstanceFactory.java:175)
    at com.hazelcast.instance.HazelcastInstanceFactory.newHazelcastInstance(HazelcastInstanceFactory.java:125)
    at com.hazelcast.core.Hazelcast.newHazelcastInstance(Hazelcast.java:57)
    at org.epics.archiverappliance.config.DefaultConfigService.postStartup(DefaultConfigService.java:528)
    ... 8 more
312756 [Startup executor] INFO config.org.epics.archiverappliance.config.DefaultConfigService - Webapp is not in correct state for postStartup MGMT. It is in POST_STARTUP_RUNNING
312756 [Startup executor] INFO config.org.epics.archiverappliance.mgmt.MgmtPostStartup - Finished post startup for the mgmt webapp
312756 [Startup executor] INFO config.org.epics.archiverappliance.config.DefaultConfigService - Webapp is not in correct state for postStartup MGMT. It is in POST_STARTUP_RUNNING
312756 [Startup executor] INFO config.org.epics.archiverappliance.mgmt.MgmtPostStartup - Finished post startup for the mgmt webapp

slacmshankar commented 5 years ago

Can you make sure there are no port conflicts in the appliances.xml? If possible, please attach a copy of your appliances.xml.
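One way to rule out conflicts inside the file itself, without posting it, is a small script that looks for two appliances claiming the same cluster address. This is only a sketch: it assumes each `<appliance>` entry carries an `<identity>` and a `<cluster_inetport>` child in `host:port` form, and the function name `find_cluster_port_conflicts` is made up for illustration. Adjust the tag names if your appliances.xml differs.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_cluster_port_conflicts(xml_text):
    """Map each cluster host:port claimed by more than one appliance to its claimants.

    Assumes <appliance> elements with <identity> and <cluster_inetport> children;
    these tag names are an assumption about the appliances.xml layout.
    """
    root = ET.fromstring(xml_text)
    owners = defaultdict(list)
    for appliance in root.iter("appliance"):
        identity = (appliance.findtext("identity") or "?").strip()
        inetport = (appliance.findtext("cluster_inetport") or "").strip()
        if inetport:
            owners[inetport].append(identity)
    # Keep only the addresses that appear under more than one identity.
    return {addr: ids for addr, ids in owners.items() if len(ids) > 1}


if __name__ == "__main__":
    with open("appliances.xml", encoding="utf-8") as f:
        for addr, ids in find_cluster_port_conflicts(f.read()).items():
            print(f"{addr} is claimed by: {', '.join(ids)}")
```

An empty result only says the file does not contradict itself; it does not show whether some other process on the host already holds one of those ports.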

DanielALS commented 5 years ago

Thanks, I'm going to have our network people check for conflicts. I don't feel comfortable posting the IP addresses here. I did diff the potentially problematic xml against the production copy, and they are identical (which is good).

Assuming I track down the actual cause, would it be possible to come up with a more detailed exception message? I'll try my hand at a PR, and I won't feel bad if you reject it.

Cheers

slacmshankar commented 5 years ago

No worries. Make sure the ports do not conflict with anything else. There is not much information as to why Hazelcast (Hz) did not start, but most of the time this has to do with port conflicts and the like.
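A quick local check for this kind of conflict is to try binding the cluster port yourself while the appliance is stopped. The sketch below is not part of the archiver appliance; `can_bind` is a hypothetical helper, and the host is a placeholder for the interface the appliance reports in arch.log. The port 16670 is the cluster port base shown in the log above.

```python
import socket


def can_bind(host, port):
    """Return True if a TCP socket can be bound to host:port right now.

    Run this with the appliance stopped; otherwise the appliance itself
    holds the port and the check reports a conflict.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or "cannot assign requested address".
            return False
        return True


if __name__ == "__main__":
    host, port = "127.0.0.1", 16670  # placeholder host; replace with the real interface
    state = "free" if can_bind(host, port) else "NOT available"
    print(f"{host}:{port} is {state}")
```

A failed bind covers both a port that is already taken and an interface address that does not exist on this host, and either would be worth knowing about before starting the node.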

DanielALS commented 5 years ago

I can't find any port conflicts. The host I'm running my appliance on also runs Phoebus. Do clustered appliances need to use the exact same snapshot? That might be an issue.

slacmshankar commented 5 years ago

"Do clustered appliances need to use the exact same snapshot?"

I would say yes. I tend to upgrade the underlying clustering jars a little more frequently than the rest, and the internal protocols (specific to clustering) do change between versions of the jar. So I would lean towards using the same version across the cluster.
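To compare the clustering jars across appliances, one option is to list the jar file names from each install and check that the Hazelcast version matches everywhere. The sketch below assumes the jar follows the usual `hazelcast-<version>.jar` naming; the helper names, appliance names, and version numbers are all hypothetical, and how you collect the file names from each host is left open.

```python
import re

# Matches e.g. "hazelcast-3.9.1.jar" but not "hazelcast-client-3.9.jar".
# The naming pattern is an assumption about how the jar is packaged.
HAZELCAST_JAR = re.compile(r"^hazelcast-(\d+(?:\.\d+)*(?:-[A-Za-z0-9.]+)?)\.jar$")


def hazelcast_version(jar_names):
    """Return the version of the first core Hazelcast jar in jar_names, or None."""
    for name in jar_names:
        match = HAZELCAST_JAR.match(name)
        if match:
            return match.group(1)
    return None


def mismatched_versions(jars_by_appliance):
    """Return {appliance: version} if the versions differ, else an empty dict."""
    versions = {name: hazelcast_version(jars) for name, jars in jars_by_appliance.items()}
    return versions if len(set(versions.values())) > 1 else {}


if __name__ == "__main__":
    # Hypothetical listings; in practice, collect the jar names from each install.
    listings = {
        "appliance0": ["hazelcast-3.8.jar"],
        "appliance1": ["hazelcast-3.8.jar"],
        "appliance2": ["hazelcast-3.9.1.jar"],
    }
    print(mismatched_versions(listings) or "all appliances agree")
```

Comparing the whole snapshot rather than a single jar is the safer check, since the thread only establishes that matching everything works.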

DanielALS commented 5 years ago

I installed the exact same snapshot and Tomcat versions. Clustering now works.

I didn't test matching the AA snapshot and matching the Tomcat version separately, though. I suppose it's somewhat obvious as a best practice, but if you forget to check, it could be a gotcha.