fabric8io / fabric8

fabric8 is an open source microservices platform based on Docker, Kubernetes and Jenkins
http://fabric8.io/

AMQ master election intermittently not populating address for the master #1332

Open askannon opened 10 years ago

askannon commented 10 years ago

I have a 5-node replicated MQ cluster. Everything seems to work fine, but when I stop/start the master nodes to trigger failover, the address in the ZK registry under /fabric/registry/clusters/fusemq-replication-elections/dfwx1/000000000123 sometimes shows this:

id broker1 container dfwx1-broker1-3 address null position -1 weight 1 elected 0000000123

And on the Camel AMQ endpoints I get this: org.apache.activemq.transport.failover.FailoverTransport: Failed to connect to [] after: 10 attempt(s) continuing to retry.

To fix this I have to stop all brokers in the cluster and restart them to get the address field populated for the master again.
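
For anyone who wants to detect this state without browsing the registry by hand, here is a minimal sketch that checks a member record for a missing address. It assumes the record arrives in the flattened "key value key value" form pasted above; the registry itself may well store it in another serialization, so the parser would need adapting. The function names `parse_member_record` and `has_usable_address` are hypothetical helpers, not part of fabric8.

```python
def parse_member_record(text):
    """Parse a flattened 'key value key value ...' record into a dict.

    Assumption: whitespace-separated key/value pairs, as pasted in this
    thread. The real registry node may use a different serialization.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (key/value pairs)")
    return dict(zip(tokens[0::2], tokens[1::2]))


def has_usable_address(record):
    """Return True when the record advertises a non-null address."""
    address = record.get("address")
    return address is not None and address.lower() != "null"


# The record reported in this issue: the elected master has no address.
record = parse_member_record(
    "id broker1 container dfwx1-broker1-3 address null "
    "position -1 weight 1 elected 0000000123"
)
print(has_usable_address(record))  # prints False
```

A check like this could run after each failover test to flag the broken state before clients start retrying against an empty URL list.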

davsclaus commented 10 years ago

What version of fabric is this?

askannon commented 10 years ago

fabric8-karaf-1.0.0.redhat-378

askannon commented 10 years ago

When the master address is not populated, the DiscoveryTransport does not add the new broker URL. Here is the failover log that works:

2014-05-09 10:20:52,807 | WARN  | .164:58862@54577 | FailoverTransport                | sport.failover.FailoverTransport  260 | 100 - org.apache.activemq.activemq-osgi - 5.9.0.redhat-610378 | Transport (tcp://esb1-4-vl.dfwx/10.20.2.164:58862@54577) failed, reason:  , attempting to automatically reconnect
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)[:1.7.0_51]
        at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:258)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:221)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:213)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at java.lang.Thread.run(Thread.java:744)[:1.7.0_51]
2014-05-09 10:20:56,020 | INFO  | ZooKeeperGroup-0 | DiscoveryTransport               | ort.discovery.DiscoveryTransport   78 | 100 - org.apache.activemq.activemq-osgi - 5.9.0.redhat-610378 | Adding new broker connection URL: tcp://esb1-5-vl.dfwx:54542
2014-05-09 10:21:03,116 | INFO  | ActiveMQ Task-8  | FailoverTransport                | sport.failover.FailoverTransport 1057 | 100 - org.apache.activemq.activemq-osgi - 5.9.0.redhat-610378 | Successfully reconnected to tcp://esb1-5-vl.dfwx:54542

And here is the next one, which no longer works:

2014-05-09 10:21:36,167 | WARN  | .165:54542@56440 | FailoverTransport                | sport.failover.FailoverTransport  260 | 100 - org.apache.activemq.activemq-osgi - 5.9.0.redhat-610378 | Transport (tcp://esb1-5-vl.dfwx/10.20.2.165:54542@56440) failed, reason:  , attempting to automatically reconnect
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)[:1.7.0_51]
        at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:258)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:221)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:213)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)[100:org.apache.activemq.activemq-osgi:5.9.0.redhat-610378]
        at java.lang.Thread.run(Thread.java:744)[:1.7.0_51]
2014-05-09 10:21:41,281 | WARN  | ActiveMQ Task-10 | FailoverTransport                | sport.failover.FailoverTransport 1109 | 100 - org.apache.activemq.activemq-osgi - 5.9.0.redhat-610378 | Failed to connect to [] after: 10 attempt(s) continuing to retry.
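
The difference between the two logs can be picked out mechanically. Below is a rough sketch that classifies the three message types seen above and flags the case where FailoverTransport retries against an empty URL list. The regular expressions are derived only from the log lines in this thread, and the comma-separated format for multiple URLs inside the brackets is an assumption; `classify` and `empty_list_failures` are hypothetical helper names.

```python
import re

# Patterns taken from the log messages quoted in this thread.
ADDED = re.compile(r"Adding new broker connection URL: (\S+)")
RECONNECTED = re.compile(r"Successfully reconnected to (\S+)")
FAILED = re.compile(r"Failed to connect to \[(.*?)\] after: (\d+) attempt\(s\)")


def classify(line):
    """Return (event, urls, attempts) for a known message, else None."""
    match = FAILED.search(line)
    if match:
        # Assumption: multiple URLs, if present, are comma-separated.
        urls = [u.strip() for u in match.group(1).split(",") if u.strip()]
        return ("failed", urls, int(match.group(2)))
    match = ADDED.search(line)
    if match:
        return ("added", [match.group(1)], None)
    match = RECONNECTED.search(line)
    if match:
        return ("reconnected", [match.group(1)], None)
    return None


def empty_list_failures(lines):
    """Return the lines where the failover URL list was empty."""
    hits = []
    for line in lines:
        event = classify(line)
        if event and event[0] == "failed" and not event[1]:
            hits.append(line)
    return hits
```

Running `empty_list_failures` over a container log should return only lines like the last one above, where no broker URL was ever added after the previous master went away.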

jstrachan commented 10 years ago

btw just a heads up, fabric8-karaf-1.0.0.redhat-379 is the GA version of Fuse 6.1.

StanClowes commented 10 years ago

Hi,

Update on this: we are seeing this issue with 379. It tends to happen when we have a set of bundles all connected to AMQ and the features repository URL is updated to roll out a new build, e.g.

At this point the contexts/routes are shut down, new versions of the features are downloaded, and then the bundles are restarted.

The container routes then try to reconnect to AMQ and one of the following errors occurs (depending on whether we are using the kaha or the replicated leveldb configuration):

org.apache.activemq.activemq-osgi Failed to connect to [] after: 20 attempt(s)

or

org.apache.activemq.activemq-osgi - 5.9.0.redhat-610379 | Failed to connect to [nio://10.0.2.15:61616] after: 20 attempt(s)

The following is also seen in the AMQ log of the master broker when the container contexts/routes shut down:

Transport Connection to: tcp://10.0.2.15:56783 failed: java.io.IOException: Broker BrokerService[broker1] is being stopped

Also of note: according to JMX there is still a master AMQ node, but the ZooKeeper registry has no replication details.

When we have kaha rather than leveldb configured, we see the null address as above.

regards Stan