aweber / rabbitmq-autocluster

This project is now maintained by the RabbitMQ Team, visit the official repo @
https://github.com/rabbitmq/rabbitmq-autocluster
BSD 3-Clause "New" or "Revised" License

Rabbitmq not clustered #16

Closed amitgilad3 closed 8 years ago

amitgilad3 commented 9 years ago

Hi, I managed to install rabbitmq and add this plugin. When I run everything it is registered with consul, but my rabbitmq is not clustered.

Example: If I add a user on one machine it is not added on other machines.

gmr commented 9 years ago

Sorry you're having problems. You've not given me much info to help you. 1) Which version? 2) If version 0.4.0, which backend? 3) Have you looked at the rabbitmq logs? Please provide the startup logs for both nodes.

amitgilad3 commented 9 years ago

Hi gmr,

  1. rabbit version is 3.5.4
  2. rabbitmq-autocluster 0.4.0 with consul backend
  3. the first rabbit is registered and works fine, but the second rabbit shows me the following error:

=INFO REPORT==== 9-Aug-2015::08:58:22 ===
autocluster: Node appears to be the first in the cluster

=ERROR REPORT==== 9-Aug-2015::08:56:37 ===
autocluster: HTTP Response (500) CheckID does not have associated TTL

=ERROR REPORT==== 9-Aug-2015::08:56:37 ===
autocluster: Error updating Consul health check: "500"

rabbitmq.config (same for every node):

    [{autocluster, [
      {consul_host, "localhost"},
      {consul_port, 8500},
      {consul_service, "rabbitmq-reportService"},
      {cluster_name, "reportService"}
    ]}].

gmr commented 9 years ago

What version of Consul are you using? That error would tell me that the API is not compatible.

amitgilad3 commented 9 years ago

consul version 0.5.2

gmr commented 9 years ago

I think the problem is with the consul_service value. Can you try it with just rabbitmq, or remove it from the config? The plugin adds a tag with the value of cluster_name to make sure the node isn't mixed in with others.

amitgilad3 commented 9 years ago

Hi gmr,

I just left this:

    [{autocluster, [
      {consul_host, "localhost"},
      {consul_port, 8500}
    ]}].

and only the first one to arrive works; all other nodes get the same error:

=ERROR REPORT==== 9-Aug-2015::20:33:16 ===
autocluster: HTTP Response (500) CheckID does not have associated TTL

=ERROR REPORT==== 9-Aug-2015::20:33:16 ===
autocluster: Error updating Consul health check: "500"

gmr commented 9 years ago

Anything in the consul logs? I'm running this release in production and I do not have any such issues. With that as your only config, you do not need any, fwiw. It seems like the health check being submitted for the node is not correct. Only other thing I can think to check would be erlang version, maybe something wrong with assumptions about the httpc/inets library.

amitgilad3 commented 9 years ago

First of all, I would just like to say thanks for the quick replies, and also that I think this plugin is awesome. I am sure that we will find the cause of the issue.

consul log: 2015/08/09 20:47:31 [ERR] http: Request /v1/agent/check/pass/service:rabbitmq, error: CheckID does not have associated TTL

the erlang version i am using is : 18.0

please let me know what version you are using

gmr commented 9 years ago

I'm using R18 as well. I'll add some debugging shortly after I do some yard work, but if you get the chance, can you set consul_service_ttl to 60 or CONSUL_SERVICE_TTL to 60 as an environment variable? It'd also be useful to see the response from curl http://localhost:8500//v1/health/service/rabbitmq

amitgilad3 commented 9 years ago

Hi gmr, i set consul_service_ttl to 60 and i did curl http://localhost:8500//v1/health/service/rabbitmq the result that i got was for the first rabbitmq that reached consul.

The issue is that if I create a cluster with 3 rabbits, only the first one to reach consul gets registered and the other 2 get the error:

=ERROR REPORT==== 9-Aug-2015::20:33:16 ===
autocluster: HTTP Response (500) CheckID does not have associated TTL

=ERROR REPORT==== 9-Aug-2015::20:33:16 ===
autocluster: Error updating Consul health check: "500"

Maybe I need to add some configuration to rabbitmq? (I just installed the rpm and added rabbitmq.config.)

here is the response from the curl request:

[
    {
        "Node": {
            "Node": "rabbitmqreportservice-172.31.17.237",
            "Address": "172.31.17.237"
        },
        "Service": {
            "ID": "rabbitmq",
            "Service": "rabbitmq",
            "Tags": null,
            "Address": "",
            "Port": 5672
        },
        "Checks": [
            {
                "Node": "rabbitmqreportservice-172.31.17.237",
                "CheckID": "service:rabbitmq",
                "Name": "Service 'rabbitmq' check",
                "Status": "passing",
                "Notes": "RabbitMQ Auto-Cluster Plugin TTL Check",
                "Output": "",
                "ServiceID": "rabbitmq",
                "ServiceName": "rabbitmq"
            },
            {
                "Node": "rabbitmqreportservice-172.31.17.237",
                "CheckID": "serfHealth",
                "Name": "Serf Health Status",
                "Status": "passing",
                "Notes": "",
                "Output": "Agent alive and reachable",
                "ServiceID": "",
                "ServiceName": ""
            }
        ]
    }
]

enjoy the yard work :)

gmr commented 9 years ago

I've just released 0.4.1 that adds configurable logging. Add autocluster to your rabbitmq.config like so:

    [{rabbit, [
      {log_levels, [{autocluster, debug}, {connection, info}]}
    ]}].

And you should get debug logging information from the plugin about what it's submitting and what the replies are; that should get us closer.

This is what I see when I talk to consul locally BTW:

curl -v "http://localhost:8500/v1/health/service/rabbitmq" | json_pp

* Connection #0 to host localhost left intact
[
   {
      "Checks" : [
         {
            "CheckID" : "service:rabbitmq",
            "ServiceID" : "rabbitmq",
            "ServiceName" : "rabbitmq",
            "Output" : "",
            "Status" : "passing",
            "Notes" : "RabbitMQ Auto-Cluster Plugin TTL Check",
            "Node" : "gmr-home.local",
            "Name" : "Service 'rabbitmq' check"
         },
         {
            "CheckID" : "serfHealth",
            "Name" : "Serf Health Status",
            "Output" : "Agent alive and reachable",
            "ServiceName" : "",
            "ServiceID" : "",
            "Status" : "passing",
            "Notes" : "",
            "Node" : "gmr-home.local"
         }
      ],
      "Service" : {
         "Tags" : null,
         "Address" : "",
         "ID" : "rabbitmq",
         "Service" : "rabbitmq",
         "Port" : 5672
      },
      "Node" : {
         "Node" : "gmr-home.local",
         "Address" : "192.168.2.2"
      }
   }
]
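For anyone comparing the two responses in this thread, the per-node check status can be pulled out of this JSON with a few lines of Python. This is only a sketch: `node_check_summary` is a hypothetical helper name (not part of the plugin or of Consul), and it assumes the response shape shown above.

```python
import json


def node_check_summary(payload):
    """Map each Consul node name to {CheckID: Status} for its checks.

    `payload` is the raw JSON text returned by
    /v1/health/service/<service>, as printed in this thread.
    """
    summary = {}
    for entry in json.loads(payload):
        node = entry["Node"]["Node"]
        summary[node] = {c["CheckID"]: c["Status"] for c in entry["Checks"]}
    return summary
```

If several RabbitMQ hosts are expected but only one node name comes back, the other nodes never registered a passing service check.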

And here's my log output:


=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: GET http://localhost:8500/v1/health/service/rabbitmq?passing

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Sun, 09 Aug 2015 22:51:36 GMT"},
                              {"content-length","2"},
                              {"content-type","application/json"},
                              {"x-consul-index","28"},
                              {"x-consul-knownleader","true"},
                              {"x-consul-lastcontact","0"}],
                             "[]"}}]

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Registering node with consul

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: POST http://localhost:8500/v1/agent/service/register ["{\"ID\":\"rabbitmq\",\"Name\":\"rabbitmq\",\"Port\":5672,\"Check\":{\"Notes\":\"RabbitMQ Auto-Cluster Plugin TTL Check\",\"TTL\":\"30s\"}}"]

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Sun, 09 Aug 2015 22:51:36 GMT"},
                              {"content-length","0"},
                              {"content-type","text/plain; charset=utf-8"}],
                             []}}]

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Registered node

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Node is only node in the cluster

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Node appears to be the first in the cluster

...

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
autocluster: Starting Consul Health Check TTL Timer

=INFO REPORT==== 9-Aug-2015::18:51:36 ===
Server startup complete; 1 plugins started.
 * autocluster

=INFO REPORT==== 9-Aug-2015::18:51:51 ===
autocluster: GET http://localhost:8500/v1/agent/check/pass/service%3Arabbitmq

=INFO REPORT==== 9-Aug-2015::18:51:51 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Sun, 09 Aug 2015 22:51:51 GMT"},
                              {"content-length","0"},
                              {"content-type","text/plain; charset=utf-8"}],
                             []}}]

=INFO REPORT==== 9-Aug-2015::18:52:06 ===
autocluster: GET http://localhost:8500/v1/agent/check/pass/service%3Arabbitmq

=INFO REPORT==== 9-Aug-2015::18:52:06 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Sun, 09 Aug 2015 22:52:06 GMT"},
                              {"content-length","0"},
                              {"content-type","text/plain; charset=utf-8"}],
                             []}}]

...
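The two requests visible in the log above (the service registration with a TTL check, then the periodic "pass" update) can be sketched in Python. This only builds the request body and URL; it does not talk to Consul. The function names are hypothetical, and the base URL assumes a local agent as in this thread.

```python
import json
from urllib.parse import quote

# Assumption: a local Consul agent, as used throughout this thread.
CONSUL_BASE = "http://localhost:8500"


def registration_body(service="rabbitmq", port=5672, ttl_seconds=30):
    """Build the JSON body POSTed to /v1/agent/service/register in the log."""
    body = {
        "ID": service,
        "Name": service,
        "Port": port,
        "Check": {
            "Notes": "RabbitMQ Auto-Cluster Plugin TTL Check",
            "TTL": f"{ttl_seconds}s",
        },
    }
    # Compact separators reproduce the exact string seen in the log.
    return json.dumps(body, separators=(",", ":"))


def check_pass_url(service="rabbitmq"):
    """Build the TTL 'pass' URL; the check ID is 'service:<ID>', URL-encoded."""
    check_id = f"service:{service}"
    return f"{CONSUL_BASE}/v1/agent/check/pass/{quote(check_id, safe='')}"
```

The "CheckID does not have associated TTL" error reported earlier is returned for the second request, which suggests the agent receiving it has no TTL check registered under that check ID.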

gmr commented 9 years ago

Oh and it'd be more useful to see the output of that Consul request on the failing nodes (if that wasn't on one of the failing nodes).

amitgilad3 commented 9 years ago

Hi gmr,

the output i showed you was from the failed node.

i attached 2 files :

  1. first_rabbit -contains log of first rabbit to go up - http://www.megafileupload.com/4Wmm/first_rabbit.txt
  2. second_rabbit -contains log of second rabbit to go up http://www.megafileupload.com/4Wmn/second_rabbit.txt

just download the files from the links

thanks :)

gmr commented 9 years ago

So it looks like it's something to do with your node names. Because of how RabbitMQ/Erlang deals with node names, both nodes end up named:

rabbit@rabbitmqreportservice-172

There is also another node registered in Consul, rabbit@rabbitbd-2i0lixhie4s2seq, which is returning health check info that is not related to autocluster, and the plugin is trying to cluster with it.

I'd make sure that each node has a reasonable FQDN, and it might be worth trying setting the RABBITMQ_USE_LONGNAME environment variable to true. Also, I'd remove all dead nodes from Consul. If rabbit@rabbitbd-2i0lixhie4s2seq is a valid node from another cluster, then I'd set the cluster name to something so it does not get returned in the results.
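To illustrate why different machines can end up with the same node name: with short names, Erlang keeps only the part of the hostname before the first dot. A minimal Python sketch of that truncation, using hostnames of the form seen in this thread. `short_node_name` is a hypothetical helper for illustration, not RabbitMQ or plugin code.

```python
def short_node_name(hostname, prefix="rabbit"):
    """Mimic short-name handling: keep only the host part before the first dot."""
    short_host = hostname.split(".", 1)[0]
    return f"{prefix}@{short_host}"


# Two distinct hosts whose names embed dotted IP addresses.
hosts = [
    "rabbitmqreportservice-172.31.17.237",
    "rabbitmqreportservice-172.31.19.56",
]
names = [short_node_name(h) for h in hosts]
```

Both entries in `names` come out as rabbit@rabbitmqreportservice-172, so the nodes cannot be told apart. Hostnames without dots before the unique part, or long names with real FQDNs, avoid the collision.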

amitgilad3 commented 9 years ago

Hi gmr,

I did what you said and now they are showing in consul.

But I am not sure that they are actually clustered.

If I create a user or queue on one node it is not created on the second node.

what should i do??

how can i verify that the rabbitmq is truly clustered?

gmr commented 9 years ago

If you're not installing the management UI plugin, you might want to activate that so you can access the web interface.

You can also use rabbitmqctl on one of the nodes to get the cluster status: rabbitmqctl cluster_status
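To check the result from a script rather than by eye, the running_nodes entry can be extracted from the command's output. A rough Python sketch, assuming the Erlang-term output format that RabbitMQ 3.5.x prints. `running_nodes` is a hypothetical helper, and the regular expression is a shortcut rather than a real Erlang term parser.

```python
import re


def running_nodes(cluster_status_output):
    """Return the node names listed under running_nodes, or [] if absent."""
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", cluster_status_output, re.DOTALL)
    if match is None:
        return []
    # Node names may be quoted atoms ('a@host') or bare atoms (a@host).
    return [
        name.strip().strip("'")
        for name in match.group(1).split(",")
        if name.strip()
    ]
```

A clustered pair should list two names here; a single entry means the node is running on its own.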

amitgilad3 commented 9 years ago

I just checked it and this is the result:

Cluster status of node 'rabbitmqreportservice172-31-19-56@rabbitmqreportservice-172.31.19.56' ...
[{nodes,[{disc,['rabbitmqreportservice172-31-19-56@rabbitmqreportservice-172.31.19.56']}]},
 {running_nodes,['rabbitmqreportservice172-31-19-56@rabbitmqreportservice-172.31.19.56']},
 {cluster_name,<<"rabbitmqreportservice172-31-19-56@rabbitmqreportservice-172">>},
 {partitions,[]}]

As you can see, only one node is registered.

gmr commented 9 years ago

Your node names won't work with the plugin; you can't change the rabbit@ bit in the current version. What setting did you use to get it that way?

scalp42 commented 8 years ago

@gmr I'm pretty sure he changed the NODENAME variable.

hamx0r commented 8 years ago

I'm having a similar issue, but I have done nothing to set hostnames when using the alpine-rabbitmq-autocluster Docker image (i.e. they are auto-set to values like rabbit@e9bd0b21c5af and rabbit@edae08d9e0bc). When I first start a pair of such containers, they each register as their own cluster. If I restart them, they log this error:

=INFO REPORT==== 15-Feb-2016::21:33:56 ===
autocluster: Registering node with consul

=ERROR REPORT==== 15-Feb-2016::21:34:01 ===
autocluster: Can not communicate with cluster nodes: [rabbit@node1]

This is odd since nothing is configured as rabbit@node1. If there's a known solution to this kind of problem, here's a serverfault question about this.

EDIT The above was using a single Consul server on a separate machine from the two RMQ machines. I've tried this again using a Consul Container running on the same machines running each of the RMQ instances to act as a Consul Client. The RMQ instances will start and register with their co-hosted Consul Client. Both Consul Clients are connected to the same Consul Server. When starting one of the RMQ instances after enough time has elapsed for the 1st RMQ instance to fully register with Consul, we see this:

docker logs rmq2 | grep autocluster
autocluster: Registering node with consul
autocluster: Can not communicate with cluster nodes: [rabbit@192]
autocluster: Starting Consul Health Check TTL Timer

It looks like Consul is registering each RMQ instance using its IP address for the hostname, and because there's a . in it, it thinks it's an FQDN. If I set RABBITMQ_USE_LONGNAME to true, RMQ fails to boot with this output.

jbwinters commented 8 years ago

+1, same issues as @hamx0r

DerykHopley commented 8 years ago

Hi gmr.

I'm also having a similar issue where the nodes aren't clustering. My hostnames are FQDN and longname is true. I am on virtualbox and ports are open.

I am using the etcd backend though and not consul. Etcd is being populated fine.

Version: 0.4.1
Backend: etcd
RabbitMQ 3.5.6
docker --version: Docker version 1.10.0, build 590d5108
etcd --version: etcd Version: 2.1.1
erlang.cookie: same
Docker container: gavinmroy/alpine-rabbitmq-autocluster

etcdctl ls /rabbitmq/default
/rabbitmq/default/dev1a.domain.net
/rabbitmq/default/dev1b.domain.net

    docker run --name rabbitmqcluster -d -h dev1b.domain.net \
      -e RABBITMQ_USE_LONGNAME=true \
      -e AUTOCLUSTER_TYPE=etcd \
      -e ETCD_SCHEME=http \
      -e ETCD_HOST=192.168.10.205 \
      -e ETCD_PORT=2379 \
      -e ETCD_PREFIX=rabbitmq \
      -e ETCD_TTL=30 \
      -p 4369:4369 -p 5672:5672 -p 15672:15672 -p 25672:25672 \
      gavinmroy/alpine-rabbitmq-autocluster

rabbitmq.config:

    [
      {rabbit, [
        {loopback_users, []},
        {cluster_partition_handling, autoheal},
        {delegate_count, 64},
        {fhc_read_buffering, false},
        {fhc_write_buffering, false},
        {heartbeat, 60},
        {queue_index_embed_msgs_below, 0},
        {queue_index_max_journal_entries, 8192},
        {log_levels, [
          {autocluster, debug},
          {connection, debug},
          {channel, warning},
          {federation, info},
          {mirroring, info}
        ]},
        {vm_memory_high_watermark, 0.8}
      ]},
      {rabbitmq_management, [{rates_mode, basic}]},
      {autocluster, [
        {backend, "etcd"},
        {etcd_host, "192.168.10.205"},
        {etcd_port, 2379},
        {etcd_scheme, "http"},
        {etcd_prefix, "rabbitmq"},
        {etcd_ttl, 30}
      ]}
    ].

container1.txt container2.txt

Regards Deryk

gmr commented 8 years ago

Closing the loop here, changing nodename will not be supported until 0.5.0.

MagicStarTrace commented 7 years ago

2016/11/29 04:24:16 Unexpected response code: 500 (CheckID does not have associated TTL)