Gossip between two Docker images

crustyratfink commented 6 years ago

Gentics Mesh Version, operating system, or hardware.

v0.22.1
Docker image gentics/mesh:latest (0.22.1)

Operating System

Linux

Problem

Running the Docker container in a cluster results in good stuff:

+------------------+------+-----------------------+-----+---------+------------------+------------------+--------------------------+
|Name              |Status|Databases              |Conns|StartedOn|Binary            |HTTP              |UsedMemory                |
+------------------+------+-----------------------+-----+---------+------------------+------------------+--------------------------+
|test2@0-22-1(*)(@)|ONLINE|storage=ONLINE (MASTER)|2    |16:49:32 |172.31.0.157:2424 |172.31.0.157:2480 |149.08MB/878.50MB (16.97%)|
|test1@0-22-1      |ONLINE|storage=ONLINE (MASTER)|2    |16:49:34 |172.31.12.227:2424|172.31.12.227:2480|20.37MB/483.38MB (4.21%)  |
+------------------+------+-----------------------+-----+---------+------------------+------------------+--------------------------+

and then a whole bunch of bad stuff:

2018-09-02 16:50:56:052 WARNI [test2@0-22-1]->[test1@0-22-1] Server 'test1@0-22-1' did not respond to the gossip message (db=storage, timeout=10000ms), but cannot be set OFFLINE by configuration
2018-09-02 16:50:56:056 WARNI [test2@0-22-1]->[test1@0-22-1] Error on sending message to distributed node (java.net.SocketException: Broken pipe (Write failed)) retrying (1/3)
2018-09-02 16:50:56:057 WARNI [test2@0-22-1]->[test1@0-22-1] Error on reconnecting to distributed node (java.net.ConnectException: Connection refused (Connection refused))
2018-09-02 16:50:56:058 WARNI [test2@0-22-1]->[test1@0-22-1] Error on sending message to distributed node (java.net.SocketException: Socket closed) retrying (2/3)
2018-09-02 16:50:56:459 WARNI [test2@0-22-1]->[test1@0-22-1] Error on reconnecting to distributed node (java.net.ConnectException: Connection refused (Connection refused))
2018-09-02 16:50:56:459 WARNI [test2@0-22-1]->[test1@0-22-1] Error on sending message to distributed node (java.net.SocketException: Socket closed) retrying (3/3)
2018-09-02 16:50:57:060 WARNI [test2@0-22-1]->[test1@0-22-1] Error on reconnecting to distributed node (java.net.ConnectException: Connection refused (Connection refused))
2018-09-02 16:50:57:061 SEVER [test2@0-22-1]->[test1@0-22-1] Error on sending distributed request id=0.6 task=gossip timestamp: 1535907056055 lockManagerServer: test2@0-22-1 (err=Connection refused (Connection refused)). Active nodes: [test2@0-22-1, test1@0-22-1]

Reproducer

Set up two AWS instances running Mesh Docker containers in the same security group, subnet, etc. Specify the instances' local IPs for the cluster nodes in the config and run the Docker containers with --net=host.

Expected behaviour and actual behaviour

Expected: complete bootstrap and go online. Actual: hangs on gossip errors and refused connections.

Now, I should say that I know that the Docker image is provided but not supported. I'm hoping that the information I've posted might indicate an obvious configuration issue that would help the growing number of people who deploy this way or help if I go the route of installing everything bare metal.

Jotschi commented 6 years ago

Thanks for the report. I think the issue arises because the wrong IP is announced to other hosts. Thus the hosts cannot form a cluster.

Is 172.31.0.157 the docker container network IP or the Host IP?
Do you mean two different EC2 instances by "AWS instances"?
Did you add the needed ports to the security group whitelist? See https://getmesh.io/docs/beta/clustering.html#_port_mapping
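A quick way to rule out the security group or Docker networking as the cause of the "Connection refused" errors is to test TCP reachability from one node to the other. The sketch below is only an illustration: it probes the two ports visible in the cluster table (2424 for Binary, 2480 for HTTP), and the helper names `port_open` and `check_peer` are made up for this example. The linked port mapping page may list further ports that are not covered here.

```python
import socket

# Ports taken from the "Binary" and "HTTP" columns of the cluster table.
# Assumption: other cluster ports from the port mapping docs are not checked.
CLUSTER_PORTS = (2424, 2480)


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_peer(host: str, ports=CLUSTER_PORTS) -> dict:
    """Map each port to whether the peer accepted a TCP connection on it."""
    return {port: port_open(host, port) for port in ports}
```

Calling `check_peer("172.31.12.227")` on test2 (and `check_peer("172.31.0.157")` on test1) returns a dict such as `{2424: True, 2480: False}`, which shows at a glance which port is blocked or not yet listening.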

crustyratfink commented 6 years ago

Hi Johannes.

That is the private IP of the EC2 instance (yes, sorry, that's what I meant). There are two in the same security group, and I've added a rule to allow all traffic internally. The Docker --net=host option should expose all of the ports on the host.

Update: It would appear that running the container as --privileged has gotten me past that problem. The two instances are running behind a load balancer, configured as described. It does not appear to be clustering, though. I add a node, hit reload, and it's gone. Hit it again, and it's back. I'm getting different results from the two nodes, though it says encouraging things like:

+------------------+------+-----------------------+-----+---------+------------------+------------------+--------------------------+
|Name              |Status|Databases              |Conns|StartedOn|Binary            |HTTP              |UsedMemory                |
+------------------+------+-----------------------+-----+---------+------------------+------------------+--------------------------+
|test2@0-22-1(*)(@)|ONLINE|storage=ONLINE (MASTER)|0    |20:49:49 |172.31.0.157:2424 |172.31.0.157:2480 |52.72MB/878.50MB (6.00%)  |
|test1@0-22-1      |ONLINE|storage=ONLINE (MASTER)|4    |20:49:51 |172.31.12.227:2424|172.31.12.227:2480|164.88MB/878.50MB (18.77%)|
+------------------+------+-----------------------+-----+---------+------------------+------------------+--------------------------+

As far as I can tell from the logs, it's working, but alas, it is not a cluster. It's just two independent instances doing their own thing.

Thanks for your response. Looks like I'll either need to approach this by clustering/debugging the individual pieces or pitch it. Meanwhile, I'm going outside while the sun is still out.

crustyratfink commented 6 years ago

I'll just drop a follow-up here. It is clustering. /api/v1/admin/cluster/status shows:

{
  "instances": [
    {
      "address": "172.31.0.157:2424",
      "name": "test2",
      "status": "ONLINE",
      "startDate": "2018-09-03T01:16:48Z"
    },
    {
      "address": "172.31.12.227:2424",
      "name": "test1",
      "status": "ONLINE",
      "startDate": "2018-09-03T01:16:47Z"
    }
  ]
}

And, if I try to add anything twice (e.g., a folder slug), it won't let me. Still, when I reload the project page, for instance, half the time folders show up, half the time they don't. If I create one folder on one node and another on the other one, they alternate when I reload (because the load balancer is switching back and forth).

So, it's half-clustering. If there's such a thing.

YAU (yet another update): When I add nodes, they show up in the UI only sporadically, depending on which instance the load balancer picks, but the same does not appear to happen with the API. Judging from both GraphiQL and my own tests, I can access any node through the load balancer.
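Because the load balancer hides which instance answers, it can help to query the cluster status endpoint on each instance directly and compare the results. The following is a rough sketch, not an official client: the base URLs, the bearer-token authentication and the helper names (`fetch_status`, `membership`, `same_membership`) are assumptions for illustration. Only the endpoint path and the response fields come from the output above.

```python
import json
import urllib.request

STATUS_PATH = "/api/v1/admin/cluster/status"  # endpoint quoted in this thread


def fetch_status(base_url: str, token: str, timeout: float = 5.0) -> dict:
    """GET the cluster status from a single instance.

    Assumptions: base_url points at one instance's HTTP API (not the load
    balancer) and the API accepts a bearer token in the Authorization header.
    """
    req = urllib.request.Request(
        base_url.rstrip("/") + STATUS_PATH,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def membership(status: dict) -> set:
    """Reduce a status document to a set of (name, address, status) tuples."""
    return {
        (inst["name"], inst["address"], inst["status"])
        for inst in status.get("instances", [])
    }


def same_membership(a: dict, b: dict) -> bool:
    """True if both instances report the same cluster members and states."""
    return membership(a) == membership(b)
```

Matching membership only shows that the instances see each other. It does not prove that they share the same database or search index, so content should still be compared per instance.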

Jotschi commented 6 years ago

Half-clustering is a thing if you have not set up clustering for Elasticsearch. Clustering must be set up for both Mesh and ES. Gentics Mesh will not configure clustering for Elasticsearch on its own. You need to configure it manually. Maybe you missed that step?

But since you also experience this issue with GraphQL, I assume that you may have set up two different Mesh DB instances. You must set up only one DB and let the other nodes join this cluster.

Maybe this compose example will help you: https://github.com/gentics/mesh-compose/tree/clustering

Only the first node in the cluster may use the MESH_CLUSTER_INIT environment variable. If you add this variable or the -initCluster argument to the mesh command line on more than one node, it will create multiple databases within your cluster. In that case a ping-pong effect can take place, which will cause switches between databases.
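The rule about MESH_CLUSTER_INIT can be checked mechanically before deploying. Here is a minimal sketch that takes the environment of each container as a plain dict and flags the case where more than one node requests cluster initialization. The helper names and the set of values treated as "true" are assumptions for this example, not Mesh behavior.

```python
# Assumption: these strings count as "enabled"; how Mesh itself parses the
# variable is not specified in this thread.
TRUTHY = {"1", "true", "yes"}


def init_nodes(node_envs: dict) -> list:
    """Return the names of nodes whose environment enables MESH_CLUSTER_INIT."""
    return sorted(
        name
        for name, env in node_envs.items()
        if str(env.get("MESH_CLUSTER_INIT", "")).strip().lower() in TRUTHY
    )


def check_single_init(node_envs: dict) -> list:
    """Raise ValueError if more than one node would initialize the cluster."""
    flagged = init_nodes(node_envs)
    if len(flagged) > 1:
        raise ValueError(
            "MESH_CLUSTER_INIT is set on several nodes: " + ", ".join(flagged)
        )
    return flagged
```

For the setup in this thread, `check_single_init({"test1": {"MESH_CLUSTER_INIT": "true"}, "test2": {}})` passes, while setting the variable on both test1 and test2 raises an error.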

crustyratfink commented 6 years ago

Got it. Thanks for the information.
