codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

Arbitrator (garbd) node incorrectly reports SSL error as "connection timeout" #495

Open GeoffMontee opened 6 years ago

GeoffMontee commented 6 years ago

Let's say that you have a cluster in which SSL encryption is used for Galera replication traffic. If you try to add an arbitrator (garbd) node to that cluster, and the arbitrator is not configured with the appropriate socket.ssl_* options (the same options that the cluster nodes set in wsrep_provider_options), then the arbitrator will not be able to join the cluster due to SSL errors, which is expected.
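
For reference, here is a minimal sketch of the configuration that is missing on the arbitrator side. The certificate paths are placeholders, and the same options can also go into the garbd configuration file (whose location varies by distribution); the node addresses match the log below:

garbd --group my_wsrep_cluster \
      --address "gcomm://172.30.0.249:4567,172.30.0.46:4567,172.30.0.32:4567" \
      --options "socket.ssl=yes; socket.ssl_key=/etc/ssl/galera/garbd-key.pem; socket.ssl_cert=/etc/ssl/galera/garbd-cert.pem; socket.ssl_ca=/etc/ssl/galera/ca-cert.pem"

The socket.ssl_* values have to match the certificates that the cluster nodes use in their wsrep_provider_options; when they are omitted, garbd speaks plain TCP to SSL listeners and eventually gives up as shown below.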

Unfortunately, when this happens, the arbitrator's log does not make it clear that it failed to connect due to SSL errors. Instead, the log suggests that the arbitrator node actually timed out.

For example, here is the error log from an arbitrator node without SSL enabled that is trying to join a cluster that does use SSL:

Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: Starting garbd
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.156  INFO: CRC-32C: using hardware acceleration.
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.156  INFO: Read config:
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: daemon:  0
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: name:    garb
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: address: gcomm://172.30.0.249:4567,172.30.0.46:4567,172.30.0.32:4567
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: group:   my_wsrep_cluster
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: sst:     trivial
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: donor:
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: options: gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: cfg:
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: log:
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.157  INFO: protonet asio version 0
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.157  INFO: Using CRC-32C for message checksums.
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  INFO: backend: asio
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  INFO: gcomm thread scheduling priority set to other:0
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  WARN: access file(./gvwstate.dat) failed(No such file or directory)
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  INFO: restore pc from disk failed
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  INFO: GMCast version 0
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.158  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.159  INFO: EVS version 0
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.159  INFO: gcomm: connecting to group 'my_wsrep_cluster', peer '172.30.0.249:4567,172.30.0.46:4567,172.30.0.32:4567'
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.160  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection established to fbff02f5 tcp://172.30.0.32:4567
Jan  2 16:00:38 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:38.160  WARN: (fbff02f5, 'tcp://0.0.0.0:4567') address 'tcp://172.30.0.32:4567' points to own listening address, blacklisting
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:41.160  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.249:4567 timed out, no messages seen in PT3S
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:41.160  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer fbff02f5 with addr tcp://172.30.0.32:4567 timed out, no messages seen in PT3S
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:41.160  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.46:4567 timed out, no messages seen in PT3S
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:41.160  WARN: no nodes coming from prim view, prim not possible
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:41.160  INFO: view(view_id(NON_PRIM,fbff02f5,1) memb {
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: fbff02f5,0
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: } joined {
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: } left {
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: } partitioned {
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: })
Jan  2 16:00:41 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:41.660  WARN: last inactive check more than PT1.5S ago (PT3.50168S), skipping check
Jan  2 16:00:45 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:45.161  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.249:4567 timed out, no messages seen in PT3S
Jan  2 16:00:48 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:48.161  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.46:4567 timed out, no messages seen in PT3S
Jan  2 16:00:51 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:51.162  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.249:4567 timed out, no messages seen in PT3S
Jan  2 16:00:54 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:54.162  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.46:4567 timed out, no messages seen in PT3S
Jan  2 16:00:57 ip-172-30-0-32 garb-systemd: 2018-01-02 16:00:57.163  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.249:4567 timed out, no messages seen in PT3S
Jan  2 16:01:00 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:00.164  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.46:4567 timed out, no messages seen in PT3S
Jan  2 16:01:03 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:03.164  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.249:4567 timed out, no messages seen in PT3S
Jan  2 16:01:06 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:06.165  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.46:4567 timed out, no messages seen in PT3S
Jan  2 16:01:09 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:09.165  INFO: (fbff02f5, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.30.0.249:4567 timed out, no messages seen in PT3S
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:11.169  INFO: view((empty))
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:11.169 ERROR: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: at gcomm/src/pc.cpp:connect():158
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:11.169 ERROR: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
Jan  2 16:01:11 ip-172-30-0-32 systemd: garb.service: main process exited, code=exited, status=1/FAILURE
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:11.170 ERROR: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'my_wsrep_cluster' at 'gcomm://172.30.0.249:4567,172.30.0.46:4567,172.30.0.32:4567': -110 (Connection timed out)
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: 2018-01-02 16:01:11.170 FATAL: Exception in creating receive loop: Failed to open connection to group: 110 (Connection timed out)
Jan  2 16:01:11 ip-172-30-0-32 garb-systemd: at garb/garb_gcs.cpp:Gcs():35
Jan  2 16:01:11 ip-172-30-0-32 systemd: Unit garb.service entered failed state.
Jan  2 16:01:11 ip-172-30-0-32 systemd: garb.service failed.

The messages in the error log appear to be incorrect to me. They should probably report that an SSL error occurred, rather than a connection timeout.
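
As an illustration of the kind of reporting being asked for, here is a minimal, self-contained C++ sketch; the class and function names are hypothetical and are not gcomm's actual API. The idea is to remember the most recent handshake failure per peer and fold it into the final "Connection timed out" error:

#include <iostream>
#include <map>
#include <string>

// Hypothetical helper, not gcomm's real API: just the shape of the idea.
class ConnectDiagnostics
{
public:
    // Would be called from the SSL handshake failure path for each peer.
    void record_handshake_error(const std::string& peer, const std::string& what)
    {
        last_error_[peer] = what;
    }

    // Would be called when the join attempt finally gives up: keep the
    // timeout, but also report why the handshakes kept failing.
    std::string timeout_message() const
    {
        std::string msg("failed to reach primary view: 110 (Connection timed out)");
        for (const auto& e : last_error_)
        {
            msg += "; last handshake error from " + e.first + ": " + e.second;
        }
        return msg;
    }

private:
    std::map<std::string, std::string> last_error_;
};

int main()
{
    ConnectDiagnostics diag;
    // The error string below is illustrative only.
    diag.record_handshake_error("tcp://172.30.0.249:4567",
                                "asio.ssl handshake failed (peer requires SSL)");
    std::cerr << diag.timeout_message() << std::endl;
    return 1;
}

With something like this, the final ERROR/FATAL lines would still mention the timeout but would also point at the SSL handshake failures that caused it.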

Since the arbitrator's log is misleading in this case, the user might think to check the error logs on the cluster nodes for a hint about what is happening. Unfortunately, the error logs of the cluster nodes (running MariaDB 10.1) usually do not contain any relevant messages either when SSL-related errors occur. However, if the failure was specifically caused by an incompatible cipher choice, then a cluster node's log might contain an error message like the following:

2018-01-02 14:58:31 139824916838144 [ERROR] WSREP: handshake with remote endpoint ssl://10.1.1.10:49840 failed: asio.ssl:336109761: 'no shared cipher' ( 336109761: 'error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher')
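
In that specific case, the fix is to make the cipher settings agree on both sides. A hedged example, where the cipher name is only a placeholder:

# on the cluster nodes, inside wsrep_provider_options:
socket.ssl_cipher=AES128-SHA256

# on the arbitrator, inside garbd's --options string (or its config file):
socket.ssl_cipher=AES128-SHA256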
ghost commented 6 years ago

Please fix this issue. I'm trying to use a more secure cipher because a vulnerability scan is complaining about insecure ciphers, but whenever I change it on the arbitrator, I get the exact same error message as you do.

jsfrerot commented 3 years ago

I'm still hitting this issue and wasted a lot of time troubleshooting it, only to find a single missing SSL parameter. Please fix the issue.