gbif / stackable

GBIF Stackable Infrastructure

Mapnik losing Zookeeper connection #37

Open MattBlissett opened 2 months ago

MattBlissett commented 2 months ago

The UAT2 mapnik servers are regularly losing their Zookeeper connections:

2024-06-13 10:52:38,048:25397:ZOO_WARN@zookeeper_interest@1572: Exceeded deadline by 6669ms
node: ../deps/uv/src/unix/core.c:898: uv__io_stop: Assertion `loop->watchers[w->fd] == w' failed.

Generally they would restart automatically, but with the way I have these set up (temporarily) they do not, so we notice the problem. Could there be an instability with the Zookeeper service?

There are other errors in the logs:

[Server was started at 09:25.]

2024-06-13 09:49:21,538:25397:ZOO_ERROR@handle_socket_error_msg@1748: Socket [130.226.238.143:2282] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2024-06-13 09:49:21,538:25397:ZOO_ERROR@zk_io_cb@322: yield:zookeeper_process returned error: -4 - connection loss

2024-06-13 09:49:21,539:25397:ZOO_ERROR@handle_socket_error_msg@1724: Socket [130.226.238.141:2282] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2024-06-13 09:49:21,539:25397:ZOO_ERROR@zk_io_cb@322: yield:zookeeper_process returned error: -4 - connection loss

2024-06-13 09:55:21,497:25397:ZOO_ERROR@handle_socket_error_msg@1748: Socket [130.225.43.186:2282] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2024-06-13 09:55:21,497:25397:ZOO_ERROR@zk_io_cb@322: yield:zookeeper_process returned error: -4 - connection loss

2024-06-13 09:55:21,501:25397:ZOO_ERROR@handle_socket_error_msg@1748: Socket [130.226.238.142:2282] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2024-06-13 09:55:21,501:25397:ZOO_ERROR@zk_io_cb@322: yield:zookeeper_process returned error: -4 - connection loss

2024-06-13 09:55:21,506:25397:ZOO_ERROR@handle_socket_error_msg@1748: Socket [130.225.43.184:2282] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2024-06-13 09:55:21,506:25397:ZOO_ERROR@zk_io_cb@322: yield:zookeeper_process returned error: -4 - connection loss

2024-06-13 10:20:30,283:25397:ZOO_ERROR@handle_socket_error_msg@1748: Socket [130.226.238.143:2282] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2024-06-13 10:20:30,284:25397:ZOO_ERROR@zk_io_cb@322: yield:zookeeper_process returned error: -4 - connection loss

2024-06-13 10:52:31,381:25397:ZOO_ERROR@handle_socket_error_msg@1668: Socket [130.226.238.141:2282] zk retcode=-7, errno=110(Connection timed out): connection to 130.226.238.141:2282 timed out (exceeded timeout by 1ms)
2024-06-13 10:52:31,381:25397:ZOO_ERROR@yield@261: yield:zookeeper_interest returned error: -7 - operation timeout

2024-06-13 10:52:38,048:25397:ZOO_WARN@zookeeper_interest@1572: Exceeded deadline by 6669ms
node: ../deps/uv/src/unix/core.c:898: uv__io_stop: Assertion `loop->watchers[w->fd] == w' failed.
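
To see whether the failures cluster on particular Zookeeper servers, the `handle_socket_error_msg` lines above can be tallied per server and reason. The following is a minimal sketch, assuming the log lines keep exactly the format shown in this issue; the regular expression and the `summarize` helper are illustrative names, not part of any Zookeeper tooling.

```python
import re
from collections import Counter

# Matches socket error lines in the format seen in this issue, e.g.
# 2024-06-13 09:49:21,538:25397:ZOO_ERROR@handle_socket_error_msg@1748:
#   Socket [130.226.238.143:2282] zk retcode=-4, errno=112(Host is down): ...
# (Assumption: other deployments may log a slightly different format.)
SOCKET_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+:\d+:"
    r"ZOO_ERROR@handle_socket_error_msg@\d+: "
    r"Socket \[(?P<server>[^\]]+)\] zk retcode=(?P<retcode>-?\d+), "
    r"errno=(?P<errno>\d+)\((?P<reason>[^)]*)\)"
)


def summarize(lines):
    """Count socket errors per (server, reason) pair.

    Lines that are not socket error messages (for example the
    zk_io_cb follow-up lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = SOCKET_ERROR.match(line.strip())
        if match:
            counts[(match["server"], match["reason"])] += 1
    return counts
```

Calling `summarize(open(path))` on a saved copy of the mapnik log (the path is whatever file the output was captured to) returns a `Counter` whose `most_common()` lists the servers with the most errors first.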

zaultooz commented 1 month ago

Thanks for the information on the issue. I have previously looked a bit at this, as the Zookeeper cluster seems to hold random re-elections, which cause some nodes to restart.

What I am a bit puzzled about is why mapnik doesn't fail over to the other servers in the connection quorum.

I will have to investigate more.
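
As a starting point for that investigation, a quick check of which servers accept TCP connections from the mapnik host can help separate a network or node outage ("Host is down", "Connection refused") from a client-side failover problem. This is a rough sketch only: the server list is copied from the log excerpts in this issue and may not be the full connection string, and `can_connect` and `probe` are hypothetical helper names. It only tests TCP reachability and says nothing about Zookeeper session or quorum state.

```python
import socket

# Addresses as they appear in the log excerpts in this issue.
# (Assumption: this is the complete set used by the mapnik servers.)
SERVERS = [
    ("130.226.238.141", 2282),
    ("130.226.238.142", 2282),
    ("130.226.238.143", 2282),
    ("130.225.43.184", 2282),
    ("130.225.43.186", 2282),
]


def can_connect(host, port, timeout=2.0):
    """Try a plain TCP connection; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, unreachable hosts and timeouts.
        return False, str(exc)


def probe(servers, timeout=2.0):
    """Return a dict mapping 'host:port' to the (ok, detail) result."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in servers
    }
```

Running `probe(SERVERS)` repeatedly from a mapnik host, for example once a minute, and comparing the results with the timestamps of the `ZOO_ERROR` lines would show whether individual servers really are unreachable when the client reports connection loss.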