ES Index API is unreachable even if there are reachable hosts in the elasticsearch_hosts variable

tkuronen commented 4 years ago

Expected Behavior

Graylog should skip unreachable hosts defined in the elasticsearch_hosts variable and call the API of the other hosts instead.

Current Behavior

Two hosts are defined in the elasticsearch_hosts variable, e.g.: elasticsearch_hosts = http://es-coord001.example.com:9200,http://es-coord002.example.com:9200

If one of the hosts defined in elasticsearch_hosts is down (in our case it was the first one in the list), the following errors appear in server.log:

2020-04-15T00:00:02.669+03:00 ERROR [IndexFieldTypePoller] Couldn't get mapping for index <xyz_524>: No route to host (Host unreachable).

2020-04-15T00:00:02.669+03:00 ERROR [IndexRotationThread] Couldn't point deflector to a new index … Caused by: java.net.NoRouteToHostException: No route to host (Host unreachable)

2020-04-15T00:00:05.675+03:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set …. Caused by: java.net.NoRouteToHostException: No route to host (Host unreachable)

2020-04-15T00:00:26.717+03:00 ERROR [Messages] Caught exception during bulk indexing: java.net.NoRouteToHostException: No route to host (Host unreachable), retrying (attempt #1).

Also, the System > Indices page in the UI shows the following errors and does not list any index sets:

[screenshot: ui_error]

I also tried changing the unresponsive host in elasticsearch_hosts to a name that does not even exist in DNS, and that host was skipped. So the problem occurs when a DNS entry for the host exists but the host is not responding on port 9200.
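The two failure modes described above (name not in DNS versus name resolves but nothing answers on port 9200) can be told apart outside Graylog. The following is a minimal Python sketch for diagnosis only, not Graylog code; `classify_host`, its result labels, and the 3-second default timeout are hypothetical choices made for illustration.

```python
import socket
from urllib.parse import urlsplit


def classify_host(url, timeout=3.0, connect=socket.create_connection):
    """Return 'ok', 'dns_failure', or 'unreachable' for one elasticsearch_hosts entry."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 9200  # assume the Elasticsearch HTTP port if none is given
    try:
        conn = connect((host, port), timeout)
    except socket.gaierror:
        # The host name could not be resolved (no DNS entry).
        return "dns_failure"
    except OSError:
        # The name resolved, but the TCP connection failed
        # (connection refused, no route to host, timeout, ...).
        return "unreachable"
    conn.close()
    return "ok"
```

Calling `classify_host` on each entry of elasticsearch_hosts while one node is shut down should report `unreachable` for that node, which is the case that was not skipped in this report.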

Possible Solution

Unresponsive hosts should be skipped, just like the ones with a missing DNS entry.
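To illustrate the requested behavior, here is a minimal Python sketch of client-side failover: try each configured host in order and move on when a connection-level error occurs. It is not how Graylog or its Elasticsearch client is implemented; `get_from_first_reachable`, `http_get`, and the timeout value are hypothetical names and choices for this example.

```python
import urllib.request


def http_get(url, timeout):
    """Default fetcher: plain HTTP GET that returns the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def get_from_first_reachable(hosts, path, timeout=3.0, fetch=http_get):
    """Try each host in order; return (host, body) from the first one that answers."""
    errors = []
    for host in hosts:
        try:
            return host, fetch(host.rstrip("/") + path, timeout)
        except OSError as exc:
            # Covers DNS failures, refused connections, "No route to host",
            # and timeouts. urllib's URLError and HTTPError also derive from
            # OSError, so HTTP error statuses are skipped too in this sketch.
            errors.append(f"{host}: {exc}")
    raise ConnectionError("no reachable host: " + "; ".join(errors))
```

With the two hosts from this report, `get_from_first_reachable(hosts, "/_cat/indices")` would fall through to es-coord002 when es-coord001 is down, and would raise only if every host fails.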

Steps to Reproduce (for bugs)

1. Set, for example, two hosts in Graylog's elasticsearch_hosts config variable and verify that the System > Indices page lists all the index sets.
2. Shut down either one of the hosts and check the System > Indices page again.

Context

The situation was noticed when one of our ES coordinating nodes defined in elasticsearch_hosts was down for maintenance and I tried to open the System > Indices page during that time.

Your Environment

Graylog Version: 3.2.4
Elasticsearch Version: 6.8.8
MongoDB Version: 4.0.17
Operating System: RHEL 7.7

tkuronen commented 4 years ago

Here is also an error message from the Chrome console:

jaudriga commented 4 years ago

I have the same issue on Graylog container 3.3.1-1, Elasticsearch container 6.8.10 and MongoDB container 4.2.8 on Ubuntu 20.04.

I reproduced it by taking down one of three/two Elasticsearch nodes.

You seem to have had the same issue in the past (see https://github.com/Graylog2/graylog2-server/issues/3993 ). So, maybe this is a regression introduced by https://github.com/Graylog2/graylog2-server/pull/4741 ?

jaudriga commented 4 years ago

With Elasticsearch discovery enabled, the issue is gone after the following line appears in the logs:

2020-08-11 12:05:25,437 INFO : io.searchbox.client.AbstractJestClient - Setting server pool to a list of 2 servers: [http://10.0.0.4:9200,http://10.0.0.46:9200]

I therefore presume that the issue is only present as long as one of the servers in the list is down. It should resolve itself within elasticsearch_discovery_frequency (30 s by default).
