Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

SecDev EC2 test instance datanode migration failure #19900

Closed ryan-carroll-graylog closed 1 month ago

ryan-carroll-graylog commented 1 month ago

Attempted to run an in-place data node migration on the green.secdev.torch.sh EC2 test instance using 6.1 alpha-4 Graylog and data node builds.

  1. Progress appeared to be going smoothly until the final step of the process, which hung for 20-30 minutes with the data node never becoming Available.
  2. At that point I noticed the Next button was active and (possibly erroneously?) took that to mean it was time to restart the Graylog server, which I did.
  3. Upon Graylog restart, got a ton of OS connection errors.
  4. Tried restarting the data node and got various errors.

Graylog server logs: graylog-server.log

Data node logs:

The subsequent log output from the 12th is the error output from the non-root datanode restart attempts.

datanode-cluster.log
datanode-cluster-2024-07-11-1.json.gz
datanode-cluster-2024-07-11-1.log.gz
datanode-cluster_deprecation.json
datanode-cluster_deprecation.log
datanode-cluster_server.json

Context

Data node testing.

Your Environment

todvora commented 1 month ago

It was a tricky situation. The datanode was failing because of a persistent block preventing any writes to indices. This happens if your system is running out of disk space. Cleared /tmp, removed some downloaded datanode and server dists and tried again. Still the same block there.

The problem is that the only way to deal with blocks is by calling APIs on the running opensearch. But when we start the datanode and its opensearch, we immediately try to write into one of the indices (metrics, IsmApi). That fails and forces the datanode to reboot the opensearch, and with the next reboot the same problem occurs. This happens for a while and then the datanode gives up. There is never a moment when the opensearch API is available.

The next step was to go back to the original plain opensearch. Start the opensearch service and look for blocks:

curl -XGET "http://localhost:9200/_cluster/settings"
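
For illustration, a response containing such a persistent block looks roughly like the following (assumed shape of the _cluster/settings output, not the exact response captured on this instance):

{
  "persistent" : {
    "cluster" : {
      "blocks" : {
        "create_index" : "true"
      }
    }
  },
  "transient" : { }
}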

And indeed, there was a persistent block. Removed it by calling

curl -d '{"persistent":{"cluster.blocks.create_index":false}}' -H "Content-Type: application/json" -X PUT http://localhost:9200/_cluster/settings
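
On success, opensearch acknowledges the settings change with something along these lines (illustrative output, not captured from this instance):

{
  "acknowledged" : true,
  "persistent" : {
    "cluster" : {
      "blocks" : {
        "create_index" : "false"
      }
    }
  },
  "transient" : { }
}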

Stopped the plain opensearch, started the datanode and it worked. Enough space, no blocks, writes went through. Graylog server could connect and everything is running fine now.

The problem we should solve: how do we remove blocks if we can't access the opensearch API? Should we check for blocks during startup and try to remove them automatically? Guard it by a configuration key?
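
One possible direction, sketched below as plain Java against the local opensearch HTTP API. Everything here is an assumption for discussion: the configuration key name, the raw HTTP client, and the string-based check for the block are illustrative only, not the datanode's existing API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: on startup, optionally look for a persistent create_index block
// and clear it before the datanode starts writing to its indices.
public class ClusterBlockCleaner {

    // Hypothetical opt-in config key; the real datanode would read this from its configuration.
    private static final boolean CLEAR_BLOCKS_ON_STARTUP =
            Boolean.parseBoolean(System.getProperty("datanode.clear_blocks_on_startup", "false"));

    private static final String SETTINGS_URL = "http://localhost:9200/_cluster/settings";

    public static void main(String[] args) throws Exception {
        if (!CLEAR_BLOCKS_ON_STARTUP) {
            return; // default behaviour stays exactly as today
        }
        HttpClient client = HttpClient.newHttpClient();

        // 1. Read the current cluster settings and look for a persistent create_index block.
        HttpRequest get = HttpRequest.newBuilder(URI.create(SETTINGS_URL)).GET().build();
        String settings = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
        if (!settings.contains("create_index")) {
            return; // no block present, nothing to do
        }

        // 2. Disable the block, mirroring the manual curl call above.
        String body = "{\"persistent\":{\"cluster.blocks.create_index\":false}}";
        HttpRequest put = HttpRequest.newBuilder(URI.create(SETTINGS_URL))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = client.send(put, HttpResponse.BodyHandlers.ofString());
        System.out.println("Cleared create_index block: " + response.body());
    }
}

Guarding the check behind an opt-in key would keep the default behaviour unchanged while still giving operators a way out of the restart loop described above.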