Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

SecDev EC2 test instance datanode migration failure #19900

Closed ryan-carroll-graylog closed 1 month ago

ryan-carroll-graylog commented 1 month ago

Attempted to run an in-place data node migration on the green.secdev.torch.sh EC2 test instance using 6.1 alpha-4 Graylog and data node builds.

  1. Progress appeared to be going smoothly until the final step of the process, which hung for 20-30 minutes with the data node never becoming Available.
  2. At that point I noticed the Next button was active and (possibly erroneously?) took that to mean it was time to restart the Graylog server, which I did.
  3. Upon Graylog restart, got a ton of OS connection errors.
  4. Tried restarting the data node and got various errors.

Graylog server logs: graylog-server.log

Data node logs:

The subsequent log output from the 12th is the error output from the non-root datanode restart attempts.

datanode-cluster.log
datanode-cluster-2024-07-11-1.json.gz
datanode-cluster-2024-07-11-1.log.gz
datanode-cluster_deprecation.json
datanode-cluster_deprecation.log
datanode-cluster_server.json

Context

Data node testing.

Your Environment

todvora commented 1 month ago

It was a tricky situation. The datanode was failing because of a persistent block preventing any writes to indices. This happens if your system is running out of disk space. Cleared /tmp, removed some downloaded datanode and server dists and tried again. Still the same block there.

The problem is that the only way to deal with blocks is by calling APIs on the running opensearch. But when we start the datanode and its opensearch, we immediately try to write into one of the indices (metrics, IsmApi). That fails and forces the datanode to reboot the opensearch, and with the next reboot the same problem occurs. This happens for a while and then the datanode gives up. There is never a moment when the opensearch API is available.

The next step was to go back to the original plain opensearch. Start the opensearch service and look for blocks:

curl -XGET "http://localhost:9200/_cluster/settings"
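
For illustration, a response containing such a persistent block looks roughly like the following (assumed shape of the _cluster/settings output, not the exact response captured on this instance):

{
  "persistent" : {
    "cluster" : {
      "blocks" : {
        "create_index" : "true"
      }
    }
  },
  "transient" : { }
}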

And indeed, there was a persistent block. Removed it by calling

curl -d '{"persistent":{"cluster.blocks.create_index":false}}' -H "Content-Type: application/json" -X PUT http://localhost:9200/_cluster/settings
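
On success, opensearch acknowledges the settings change with something along these lines (illustrative output, not captured from this instance):

{
  "acknowledged" : true,
  "persistent" : {
    "cluster" : {
      "blocks" : {
        "create_index" : "false"
      }
    }
  },
  "transient" : { }
}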

Stopped the plain opensearch, started the datanode and it worked. Enough space, no blocks, writes went through. Graylog server could connect and everything is running fine now.

The problem we should solve: how do we remove blocks if we can't access the opensearch API? Should we check for blocks during startup and try to remove them automatically? Guard it by a configuration key?
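
One possible direction, sketched below as plain Java against the local opensearch HTTP API. Everything here is an assumption for discussion: the configuration key name, the raw HTTP client, and the string-based check for the block are illustrative only, not the datanode's existing API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: on startup, optionally look for a persistent create_index block
// and clear it before the datanode starts writing to its indices.
public class ClusterBlockCleaner {

    // Hypothetical opt-in config key; the real datanode would read this from its configuration.
    private static final boolean CLEAR_BLOCKS_ON_STARTUP =
            Boolean.parseBoolean(System.getProperty("datanode.clear_blocks_on_startup", "false"));

    private static final String SETTINGS_URL = "http://localhost:9200/_cluster/settings";

    public static void main(String[] args) throws Exception {
        if (!CLEAR_BLOCKS_ON_STARTUP) {
            return; // default behaviour stays exactly as today
        }
        HttpClient client = HttpClient.newHttpClient();

        // 1. Read the current cluster settings and look for a persistent create_index block.
        HttpRequest get = HttpRequest.newBuilder(URI.create(SETTINGS_URL)).GET().build();
        String settings = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
        if (!settings.contains("create_index")) {
            return; // no block present, nothing to do
        }

        // 2. Disable the block, mirroring the manual curl call above.
        String body = "{\"persistent\":{\"cluster.blocks.create_index\":false}}";
        HttpRequest put = HttpRequest.newBuilder(URI.create(SETTINGS_URL))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = client.send(put, HttpResponse.BodyHandlers.ofString());
        System.out.println("Cleared create_index block: " + response.body());
    }
}

Guarding the check behind an opt-in key would keep the default behaviour unchanged while still giving operators a way out of the restart loop described above.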