NethServer / dev

NethServer issue tracker
https://github.com/NethServer/dev/issues

Core update hangs indefinitely #6854

Closed DavidePrincipi closed 4 months ago

DavidePrincipi commented 4 months ago

Since I added a fast node to an existing cluster, the nightly apply-updates procedure blocks during update-core.

● redis.service - Core Redis DB
     Loaded: loaded (/etc/systemd/system/redis.service; enabled; preset: disabled)
     Active: active (running) since Tue 2024-02-20 02:45:50 CET; 13h ago
Feb 20 02:45:49.688798 ns8n5 agent@node[11056]: Running /var/lib/nethserver/node/update-core.d/95cleanup_images...
Feb 20 02:45:49.822616 ns8n5 agent@node[11056]: Failed to publish the action status on channel progress/node/15/task/9d6cec45-13e1-4799-a481-846ed0eb3469
Feb 20 02:45:49.822864 ns8n5 agent@node[11056]: task/node/15/9d6cec45-13e1-4799-a481-846ed0eb3469: update-core/95cleanup_images is starting
Feb 20 02:45:49.900384 ns8n5 agent@node[11056]: Failed to publish the action status on channel progress/node/15/task/9d6cec45-13e1-4799-a481-846ed0eb3469
Feb 20 02:45:49.900954 ns8n5 agent@node[11056]: Redis command failed: dial tcp 10.5.4.5:6379: connect: connection refused
Feb 20 02:45:49.900954 ns8n5 agent@node[11056]: task/node/15/9d6cec45-13e1-4799-a481-846ed0eb3469: action "update-core" status is "completed" (0) at step 95cleanup_images

The log trace shows no retry attempt to write the task output to Redis: after Redis is restarted, the node agent running on the fast node fails to publish its update-core exit status. As a result, the controlling task running on the cluster leader never finds the task outcome, and the whole action blocks.

Steps to reproduce

Run a cluster action that restarts Redis. To define such an action:

mkdir /var/lib/nethserver/cluster/actions/check-bug-6854
vi /var/lib/nethserver/cluster/actions/check-bug-6854/10restart_redis
chmod +x /var/lib/nethserver/cluster/actions/check-bug-6854/10restart_redis

In 10restart_redis:

#!/bin/bash
systemctl stop redis
systemctl start redis --no-block

Expected results

The action terminates.

Actual results

The action is blocked until I manually create a fake task exit status with MPUT.

Fix proposal

During Redis restarts, the default go-redis library retry settings may not suffice.

Increase the retry period of our agent.
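
To illustrate the idea of a longer retry period, here is a minimal shell sketch of retrying with exponential backoff. It is not the agent's code: the agent is written in Go and relies on go-redis retry settings, and the change actually released is not shown in this issue. The function name `retry_with_backoff` and its parameters are hypothetical, chosen only for this example.

```shell
#!/bin/bash
# Hypothetical helper (not part of ns8-core): retry a command with
# exponential backoff until it succeeds or the attempts run out.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        # Run the command; stop retrying as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt budget is exhausted.
        if (( attempt >= max_attempts )); then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
        sleep "$delay"
        # Double the delay so that a slow Redis restart is still covered.
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}
```

For example, `retry_with_backoff 8 1 redis-cli -h 10.5.4.5 ping` would keep probing the Redis address seen in the log for about two minutes (1 + 2 + ... + 64 seconds of waiting) before giving up, which is long enough to ride out a stop/start cycle like the one in 10restart_redis.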

Components

DavidePrincipi commented 4 months ago

Released in https://github.com/NethServer/ns8-core/releases/tag/2.5.2