Firewalld ns-wireguard service name conflict

DavidePrincipi commented 1 week ago

After a failed join attempt, the node RL2 is left in an invalid state: it cannot rejoin the cluster or become the first node of a new cluster.

Steps to reproduce

Install NS8 on a host RL1 with an invalid domain suffix, e.g. dp.test
Initialize RL1 as leader node, with the invalid domain suffix
After cluster initialization, go to the Nodes page and change the leader FQDN to a valid one, e.g. rl1.dp.nethserver.net
Join the second node, RL2: the join-node action fails.

Expected behavior

I expect the join to work, or at least to be able to recover from the error by some means.

Actual behavior

Despite the error message, the RL2 UI shows a link to the leader node, giving me the impression that the join was successful in the end.
If I reload the page, RL2 shows the initial choice screen again, with the options create cluster, join node, and restore from backup.
If I choose create-cluster, the create-cluster procedure configures RL2 as leader of a new cluster, but a conflict on the ns-wireguard firewall service occurs.

The RL2 journal shows the original join failure:

Jun 24 07:25:05 rl2 agent@cluster[31660]: task/cluster/0aead8ad-0578-4cdd-b255-73b9cf71df4f: join-node/30start_replication is starting
Jun 24 07:25:05 rl2 traefik[32066]: 80.17.99.73 - - [24/Jun/2024:07:25:05 +0000] "GET /cluster-admin/api/cluster/task/0aead8ad-0578-4cdd-b255-73b9cf71df4f/context HTTP/2.0" 200 309 "-" "-" 177 "ApiServer-https@file" "http://127.0.0.1:9311>
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.060 * 1 changes in 5 seconds. Saving...
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.061 * Background saving started by pid 31
Jun 24 07:25:06 rl2 redis[31501]: 31:C 24 Jun 2024 07:25:06.068 * DB saved on disk
Jun 24 07:25:06 rl2 redis[31501]: 31:C 24 Jun 2024 07:25:06.068 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.161 * Background saving terminated with success
Jun 24 07:25:07 rl2 agent@cluster[31660]: sed -i -e '/^AGENT_ID=/c\AGENT_ID=node/2' -e '/^REDIS_USER=/c\REDIS_USER=node/2' /var/lib/nethserver/node/state/agent.env
Jun 24 07:25:07 rl2 agent@cluster[31660]: Traceback (most recent call last):
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/var/lib/nethserver/cluster/actions/join-node/30start_replication", line 63, in <module>
Jun 24 07:25:07 rl2 agent@cluster[31660]:     cluster.vpn.initialize_wgconf(ip_address, peer={
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/usr/local/agent/pypkg/cluster/vpn.py", line 36, in initialize_wgconf
Jun 24 07:25:07 rl2 agent@cluster[31660]:     peer_ep_address = socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
Jun 24 07:25:07 rl2 agent@cluster[31660]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/usr/lib64/python3.11/socket.py", line 962, in getaddrinfo
Jun 24 07:25:07 rl2 agent@cluster[31660]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Jun 24 07:25:07 rl2 agent@cluster[31660]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 24 07:25:07 rl2 agent@cluster[31660]: socket.gaierror: [Errno -2] Name or service not known
Jun 24 07:25:08 rl2 agent@cluster[31660]: task/cluster/0aead8ad-0578-4cdd-b255-73b9cf71df4f: action "join-node" status is "aborted" (1) at step 30start_replication
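
The traceback shows the join aborting inside initialize_wgconf, when socket.getaddrinfo cannot resolve the leader's VPN endpoint hostname, after agent.env has already been modified by the sed command. As a minimal sketch of the idea (not the code shipped in Core), the endpoint could be resolved before any local state is changed. The helper name resolve_endpoint is hypothetical, and the host:port endpoint format is assumed from the redis-cli command in this issue.

```python
import socket


def resolve_endpoint(endpoint):
    """Resolve a 'host:port' VPN endpoint over UDP.

    Returns (ip_address, port) on success, or None if the endpoint is
    malformed or the host name cannot be resolved.
    Hypothetical helper, for illustration only.
    """
    host, _, port = endpoint.rpartition(":")
    host = host.strip("[]")  # tolerate a bracketed IPv6 literal
    if not host or not port.isdigit():
        return None
    try:
        # Same lookup that fails in the traceback above.
        infos = socket.getaddrinfo(host, int(port), proto=socket.IPPROTO_UDP)
    except socket.gaierror:
        return None
    return infos[0][4][0], int(port)
```

A caller would abort the join with a validation error when the result is None, instead of raising midway through 30start_replication.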

The action create-cluster on RL2 fails with

Error: NAME_CONFLICT: new_service(): 'ns-wireguard'

Components

Core 2.8.4

Thanks to @nrauso

DavidePrincipi commented 5 days ago

Test case 0

Install two nodes with the Core testing release 2.8.5-dev.2
Set an invalid domain in the leader node FQDN
Try to join a node: the invalid FQDN is shown in the join validation error
Proceed with test case 1, by changing the leader FQDN as written above in the bug description

Test case 1

Check that the join works after fixing the VPN endpoint with this command (assuming 1 is the NODE_ID of the leader):

redis-cli hset node/1/vpn endpoint rl1.dp.nethserver.net:55820

The bug is fixed if the worker node is still capable of joining the cluster after a failed attempt.
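
For repeated test runs, the redis-cli command from test case 1 can be built with a basic sanity check on the endpoint value. This is only a sketch: vpn_endpoint_command is a hypothetical helper, the node/<NODE_ID>/vpn key and the endpoint field are taken from the command above, and the accepted host name characters are an assumption.

```python
import re

# Assumed endpoint shape: DNS name or IPv4 address, a colon, then a port.
ENDPOINT_RE = re.compile(r"^[A-Za-z0-9.-]+:(\d{1,5})$")


def vpn_endpoint_command(node_id, endpoint):
    """Return the redis-cli argument list that sets a node's VPN endpoint.

    Raises ValueError if the endpoint is not in host:port form.
    Hypothetical helper, for illustration only.
    """
    match = ENDPOINT_RE.match(endpoint)
    if match is None or not 0 < int(match.group(1)) < 65536:
        raise ValueError(f"invalid VPN endpoint: {endpoint!r}")
    return ["redis-cli", "hset", f"node/{node_id}/vpn", "endpoint", endpoint]
```

The returned list can be passed to subprocess.run(cmd, check=True) on the leader node. The check only covers the format, so a host name that does not resolve is still accepted.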

nrauso commented 4 days ago

test case 0: VERIFIED

In the event of an invalid domain for the leader, the join attempts generate a clear error (screenshots: join01, join02).

test case 1: VERIFIED

Once the new, correct FQDN for the leader is set and the VPN endpoint is fixed in redis, the join works flawlessly.

DavidePrincipi commented 3 days ago

Released in https://github.com/NethServer/ns8-core/releases/tag/2.8.5

NethServer / dev

Firewalld ns-wireguard service name conflict #6958