NethServer / dev

NethServer issue tracker
https://github.com/NethServer/dev/issues

Firewalld ns-wireguard service name conflict #6958

Closed by DavidePrincipi 3 days ago

DavidePrincipi commented 1 week ago

After a failed join attempt, the node RL2 is left in an invalid state: it cannot rejoin the cluster or become the first node of a new cluster.

Steps to reproduce

Expected behavior

I expect the join to succeed, or to be able to recover from the error by some means.

Actual behavior

  1. Despite the error message, the RL2 UI shows a link to the leader node, giving the impression that the join succeeded in the end.

  2. If I reload the page, RL2 again shows the initial screen with the choices: create cluster, join node, restore from backup.

  3. If I choose create-cluster, the procedure configures RL2 as the leader of a new cluster, but a name conflict occurs on the ns-wireguard firewalld service.

The RL2 journal shows the original join failure:

Jun 24 07:25:05 rl2 agent@cluster[31660]: task/cluster/0aead8ad-0578-4cdd-b255-73b9cf71df4f: join-node/30start_replication is starting
Jun 24 07:25:05 rl2 traefik[32066]: 80.17.99.73 - - [24/Jun/2024:07:25:05 +0000] "GET /cluster-admin/api/cluster/task/0aead8ad-0578-4cdd-b255-73b9cf71df4f/context HTTP/2.0" 200 309 "-" "-" 177 "ApiServer-https@file" "http://127.0.0.1:9311>
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.060 * 1 changes in 5 seconds. Saving...
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.061 * Background saving started by pid 31
Jun 24 07:25:06 rl2 redis[31501]: 31:C 24 Jun 2024 07:25:06.068 * DB saved on disk
Jun 24 07:25:06 rl2 redis[31501]: 31:C 24 Jun 2024 07:25:06.068 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
Jun 24 07:25:06 rl2 redis[31501]: 1:M 24 Jun 2024 07:25:06.161 * Background saving terminated with success
Jun 24 07:25:07 rl2 agent@cluster[31660]: sed -i -e '/^AGENT_ID=/c\AGENT_ID=node/2' -e '/^REDIS_USER=/c\REDIS_USER=node/2' /var/lib/nethserver/node/state/agent.env
Jun 24 07:25:07 rl2 agent@cluster[31660]: Traceback (most recent call last):
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/var/lib/nethserver/cluster/actions/join-node/30start_replication", line 63, in <module>
Jun 24 07:25:07 rl2 agent@cluster[31660]:     cluster.vpn.initialize_wgconf(ip_address, peer={
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/usr/local/agent/pypkg/cluster/vpn.py", line 36, in initialize_wgconf
Jun 24 07:25:07 rl2 agent@cluster[31660]:     peer_ep_address = socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
Jun 24 07:25:07 rl2 agent@cluster[31660]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 24 07:25:07 rl2 agent@cluster[31660]:   File "/usr/lib64/python3.11/socket.py", line 962, in getaddrinfo
Jun 24 07:25:07 rl2 agent@cluster[31660]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Jun 24 07:25:07 rl2 agent@cluster[31660]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 24 07:25:07 rl2 agent@cluster[31660]: socket.gaierror: [Errno -2] Name or service not known
Jun 24 07:25:08 rl2 agent@cluster[31660]: task/cluster/0aead8ad-0578-4cdd-b255-73b9cf71df4f: action "join-node" status is "aborted" (1) at step 30start_replication
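The traceback above shows `socket.getaddrinfo` raising `socket.gaierror` when the leader's host name cannot be resolved, which aborts the join at step 30start_replication. As a minimal sketch, and not the actual ns8-core fix, the lookup could be wrapped so that the failure surfaces as a single readable error. The names `EndpointResolutionError` and `resolve_udp_endpoint` are hypothetical and do not come from `cluster/vpn.py`.

```python
import socket


class EndpointResolutionError(Exception):
    """Hypothetical error type: the peer VPN endpoint name did not resolve."""


def resolve_udp_endpoint(hostname: str, port: int) -> str:
    """Return the first address that `hostname` resolves to for UDP.

    Mirrors the getaddrinfo() call seen in the traceback, but converts
    socket.gaierror into an error that names the failing endpoint.
    """
    try:
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_UDP)
    except socket.gaierror as exc:
        raise EndpointResolutionError(
            f"cannot resolve VPN endpoint {hostname}:{port}: {exc}"
        ) from exc
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return results[0][4][0]
```

A caller could catch `EndpointResolutionError` before changing any node state, so that a failed lookup leaves nothing to clean up.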

The create-cluster action on RL2 then fails with:

Error: NAME_CONFLICT: new_service(): 'ns-wireguard'
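`NAME_CONFLICT` is what firewalld reports when a service with the requested name already exists, here presumably left behind by the aborted join. One way to make such a step repeatable is to check for the service before creating it. The sketch below assumes the service is managed through `firewall-cmd` with its `--permanent`, `--info-service` and `--new-service` options. The helper `ensure_service` and its injectable `run` parameter are hypothetical, and this is not necessarily how ns8-core resolved the issue.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Run a command, discard its output and return its exit code."""
    return subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def ensure_service(name: str, run: Runner = _run) -> bool:
    """Create the permanent firewalld service `name` unless it exists.

    Returns True if the service was created, False if it was already
    defined. Raises RuntimeError if the creation itself fails.
    """
    # --info-service exits non-zero when the service is not defined.
    if run(["firewall-cmd", "--permanent", f"--info-service={name}"]) == 0:
        return False
    rc = run(["firewall-cmd", "--permanent", f"--new-service={name}"])
    if rc != 0:
        raise RuntimeError(
            f"firewall-cmd --new-service={name} failed with exit code {rc}"
        )
    return True
```

Calling `ensure_service("ns-wireguard")` twice would create the service the first time and return `False` the second time instead of failing. Whether an existing definition should be reused or deleted and recreated is a design choice this sketch leaves open.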

Components

See also

Discussion (PVT) https://mattermost.nethesis.it/nethesis/pl/rqr3abki53rr9ngsrxpeow835h


Thanks to @nrauso

DavidePrincipi commented 5 days ago

Test case 0

Test case 1

Check that the join works after fixing the VPN endpoint with this command (assuming 1 is the NODE_ID of the leader):

redis-cli hset node/1/vpn endpoint rl1.dp.nethserver.net:55820

The bug is fixed if the worker node is still capable of joining the cluster after a failed attempt.
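Before writing a value such as `rl1.dp.nethserver.net:55820` to the `endpoint` field, it can help to confirm that it has the `host:port` shape. The helper below is an illustrative sketch only; `parse_endpoint` is a hypothetical name, and it does not check that the host resolves or handle bracketed IPv6 literals specially.

```python
def parse_endpoint(endpoint: str) -> tuple[str, int]:
    """Split a "host:port" string into (host, port) and validate the port.

    Raises ValueError if the separator, host or port is missing, or if
    the port is not an integer between 1 and 65535.
    """
    # Split on the last colon so the host part may itself contain colons.
    host, sep, port_text = endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError(f"endpoint {endpoint!r} is not in host:port form")
    if not port_text.isdecimal():
        raise ValueError(f"endpoint {endpoint!r} has a non-numeric port")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"endpoint {endpoint!r} has an out-of-range port")
    return host, port


host, port = parse_endpoint("rl1.dp.nethserver.net:55820")
```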

nrauso commented 4 days ago

test case 0: VERIFIED

When the leader's domain is invalid, the join attempts generate a clear error (screenshots join01 and join02).

test case 1: VERIFIED

Once the new, correct FQDN for the leader is set and the VPN endpoint is fixed in redis, the join works flawlessly.

DavidePrincipi commented 3 days ago

Released in https://github.com/NethServer/ns8-core/releases/tag/2.8.5