Migration tool duplicates Redis keys of node

DavidePrincipi commented 3 months ago

If the ns8-join command of the migration tool fails, a duplicate Redis key is generated for each failed attempt. If many failed attempts were run, the Wireguard peer table is polluted by duplicates and the wg0 configuration breaks.

Steps to reproduce

Create a cluster with a bad VPN host endpoint (expand the advanced VPN form to find it). For example, set myhost.dom.test. As a consequence, the leader FQDN is not in DNS: despite the docs, this condition is often overlooked.
Join the NS8 cluster using the IP address, e.g. ns8-join --no-tlsverify <LEADER_IP> admin Nethesis,1234
Leave the cluster, e.g. ns8-leave
Repeat the leave/join steps 5 times

Expected behavior

Join fails. Only the last join attempt is left in the Redis DB, with the highest NODE_ID.

Actual behavior

After the last join attempt, on NS7:

[root@nscom2 ~]# config show wg-quick@ns8 
wg-quick@ns8=service
    Address=10.5.4.7
    RemoteEndpoint=rl1.dom.test:55820
    RemoteKey=XXXXXXXX
    RemoteNetwork=10.5.4.0/24
    status=enabled

Node keys from the first join attempt are still in place:

[root@rl1 ~]# redis-cli keys node/*/vpn
1) "node/7/vpn"
2) "node/5/vpn"
3) "node/4/vpn"
4) "node/6/vpn"
5) "node/3/vpn"
6) "node/2/vpn"
7) "node/1/vpn"

They overwrite the Wireguard "allowed ips" field, breaking the VPN configuration:

[root@rl1 ~]# wg
interface: wg0
  public key: pfd5Bm8HnII6ZC18Ojuhrn02sBen1fvDX29KroKARxs=
  private key: (hidden)
  listening port: 55820

peer: RKUWF/SLwotQJq5OfDxUFSoHhSZ0D7kwGMAocwX9FSI=
  allowed ips: 10.5.4.5/32
  persistent keepalive: every 25 seconds

:warning: note IP 10.5.4.5, from a stale Redis node key.
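The overwrite can be illustrated with a minimal sketch. It assumes each node/N/vpn record carries a public key and an IP address; the field names public_key and ip_address, the placeholder key KEY_NS7 and the processing order are assumptions made for illustration, not taken from the core source. Since WireGuard identifies a peer by its public key, records that share a key collapse into a single peer, and whichever record is processed last decides the allowed IP:

```python
from collections import defaultdict


def build_peer_table(vpn_records):
    """Map each public key to one allowed IP; later records overwrite earlier ones."""
    peers = {}
    for _node_id, record in vpn_records:
        # One peer per public key: a second record with the same key
        # replaces the allowed IP set by the first one.
        peers[record["public_key"]] = record["ip_address"] + "/32"
    return peers


def find_duplicate_keys(vpn_records):
    """Return {public_key: [node_ids]} for keys used by more than one node."""
    by_key = defaultdict(list)
    for node_id, record in vpn_records:
        by_key[record["public_key"]].append(node_id)
    return {key: sorted(ids) for key, ids in by_key.items() if len(ids) > 1}


# Hypothetical records: node 7 is the last join attempt, node 5 a stale one.
records = [
    (7, {"public_key": "KEY_NS7", "ip_address": "10.5.4.7"}),
    (5, {"public_key": "KEY_NS7", "ip_address": "10.5.4.5"}),
]

print(build_peer_table(records))
print(find_duplicate_keys(records))
```

With this processing order the stale record wins, which matches the symptom above: NS7 is configured with 10.5.4.7 while the leader only allows 10.5.4.5.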

Components

core 2.8.1
nethserver-ns8-migration-1.0.12-1.ns7.x86_64

Thanks to @mrmarkuz

DavidePrincipi commented 3 months ago

In testing:

core 2.8.2-dev.1

nethbot commented 3 months ago

in 7.9.2009/testing:

nethserver-ns8-migration-1.0.12-1.10.g2e6b735.ns7.x86_64.rpm x86_64

DavidePrincipi commented 3 months ago

Test case 1

With core 2.8.2-dev.1, the add-node action rejects calls that pass a public_key already in use. For example, you can

configure a two-node cluster
go to the audit trail page and copy the add-node payload

Execute manually the action:

api-cli run add-node --data <PAYLOAD_HERE>

Test case 2

The bug must not be reproducible with nethserver-ns8-migration from testing, with and without core 2.8.2-dev.1 (which is just a safety net validator for the cluster).

With the testing release,

if a connection to NS8 fails, the UI shows the error message and writes it to the log.
if one or more connections failed in the past, the testing release works around them and completes the migration. Bogus Redis entries must be removed manually though.

nrauso commented 3 months ago

test case 1: VERIFIED

You're not allowed to reuse an active public key:

~]# api-cli run add-node --data - <<EOF
> {
    "endpoint": "",
    "node_pwh": "8f01f499f7dfdf55a083515e3c7706917b6b67ddebb13ca555552636a32000ae",
    "public_key": "3VdMc/oIhm5vysZVDkHZ+Vlzzryl3R6YFgT/9Dro7RA="
  }
> EOF
Warning: using user "cluster" credentials from the environment
<4>The public key 3VdMc/oIhm5vysZVDkHZ+Vlzzryl3R6YFgT/9Dro7RA= is already used by node 2
[{"field": "public_key", "parameter": "public_key", "error": "public_key_matches_existing_node", "value": "2"}]
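A check of this kind could look like the sketch below. It is not the actual add-node validator: the function name validate_public_key and the existing_nodes mapping are hypothetical, and only the shape of the error entry is copied from the output above.

```python
def validate_public_key(public_key, existing_nodes):
    """Return a list of validation errors, empty if the key is unused.

    existing_nodes is a hypothetical mapping of node ID to the public key
    currently stored for that node.
    """
    errors = []
    for node_id, used_key in existing_nodes.items():
        if used_key == public_key:
            # Same error shape as printed by api-cli in the transcript above.
            errors.append({
                "field": "public_key",
                "parameter": "public_key",
                "error": "public_key_matches_existing_node",
                "value": str(node_id),
            })
            break
    return errors


existing = {
    1: "LEADER_KEY_PLACEHOLDER",
    2: "3VdMc/oIhm5vysZVDkHZ+Vlzzryl3R6YFgT/9Dro7RA=",
}
print(validate_public_key("3VdMc/oIhm5vysZVDkHZ+Vlzzryl3R6YFgT/9Dro7RA=", existing))
print(validate_public_key("UNUSED_KEY_PLACEHOLDER", existing))
```

Rejecting the request before anything is written is what makes the validator a safety net: a repeated join with the same key can no longer create a second node record.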

test case 2: VERIFIED

When a join succeeds after previous failed attempts, the WireGuard configuration is coherent on the NS7 side:

~]# config show ns8
ns8=configuration
    Host=rl11.nr.nethserver.net
    LeaderIpAddress=10.5.4.1
    Password=MyTestPAss
    TLSVerify=disabled
    User=admin

~]# wg
interface: ns8
  public key: kN1yyzDbnAhFhw2m4dcY/nVOjgcl7M0QquKn4ZNs9i0=
  private key: (hidden)
  listening port: 44916

peer: oYouWUkvqlcYB13KmXQe67SN5dQ3AsTkttKONO3AjWg=
  endpoint: 165.232.65.11:55820
  allowed ips: 10.5.4.0/24
  latest handshake: 13 seconds ago
  transfer: 10.25 KiB received, 11.52 KiB sent

The NS8 leader and NS7 talk to each other correctly. On the NS8 side you need to clean up the bogus WireGuard configs:

~]# redis-cli keys *vpn*
1) "node/1/vpn"
2) "node/4/vpn"
3) "node/3/vpn"
4) "node/2/vpn"

~]# wg
interface: wg0
  public key: oYouWUkvqlcYB13KmXQe67SN5dQ3AsTkttKONO3AjWg=
  private key: (hidden)
  listening port: 55820

peer: kN1yyzDbnAhFhw2m4dcY/nVOjgcl7M0QquKn4ZNs9i0=
  endpoint: 164.92.229.123:44916
  allowed ips: 10.5.4.4/32
  latest handshake: 34 seconds ago
  transfer: 11.55 KiB received, 10.25 KiB sent
  persistent keepalive: every 25 seconds

peer: /Q9I0ILStidtyyo/IdGjVsveBrs3NDjAzGNJB+s7XAI=
  allowed ips: 10.5.4.2/32
  persistent keepalive: every 25 seconds

peer: 8+oya7v8BSMLjSiRQ2FVwMRUU3XkpO60JLDqa6ydVSs=
  allowed ips: 10.5.4.3/32
  persistent keepalive: every 25 seconds
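To shortlist cleanup candidates, one option is to look for peers that never completed a handshake. The sketch below parses the human-readable wg output shown above; the function name is hypothetical, and the heuristic is only an assumption: a peer without a handshake may simply be an offline node, so each candidate should be checked before its Redis entry is removed.

```python
def peers_without_handshake(wg_output):
    """Return (public_key, allowed_ips) for peers with no 'latest handshake' line."""
    peers = []
    current = None
    for raw_line in wg_output.splitlines():
        line = raw_line.strip()
        if line.startswith("interface:"):
            current = None  # interface section: ignore its attributes
        elif line.startswith("peer:"):
            current = {
                "public_key": line.split(":", 1)[1].strip(),
                "allowed_ips": None,
                "handshake": False,
            }
            peers.append(current)
        elif current is not None:
            if line.startswith("allowed ips:"):
                current["allowed_ips"] = line.split(":", 1)[1].strip()
            elif line.startswith("latest handshake:"):
                current["handshake"] = True
    return [(p["public_key"], p["allowed_ips"]) for p in peers if not p["handshake"]]


SAMPLE = """\
interface: wg0
  public key: oYouWUkvqlcYB13KmXQe67SN5dQ3AsTkttKONO3AjWg=
  private key: (hidden)
  listening port: 55820

peer: kN1yyzDbnAhFhw2m4dcY/nVOjgcl7M0QquKn4ZNs9i0=
  endpoint: 164.92.229.123:44916
  allowed ips: 10.5.4.4/32
  latest handshake: 34 seconds ago
  transfer: 11.55 KiB received, 10.25 KiB sent
  persistent keepalive: every 25 seconds

peer: /Q9I0ILStidtyyo/IdGjVsveBrs3NDjAzGNJB+s7XAI=
  allowed ips: 10.5.4.2/32
  persistent keepalive: every 25 seconds

peer: 8+oya7v8BSMLjSiRQ2FVwMRUU3XkpO60JLDqa6ydVSs=
  allowed ips: 10.5.4.3/32
  persistent keepalive: every 25 seconds
"""

for key, allowed in peers_without_handshake(SAMPLE):
    print(key, allowed)
```

On the sample above, this lists the peers with allowed IPs 10.5.4.2/32 and 10.5.4.3/32, which correspond to the failed join attempts.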

nethbot commented 3 months ago

in 7.9.2009/testing:

nethserver-ns8-migration-1.0.12-1.13.g12e7dbc.ns7.x86_64.rpm x86_64

nethbot commented 3 months ago

in 7.9.2009/updates:

nethserver-ns8-migration-1.0.13-1.ns7.x86_64.rpm x86_64

DavidePrincipi commented 3 months ago

Released https://github.com/NethServer/ns8-core/releases/tag/2.8.2

Migration tool duplicates Redis keys of node #6940