canonical / microcloud

Automated private cloud based on LXD, Ceph and OVN
https://microcloud.is
GNU Affero General Public License v3.0

unable to add more nodes to microcloud #409

Open meska opened 1 month ago

meska commented 1 month ago

Hello, every time I try to add another node to MicroCloud I get this error: context deadline exceeded

The existing cluster is 3 machines running Ubuntu Noble, with some containers already working, so I would prefer not to reset the whole cluster.

microcloud cluster list
+---------+----------------------+-------+------------------------------------------------------------------+--------+
|  NAME   |       ADDRESS        | ROLE  |                           FINGERPRINT                            | STATUS |
+---------+----------------------+-------+------------------------------------------------------------------+--------+
| swarm8  | xxx.xxx.xxx.xxx:9443 | voter | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+-------+------------------------------------------------------------------+--------+
| swarm10 | xxx.xxx.xxx.xxx:9443 | voter | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+-------+------------------------------------------------------------------+--------+
| swarm11 | xxx.xxx.xxx.xxx:9443 | voter | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+-------+------------------------------------------------------------------+--------+
root@swarm8:~# microcloud add
Limit search for other MicroCloud servers to xxx.xxx.xxx.xxx/23? (yes/no) [default=yes]:
Scanning for eligible servers ...

 Selected "swarm12" at "xxx.xxx.xxx.xxx"

Would you like to set up local storage? (yes/no) [default=yes]: no
Would you like to set up distributed storage? (yes/no) [default=yes]:
Select from the available unpartitioned disks:

Select which disks to wipe:

 Using 1 disk(s) on "swarm12" for remote storage pool

No dedicated uplink interfaces detected, skipping distributed networking
Awaiting cluster formation ...
Error: System "swarm12" failed to join the cluster: Failed to update cluster status of services: Failed to join "MicroCloud" cluster: Failed to join cluster: context deadline exceeded
root@swarm8:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core22      20240904                1621   latest/stable  canonical✓  base
core24      20240710                490    latest/stable  canonical✓  base
lxd         5.21.2-2f4ba6b          30131  5.21/stable    canonical✓  in-cohort
microceph   18.2.4+snapc9f2b08f92   1139   latest/stable  canonical✓  in-cohort
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort
snapd       2.63                    21759  latest/stable  canonical✓  snapd
root@swarm8:~#

If I try again I get these errors:

Error: Failed to issue MicroOVN token for peer "swarm12": Failed to create "internal_token_records" entry: UNIQUE constraint failed: internal_token_records.name

Error: Failed to issue MicroCloud token for peer "swarm12": Failed to create "internal_token_records" entry: UNIQUE constraint failed: internal_token_records.name

Error: Failed to issue MicroCeph token for peer "swarm12": Failed to create "internal_token_records" entry: UNIQUE constraint failed: internal_token_records.name

I'm able to clear the errors with these commands:

microcloud cluster remove swarm12
microovn cluster remove swarm12
lxc cluster revoke-token swarm12
lxd sql local "DELETE FROM certificates WHERE name='swarm12'"
lxd sql global "DELETE FROM identities WHERE name='swarm12'"
microcloud sql "DELETE FROM internal_token_records WHERE name='swarm12'"
microovn cluster sql "DELETE FROM internal_token_records WHERE name='swarm12'"
microceph cluster sql "DELETE FROM internal_token_records WHERE name='swarm12'"
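
The cleanup commands above can be gathered into one function so the same sequence runs for any stuck peer. This is only a sketch: `cleanup_peer`, `run` and `DRY_RUN` are hypothetical names, not part of MicroCloud, and the commands are copied from this thread rather than from official documentation. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch: wraps the cleanup commands from this thread in one function.
# cleanup_peer, run and DRY_RUN are hypothetical names for illustration.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

cleanup_peer() {
    peer="$1"
    # Reject empty names and anything that could break out of the SQL quotes.
    case "$peer" in
        ''|*[!A-Za-z0-9._-]*)
            echo "invalid peer name: $peer" >&2
            return 1
            ;;
    esac
    run microcloud cluster remove "$peer"
    run microovn cluster remove "$peer"
    run lxc cluster revoke-token "$peer"
    run lxd sql local "DELETE FROM certificates WHERE name='$peer'"
    run lxd sql global "DELETE FROM identities WHERE name='$peer'"
    run microcloud sql "DELETE FROM internal_token_records WHERE name='$peer'"
    run microovn cluster sql "DELETE FROM internal_token_records WHERE name='$peer'"
    run microceph cluster sql "DELETE FROM internal_token_records WHERE name='$peer'"
}
```

Calling `cleanup_peer swarm12` prints the eight commands (the dry-run output drops the outer quotes around the SQL); `DRY_RUN=0 cleanup_peer swarm12` would execute them on an existing cluster member.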

But if I try again I always end up with Failed to join cluster: context deadline exceeded. After this error I also see this:

root@swarm8:~# microcloud cluster list
+---------+----------------------+----------+------------------------------------------------------------------+--------+
|  NAME   |       ADDRESS        |   ROLE   |                           FINGERPRINT                            | STATUS |
+---------+----------------------+----------+------------------------------------------------------------------+--------+
| swarm8  | xxx.xxx.xxx.xxx:9443 | voter    | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+----------+------------------------------------------------------------------+--------+
| swarm10 | xxx.xxx.xxx.xxx:9443 | voter    | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+----------+------------------------------------------------------------------+--------+
| swarm11 | xxx.xxx.xxx.xxx:9443 | voter    | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+----------+------------------------------------------------------------------+--------+
| swarm12 | xxx.xxx.xxx.xxx:9443 | stand-by | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ONLINE |
+---------+----------------------+----------+------------------------------------------------------------------+--------+

(The microceph, lxc and microovn cluster lists do not include swarm12.)

root@swarm12:~# microcloud cluster list
Error: Daemon not yet initialized

I have already tried purging and reinstalling everything on the new server. The servers have nearly the same hardware, the same Ubuntu distribution and kernel, 1 SSD for the OS and 1 empty HDD for Ceph.

Sometimes I see this line in the syslog:

2024-10-01T18:47:36.442796+02:00 swarm8 kernel: audit: type=1400 audit(1727801256.440:7764): apparmor="DENIED" operation="open" class="file" profile="snap.microcloud.microcloud" name="/var/lib/snapd/hostfs/etc/ssl/certs/ca-certificates.crt" pid=1788204 comm="microcloud" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

I already added the rule /var/lib/snapd/hostfs/etc/ssl/certs/ca-certificates.crt r, to the AppArmor profile. The denial disappears, but the join still does not work.

Do you have any clue what to try? If I use microcloud add --verbose --debug, where can I see the debug output?

Or is there a way to configure the cluster manually, perhaps by editing some files on the new node?

masnax commented 1 month ago

Looks like swarm12 got into an inconsistent state across the cluster, so you can try fully cleaning it up and trying again:

You mentioned swarm12 only appears in the microcloud cluster, so you can use microcloud cluster remove swarm12 --force to get rid of it. Clean up the join tokens on the other systems with the sql commands, like you did before. You can also check whether there is a file at /var/snap/microcloud/common/state/truststore/swarm12.yaml on existing members and remove it if so.
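
The truststore check can be scripted so it behaves the same on every existing member. A minimal sketch, assuming the truststore holds one `<name>.yaml` file per peer as the comment above implies; `remove_truststore_entry` is a hypothetical helper, and the directory is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Sketch: remove a leftover truststore entry for a peer, if one exists.
# remove_truststore_entry is a hypothetical helper, not a MicroCloud command.
remove_truststore_entry() {
    dir="$1"    # truststore directory on an existing cluster member
    peer="$2"   # name of the peer that failed to join, e.g. swarm12
    file="$dir/$peer.yaml"
    if [ -f "$file" ]; then
        rm -- "$file" && echo "removed $file"
    else
        echo "no entry for $peer in $dir"
    fi
}
```

Run it on each existing member, passing the truststore directory mentioned above and the peer name, before retrying microcloud add.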

Then uninstall and reinstall the snaps on swarm12, and try microcloud add again.

Logs should be available via snap logs microcloud. Capture them on both the joiner and the system running microcloud add.

meska commented 1 month ago

So I cleaned all three working nodes and checked the truststore for leftover files (rm /var/snap/*/common/state/truststore/swarm12.yaml).

On the joiner I ran:

snap remove microcloud lxd microceph microovn --purge && snap install lxd --channel=5.21/stable --cohort="+" && snap install microceph --channel=latest/stable --cohort="+" && snap install microovn --channel=22.03/stable --cohort="+" && snap install microcloud --channel=latest/stable --cohort="+" && reboot

The system running microcloud add --verbose --debug stays silent under snap logs microcloud -f, but the logs on the joiner are full of mDNS spam from printers, plus this line:

2024-10-01T23:26:25+02:00 microcloud.daemon[2175]: time="2024-10-01T23:26:25+02:00" level=error msg="Failed to start server" err="accept tcp [::]:9443: use of closed network connection"

...but judging from a response you gave in another thread, this is not an issue.

After all this, I get the same error: context deadline exceeded

mdns spam:

2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{Brother_HL_L6
2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{Brother_HL_L6
2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{EtcMag\ \@\ w
2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{TestClass\ \@
2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{TestClass\ \@
2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{TestClass\ \@
2024-10-01T23:04:10+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:10 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{TestClass\ \@
2024-10-01T23:04:11+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:11 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{HP_LaserJet_M
2024-10-01T23:04:11+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:11 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{HP_LaserJet_M
2024-10-01T23:04:11+02:00 microcloud.daemon[1070]: 2024/10/01 23:04:11 [ERR] mdns: Failed to handle query: [ERR] mdns: support for DNS requests with high truncated bit not implemented: {{0 false 0 false true false false false false false 0} false [{HP_LaserJet_M
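
When reading the joiner's logs, the repeated mDNS lines can be filtered out so other messages stay visible. A small sketch, assuming GNU grep as shipped on Ubuntu; `filter_mdns_noise` is a hypothetical name, and the pattern is taken from the log lines above.

```shell
#!/bin/sh
# Sketch: drop the repeated mDNS query errors from a log stream on stdin.
# filter_mdns_noise is a hypothetical helper name.
filter_mdns_noise() {
    # --line-buffered keeps output flowing when the input is a live stream.
    grep --line-buffered -v 'mdns: Failed to handle query'
}
```

For example, snap logs microcloud -f | filter_mdns_noise follows the daemon log without the printer traffic.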
roosterfish commented 1 week ago

@meska we have just released a new version of MicroCloud that uses a slightly different mechanism for the multicast discovery.

If you can, please upgrade your existing MicroCloud and check whether you can add the additional nodes. See the release post.