canonical / microceph

Ceph for a one-rack cluster and appliances
https://snapcraft.io/microceph
GNU Affero General Public License v3.0

Mix of reef snap versions results in microceph.daemon service failures #367

Open javacruft opened 3 weeks ago

javacruft commented 3 weeks ago

Issue Report

What version of MicroCeph are you using?

reef/edge, but with a mix of revisions: 981 and 1026

microceph             18.2.0+snap556b907075  1026   reef/edge           canonical**  - on node-01
microceph             18.2.0+snap71f71782c5  981    reef/edge           canonical**  held other nodes
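A quick way to spot this condition is to compare the installed snap revision across all nodes. A minimal sketch, assuming you have collected one "host revision" line per node (for example from the REV column of `snap list microceph` on each machine); the `check_revisions` helper name and the sample input are illustrative, not part of MicroCeph:

```shell
# Illustrative helper (not part of MicroCeph): report whether a set of
# "<host> <revision>" lines contains more than one snap revision.
check_revisions() {
  awk '{ revs[$2]++ }
       END {
         n = 0
         for (r in revs) n++
         print (n > 1 ? "MISMATCH" : "OK")
       }'
}

# Sample input mirroring the report: node-01 on 1026, the rest held at 981.
printf '%s\n' 'node-01 1026' 'node-02 981' 'node-03 981' | check_revisions
# prints "MISMATCH"
```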

What are the steps to reproduce this issue?

Multi-node local deployment using https://microstack.run/docs

What happens (observed behaviour)?

I believe one of the snaps refreshed, which then caused some of the clustering daemons to fail with the following error:

Jun 13 12:40:04 node-02.dom systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: Error: Unable to start daemon: Daemon failed to start: Failed to re-establish cluster connection: Failed to update schema version when joining cluster: no such column: schema
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: time="2024-06-13T12:40:04+02:00" level=info msg="Daemon stopped"
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: time="2024-06-13T12:40:04+02:00" level=debug msg="Database error" err="Failed to update schema version when joining cluster: no such column: schema"

What were you expecting to happen?

For the mix of different revisions to handle this upgrade/change to the schema more gracefully, rather than the daemon failing outright.

fnordahl commented 3 weeks ago

FWIW, we are seeing a similar issue and are discussing it with the LXD/Microcluster team in https://github.com/canonical/microovn/pull/121

UtkarshBhatthere commented 3 weeks ago

@sabaini is this the schema incompatibility thingy you mentioned yesterday ?

UtkarshBhatthere commented 3 weeks ago

@masnax any pointers on making it compatible with older revisions ?

sabaini commented 3 weeks ago

@UtkarshBhatthere yes, I was referring to this

sabaini commented 3 weeks ago

I could reproduce this locally by upgrading one out of three nodes from stable to edge.

Steps:

1. Bootstrap a three-node MicroCeph cluster from the stable channel
2. Refresh the microceph snap on one node to the edge channel

In /var/log/syslog I see these messages:

Jun 14 08:29:48 aa-0 microceph.daemon[9040]: time="2024-06-14T08:29:48Z" level=debug msg="Database error" err="schema check gracefully aborted"
Jun 14 08:29:48 aa-0 microceph.daemon[9040]: time="2024-06-14T08:29:48Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://240.22.0.77:7443"

Which seems to hint at a schema migration issue

sabaini commented 3 weeks ago

Ticket CEPH-766

mkalcok commented 3 weeks ago

A bit more context can also be found here: https://github.com/canonical/microcluster/issues/66. The bottom line is that this is currently expected behavior: if there is a DB schema change, all members of the cluster must upgrade before the API becomes available again. We are in talks (see the last few comments in the PR mentioned by @fnordahl) about improving the error message.
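For illustration, the gating behavior described above can be modeled as a toy sketch. Everything here (the `schema_gate` name, the "member:schema_version" pairs) is made up to show the idea; this is not microcluster's actual implementation:

```shell
# Toy model of the upgrade gate: members are "name:schema_version" pairs,
# and the API only becomes available once every member reports the same
# schema version. Purely illustrative; not microcluster code.
schema_gate() {
  distinct=$(printf '%s\n' "$@" | cut -d: -f2 | sort -u | wc -l)
  if [ "$distinct" -eq 1 ]; then
    echo "API available"
  else
    echo "Waiting for other cluster members to upgrade their versions"
  fi
}

schema_gate node-01:2 node-02:1 node-03:1   # mid-upgrade: API stays blocked
schema_gate node-01:2 node-02:2 node-03:2   # all members upgraded: API returns
```

This also matches the warning seen in the syslog excerpt above: the remaining members simply wait until every node has refreshed to the same schema version.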

masnax commented 1 week ago

By the way, this should have been fixed by #371, which included https://github.com/canonical/microcluster/pull/150.