canonical / microceph

MicroCeph is snap-deployed Ceph with built-in clustering
https://snapcraft.io/microceph
GNU Affero General Public License v3.0
195 stars 28 forks source link

Upgrade: microcephd fails "Control socket already present" #385

Closed sabaini closed 2 weeks ago

sabaini commented 2 weeks ago

Issue report

What version of MicroCeph are you using ?

Upgrade from reef/stable to reef/candidate

What are the steps to reproduce this issue ?

What happens (observed behaviour) ?

MicroCeph cluster commands are erroring b/c microcephd fails to come up post-upgrade, complaining about

Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[7956]: Error: Daemon stopped with error: Control socket already present ("/var/snap/microceph/common/state/control.socket"); is another daemon already running?
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.

What were you expecting to happen ?

MicroCeph cluster commands to work, e.g. sudo microceph cluster list

Relevant logs, error output, etc.

More context from syslog

Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=info msg=" - binding control socket" socket=/var/snap/microceph/common/state/control.socket
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=info msg=" - binding https socket" network="10.18.140.41:7443"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:38472" remote="10.18.140.32:7443"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:38480" remote="10.18.140.32:7443"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:38736" remote="10.18.140.198:7443"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:48242" remote="10.18.140.41:7443"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Matched trusted cert" fingerprint=a02c784bb8099ea29278af7d6f6cc8fa37d9c2a0350d47edbe4cd8195b161afe subject="CN=root@juju-5b092d-0,O=LXD"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Trusting HTTP request to \"/cluster/internal/database\" from \"10.18.140.41:48242\" with fingerprint \"a02c784bb8099ea29278af7d6f6cc8fa37d9c2a0350d47edbe4cd8195b161afe\""
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:38488" remote="10.18.140.32:7443"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=debug msg="Database error" err="schema check gracefully aborted"
Jul 10 14:11:21 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:21Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://10.18.140.41:7443"
Jul 10 14:11:22 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:22Z" level=debug msg="Matched trusted cert" fingerprint=ba757899d79b07c06b0cb9a75ace56fa393c9f821cf0c59219d760c3cee69889 subject="CN=root@juju-5b092d-2,O=LXD"
Jul 10 14:11:22 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:22Z" level=debug msg="Trusting HTTP request to \"/cluster/internal/database\" from \"10.18.140.32:45156\" with fingerprint \"ba757899d79b07c06b0cb9a75ace56fa393c9f821cf0c59219d760c3cee69889\""
Jul 10 14:11:22 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:22Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:38500" remote="10.18.140.32:7443"
Jul 10 14:11:22 juju-5b092d-0 snapd[225]: storehelpers.go:923: cannot refresh snap "microceph": snap has no updates available
Jul 10 14:11:22 juju-5b092d-0 systemd[1]: session-8.scope: Deactivated successfully.
Jul 10 14:11:30 juju-5b092d-0 systemd[1]: Started snap.microceph.microceph-462e6a1a-7576-4572-88f8-4309f0289993.scope.
Jul 10 14:11:30 juju-5b092d-0 systemd[1]: snap.microceph.microceph-462e6a1a-7576-4572-88f8-4309f0289993.scope: Deactivated successfully.
Jul 10 14:11:34 juju-5b092d-0 systemd[1]: Started snap.microceph.microceph-2b442c87-7855-4f0f-8e50-9cdae488457a.scope.
Jul 10 14:11:34 juju-5b092d-0 systemd[1]: snap.microceph.microceph-2b442c87-7855-4f0f-8e50-9cdae488457a.scope: Deactivated successfully.
Jul 10 14:11:34 juju-5b092d-0 systemd[1]: Started snap.microceph.microceph-56f4a664-7130-475c-beee-d774509528f3.scope.
Jul 10 14:11:34 juju-5b092d-0 systemd[1]: snap.microceph.microceph-56f4a664-7130-475c-beee-d774509528f3.scope: Deactivated successfully.
Jul 10 14:11:40 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:40Z" level=debug msg="Matched trusted cert" fingerprint=ba757899d79b07c06b0cb9a75ace56fa393c9f821cf0c59219d760c3cee69889 subject="CN=root@juju-5b092d-2,O=LXD"
Jul 10 14:11:40 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:40Z" level=debug msg="Trusting HTTP request to \"/cluster/internal/database\" from \"10.18.140.32:37588\" with fingerprint \"ba757899d79b07c06b0cb9a75ace56fa393c9f821cf0c59219d760c3cee69889\""
Jul 10 14:11:46 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:46Z" level=debug msg="Matched trusted cert" fingerprint=388a6c6ee48b590526936e1946d125a2c59c213d27d252ca1bc6d5a73144fe17 subject="CN=root@juju-5b092d-1,O=LXD"
Jul 10 14:11:46 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:46Z" level=debug msg="Trusting HTTP request to \"/cluster/internal/database\" from \"10.18.140.198:47938\" with fingerprint \"388a6c6ee48b590526936e1946d125a2c59c213d27d252ca1bc6d5a73144fe17\""
Jul 10 14:11:51 juju-5b092d-0 systemd[1]: Started snap.microceph.microceph-7b0a7ed4-d4ca-4ec0-8714-ce225e445f06.scope.
Jul 10 14:11:51 juju-5b092d-0 systemd[1]: snap.microceph.microceph-7b0a7ed4-d4ca-4ec0-8714-ce225e445f06.scope: Deactivated successfully.
Jul 10 14:11:51 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:51Z" level=debug msg="Dqlite connected outbound" local="10.18.140.41:54994" remote="10.18.140.32:7443"
Jul 10 14:11:51 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:51Z" level=debug msg="Closing database after upgrade notification" address="https://10.18.140.41:7443"
Jul 10 14:11:51 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:51Z" level=info msg="Closing filesystem watcher"
Jul 10 14:11:51 juju-5b092d-0 microceph.daemon[7569]: time="2024-07-10T14:11:51Z" level=info msg="Daemon stopped"
Jul 10 14:11:51 juju-5b092d-0 microceph.daemon[7569]: Error: Daemon stopped with error: Daemon failed to start: Failed to re-establish cluster connection: Failed to join dqlite cluster failed to autobind unix socket: listen unix @snap.microceph.dqlite: bind: address already in use
Jul 10 14:11:51 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 14:11:51 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 1.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Stopped Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Started Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[7956]: time="2024-07-10T14:11:52Z" level=info msg="Daemon stopped"
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[7956]: Error: Daemon stopped with error: Control socket already present ("/var/snap/microceph/common/state/control.socket"); is another daemon already running?
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 2.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Stopped Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Started Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[7986]: time="2024-07-10T14:11:52Z" level=info msg="Daemon stopped"
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[7986]: Error: Daemon stopped with error: Control socket already present ("/var/snap/microceph/common/state/control.socket"); is another daemon already running?
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 3.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Stopped Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Started Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[8016]: time="2024-07-10T14:11:52Z" level=info msg="Daemon stopped"
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[8016]: Error: Daemon stopped with error: Control socket already present ("/var/snap/microceph/common/state/control.socket"); is another daemon already running?
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 4.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Stopped Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 systemd[1]: Started Service for snap application microceph.daemon.
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[8044]: time="2024-07-10T14:11:52Z" level=info msg="Daemon stopped"
Jul 10 14:11:52 juju-5b092d-0 microceph.daemon[8044]: Error: Daemon stopped with error: Control socket already present ("/var/snap/microceph/common/state/control.socket"); is another daemon already running?

Additional comments.

Note this does not seem to happen on a single node install which leads me to suspect an issue around microcluster upgrade and version handling

sabaini commented 2 weeks ago

@masnax wondering if you have seen something like this before?

MggMuggins commented 2 weeks ago

Introduced in https://github.com/canonical/microcluster/commit/caa46ad83776ba6063ec3ec7eee48fb9412ed15c and reverted in https://github.com/canonical/microcluster/commit/79dbf7b4fe5f4c1e8590fe8c32b10be47df06f00; updating your MicroCluster dep should resolve this.

sabaini commented 2 weeks ago

Can confirm #383 fixes this.