aquarist-labs / aquarium

Project Aquarium is a SUSE-sponsored open source project aiming at becoming an easy to use, rock solid storage appliance based on Ceph.
https://aquarist-labs.io/

`cephadm bootstrap` failures are spectacularly inscrutable #707

Closed: tserong closed this issue 1 year ago

tserong commented 2 years ago

After you hit the "Install" button when creating a new cluster, Aquarium goes off and runs `cephadm bootstrap`. If this fails for some reason, all you see is a red box at the bottom of the screen which says "Failed to bootstrap the system", with no further information about what might be wrong:

*(screenshot: "Failed to bootstrap the system" error box)*

This error message disappears after a short amount of time, and your only option is to hit the "Install" button again and hope for a different outcome.

The aquarium log (`journalctl -u aquarium`) will give you something baffling, like:

```
[ERROR] 2021-10-14T08:43:52 -- deployment -- bootstrap finish error: error bootstrapping: rc = 1
[INFO ] 2021-10-14T08:43:52 -- mgr -- finish deployment config
[ERROR] 2021-10-14T08:43:52 -- base_events -- Task exception was never retrieved
future: <Task finished name='Task-407' coro=<Bootstrap._do_bootstrap() done, defined at /srv/aquarium/src/./gravel/controllers/nodes/bootstrap.py:83> exception=FileNotFoundError('/etc/ceph/ceph.conf')>
Traceback (most recent call last):
  File "/srv/aquarium/src/./gravel/controllers/nodes/bootstrap.py", line 106, in _do_bootstrap
    await cb(False, f"error bootstrapping: rc = {retcode}")
  File "/srv/aquarium/src/./gravel/controllers/nodes/deployment.py", line 418, in finish_bootstrap_cb
    await post_bootstrap_cb(success, error)
  File "/srv/aquarium/src/./gravel/controllers/nodes/mgr.py", line 419, in _post_bootstrap_finisher
    await self._post_bootstrap_config()
  File "/srv/aquarium/src/./gravel/controllers/nodes/mgr.py", line 434, in _post_bootstrap_config
    mon.set_allow_pool_size_one()
  File "/srv/aquarium/src/./gravel/controllers/orch/ceph.py", line 390, in set_allow_pool_size_one
    r = self.config_set("global", "mon_allow_pool_size_one", "true")
  File "/srv/aquarium/src/./gravel/controllers/orch/ceph.py", line 262, in config_set
    self.call(cmd)
  File "/srv/aquarium/src/./gravel/controllers/orch/ceph.py", line 199, in call
    return self.ceph.mon(cmd)
  File "/srv/aquarium/src/./gravel/controllers/orch/ceph.py", line 174, in mon
    self.connect()
  File "/srv/aquarium/src/./gravel/controllers/orch/ceph.py", line 99, in connect
    self._check_config()
  File "/srv/aquarium/src/./gravel/controllers/orch/ceph.py", line 96, in _check_config
    raise FileNotFoundError(self.conf_file)
FileNotFoundError: /etc/ceph/ceph.conf
```
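The "Task exception was never retrieved" part is asyncio's way of saying that the bootstrap coroutine was fired off as a task nothing ever awaited or checked, so when the error callback itself blew up (here with `FileNotFoundError`, presumably because the failed bootstrap never wrote `/etc/ceph/ceph.conf`), the exception only surfaced in the event loop's logger rather than anywhere the UI could see it. A minimal sketch of that pattern, with hypothetical names rather than Aquarium's actual code:

```python
import asyncio


async def do_bootstrap() -> None:
    # Stand-in for the real bootstrap: it fails, so ceph.conf is never
    # written and a later step trips over the missing file.
    raise FileNotFoundError("/etc/ceph/ceph.conf")


async def main() -> None:
    # Fire-and-forget: nothing awaits the task or calls .result(), so its
    # exception is only reported when the task object is garbage collected.
    asyncio.create_task(do_bootstrap())
    await asyncio.sleep(0.1)


asyncio.run(main())
# asyncio typically logs "Task exception was never retrieved" plus the
# FileNotFoundError traceback, instead of propagating the error to a
# caller that could show it in the UI.
```

Awaiting the task (or checking its result in a done callback) would at least turn this into an error that can be reported deliberately instead of vanishing into the log.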

The actual cause of the problem can be found buried in `/var/log/ceph/cephadm.log`. In my case it was:

```
2021-10-14 08:43:52,913 INFO /usr/bin/podman: stderr Trying to pull docker.io/ceph/ceph:v16...
2021-10-14 08:43:52,913 INFO /usr/bin/podman: stderr Error: initializing source docker://ceph/ceph:v16: pinging container registry registry-1.docker.io: Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:44106->[::1]:53: read: connection refused
```

This was caused by me breaking my network configuration and having no default route, so of course `podman pull` couldn't do its thing. The problem is that I had no idea why until I went digging.

IMO we need to:

1) Somehow propagate more useful failure information from `cephadm bootstrap` stdout/stderr/log/whatever to the UI (see the sketch after this list).
2) See if we can make `/var/log/ceph/cephadm.log` any easier to read. AFAICT there's no indication in there of which cephadm command was invoked (bootstrap, inventory, ...); instead you have to correlate with the aquarium log to see what was invoked, and when.

(Item 2 is almost certainly something to address in cephadm itself and not specific to Aquarium)
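For item 1, one possible approach would be to capture cephadm's stderr and pass its tail along with the return code, so the UI has something more useful than "rc = 1" to display. This is just a sketch under that assumption, not how Aquarium's bootstrap code is actually structured; the helper names and the `--mon-ip` value are made up:

```python
import asyncio
from typing import Awaitable, Callable


async def run_cephadm_bootstrap(args: list[str]) -> tuple[int, str]:
    """Run `cephadm bootstrap ...` and return (rc, tail of stderr)."""
    proc = await asyncio.create_subprocess_exec(
        "cephadm", "bootstrap", *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    # Keep only the last few lines so the message shown to the user stays short.
    tail = "\n".join(stderr.decode(errors="replace").splitlines()[-5:])
    return proc.returncode, tail


async def bootstrap_and_report(
    cb: Callable[[bool, str], Awaitable[None]]
) -> None:
    rc, tail = await run_cephadm_bootstrap(["--mon-ip", "192.168.1.1"])
    if rc != 0:
        # Include the captured stderr instead of just "rc = 1", so the
        # failure reason can be surfaced in the UI and not only in the logs.
        await cb(False, f"error bootstrapping: rc = {rc}\n{tail}")
    else:
        await cb(True, "")
```

Even just echoing the last few lines of cephadm's stderr into the red error box (and keeping it on screen) would have pointed straight at the registry/DNS failure above.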

tserong commented 2 years ago

Still inscrutable in the UI when testing #758, but the visuals are slightly different:

*(screenshot of the updated error UI)*

Here's what `journalctl -u aquarium` had to say during my test, so at least the log is helpful:

```
Dec 21 07:15:04 node1 uvicorn[1444]: INFO:     2021-12-21 07:15:04 -- create -- Bootstrap complete with success.
Dec 21 07:15:06 node1 uvicorn[1444]: ERROR:    2021-12-21 07:15:06 -- ceph -- error running command: rc = -2, reason = all mgr daemons do not support module 'bubbles', pass --force to force enablement
Dec 21 07:15:06 node1 uvicorn[1444]: ERROR:    2021-12-21 07:15:06 -- ceph -- unable to enable module bubbles: all mgr daemons do not support module 'bubbles', pass --force to force enablement
Dec 21 07:15:06 node1 uvicorn[1444]: ERROR:    2021-12-21 07:15:06 -- create -- Unable to start Bubbles.
Dec 21 07:15:06 node1 uvicorn[1444]: ERROR:    2021-12-21 07:15:06 -- create -- Create error: Failed configuring the deployment.
Dec 21 07:15:06 node1 uvicorn[1444]: INFO:     2021-12-21 07:15:06 -- create -- Waiting for task to finish.
Dec 21 07:15:06 node1 uvicorn[1444]: ERROR:    2021-12-21 07:15:06 -- mgr -- Error creating deployment: Failed configuring the deployment.
```