NixOS / nixpkgs

ceph 18.2.4 broken with cryptsetup/dmcrypt #334227

Status: Closed (benaryorg closed 1 month ago)

benaryorg commented 1 month ago

Describe the bug

Ceph version 18.2.4 is currently not usable with dmcrypt volumes. It has been backported to NixOS 24.05, so it may break existing clusters shortly.

Steps To Reproduce

Steps to reproduce the behavior:

  1. upgrade existing system which uses ceph-volume lvm activate under the hood (with a dmcrypt'd bluestore)
  2. try to use said command to activate the volume

I am experiencing this with commit 8f4cb508c33212aa69ae22958d03c0ba9a906f5b. I'm pretty sure #330226 (backported via #333401) is the culprit, but I haven't confirmed this yet.

Expected behavior

The bluestore volume is activated.

Additional context

The issue was raised on the ceph-users mailing list, which points to a missing backport of a commit that supposedly fixes this (I haven't tested that). As far as I can see, that commit has only been merged to main.

This is the relevant log bit when run with the current NixOS 24.05 version:

[2024-08-12 19:49:51,435][ceph_volume.process][INFO  ] Running command: /run/current-system/sw/bin/cryptsetup --version
[2024-08-12 19:49:51,441][ceph_volume.process][INFO  ] stdout cryptsetup 2.7.3 flags: UDEV BLKID KEYRING KERNEL_CAPI HW_OPAL
[2024-08-12 19:49:51,441][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/main.py", line 46, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/activate.py", line 283, in main
    self.activate(args)
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/activate.py", line 211, in activate
    activate_bluestore(lvs, args.no_systemd, getattr(args, 'no_tmpfs', False))
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/activate.py", line 73, in activate_bluestore
    encryption_utils.set_dmcrypt_no_workqueue()
  File "/nix/store/7kmgdpba0g27lr90q3xh8ckzkwh4hk0f-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/util/encryption.py", line 22, in set_dmcrypt_no_workqueue
    if version.parse(out[0]) >= version.parse(f'cryptsetup {target_version}'):
       ^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/384k25grc8j05ngxfmjjia41np124jkw-python3-3.11.9-env/lib/python3.11/site-packages/packaging/version.py", line 54, in parse
    return Version(version)
           ^^^^^^^^^^^^^^^^
  File "/nix/store/384k25grc8j05ngxfmjjia41np124jkw-python3-3.11.9-env/lib/python3.11/site-packages/packaging/version.py", line 200, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: 'cryptsetup 2.7.3 flags: UDEV BLKID KEYRING KERNEL_CAPI HW_OPAL '

(i.e. the whole line, including the flags, is passed to the version parser, which rejects it)
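
To illustrate why the full line cannot be treated as a version, here is a minimal stdlib-only sketch. It does not use `packaging`; the `STRICT_VERSION` pattern and the `is_plain_version` helper are hypothetical stand-ins for a strict version parser, not code from ceph-volume.

```python
import re

# Hypothetical stand-in for a strict version parser: the whole string must be
# a dotted number. (packaging.version.Version accepts more forms than this.)
STRICT_VERSION = re.compile(r"\d+(\.\d+)*")


def is_plain_version(text: str) -> bool:
    """Return True only if the entire (stripped) text is a dotted number."""
    return STRICT_VERSION.fullmatch(text.strip()) is not None


# The line logged by `cryptsetup --version` in this issue.
raw = "cryptsetup 2.7.3 flags: UDEV BLKID KEYRING KERNEL_CAPI HW_OPAL "

print(is_plain_version(raw))             # False: program name and flags are in the way
print(is_plain_version(raw.split()[1]))  # True: only the second field is the version
```

Any strict parser that expects the entire input to be a version behaves like the first call, which is roughly what the `InvalidVersion` traceback above shows.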

Metadata

% nix run nixpkgs#nix-info -- -m 
 - system: `"x86_64-linux"`
 - host os: `Linux 6.6.45, NixOS, 24.05 (Uakari), 24.05pre-git`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Lix, like Nix) 2.90.0`
 - nixpkgs: `/nix/store/yj1w6dachan720i1m4037jllrf2r5xrb-source`

nh2 commented 1 month ago

a backport for a commit that fixes this (supposedly, haven't tested that)

@benaryorg Could you test whether the patch fixes it?

Neither I nor any NixOS tests use ceph-volume, so it's not so easy to test for me.

If it fixes it for you, we can merge and backport a fix quickly.

Afterwards we should also write a NixOS VM test for this, so that you're protected from this via automation!

nh2 commented 1 month ago

@benaryorg For convenience of testing:

benaryorg commented 1 month ago

@nh2 I started doing some testing, but given how long Ceph takes to compile (don't get me started) I just shoved it into my hydra and stopped caring. Turns out I shouldn't have picked the linked 607eb34b2c278566c386efcbf3018629cf08ccfd from main, but instead 5df13b4197a10f0209a535a30ca9b9e5e6a12fdb which is the patch that was backported onto the reef branch (with no release yet), so the patch didn't apply and me doing something else for an hour while hydra idled around was for naught.

I just grabbed the code from your linked PR and shoved it into my overlay (testing this on 24.05 still), and the build process is looking good (IPv6 only) for now. Considering that building Ceph on my local 24 cores took half an hour, building with that hydra will likely take about two hours or more, so while I'll have my eyes on it for a bit to make sure it runs properly, I'll probably go to bed and get back to you in ~9h when it's built (at which point I'll be able to pull the exact build from hydra onto the server, making sure that the exact patches work).

benaryorg commented 1 month ago

Okay, the local 24 cores were a bit faster (the hashes still match so it is the same build that my hydra is still struggling with), but now I'm getting this error which is ever so slightly different, but still hints in the same direction:

[2024-08-13 04:23:30,566][ceph_volume.process][INFO  ] Running command: /run/current-system/sw/bin/cryptsetup --version
[2024-08-13 04:23:30,571][ceph_volume.process][INFO  ] stdout cryptsetup 2.7.3 flags: UDEV BLKID KEYRING KERNEL_CAPI HW_OPAL
[2024-08-13 04:23:30,572][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/main.py", line 46, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/activate.py", line 283, in main
    self.activate(args)
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/activate.py", line 211, in activate
    activate_bluestore(lvs, args.no_systemd, getattr(args, 'no_tmpfs', False))
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/devices/lvm/activate.py", line 73, in activate_bluestore
    encryption_utils.set_dmcrypt_no_workqueue()
  File "/nix/store/mp66xzqrhjy86jmbzm98lrbjcxcwk6iv-ceph-18.2.4/lib/python3.11/site-packages/ceph_volume-1.0.0-py3.11.egg/ceph_volume/util/encryption.py", line 54, in set_dmcrypt_no_workqueue
    raise RuntimeError('Error while checking cryptsetup version.\n',
RuntimeError: ('Error while checking cryptsetup version.\n', '`cryptsetup --version` output:\n', 'cryptsetup 2.7.3 flags: UDEV BLKID KEYRING KERNEL_CAPI HW_OPAL ')

benaryorg commented 1 month ago

Arghs, yes, it does need this patch additionally: https://github.com/ceph/ceph/commit/607eb34b2c278566c386efcbf3018629cf08ccfd

Otherwise it will still use .match() instead of .search(); the former only matches at the start of the string, while the latter finds the pattern anywhere in it. I'll pull in that patch too.

Edit: as someone who is aware of computational complexity and backtracking in regular expressions: the change described as "the regex pattern to more accurately capture version numbers" captures exactly the same version numbers, just with a lot more backtracking and no word boundaries, so it's literally the match/search difference that's relevant (although I appreciate the added tests). I just felt like I had to say that somewhere, because the old regex actually captured the pattern, while the new one is barely more than [0-9.]+.
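
The match/search difference can be reproduced with a short sketch. The pattern and the minimum version below are hypothetical and chosen only for illustration; they are not the ones used by ceph-volume.

```python
import re

# The line logged by `cryptsetup --version` in this issue.
raw = "cryptsetup 2.7.3 flags: UDEV BLKID KEYRING KERNEL_CAPI HW_OPAL "

# Hypothetical pattern for illustration; not the regex from the Ceph patch.
pattern = re.compile(r"(\d+)\.(\d+)\.(\d+)")

# match() only tries position 0, and the line starts with "cryptsetup".
print(pattern.match(raw))  # None

# search() scans the whole line and finds the version in the middle.
found = pattern.search(raw)
print(found.group(0))  # 2.7.3

# Compare as integer tuples so that e.g. 2.10.0 sorts after 2.7.3.
version = tuple(int(part) for part in found.groups())
hypothetical_minimum = (2, 3, 4)  # assumption, not taken from the Ceph source
print(version >= hypothetical_minimum)  # True
```

With `match()` the lookup yields `None`, so any code that treats a missing match as an error would fail on this input, which is consistent with the `RuntimeError` in the second traceback.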

benaryorg commented 1 month ago

[Image: Grafana graphs from the Prometheus metrics of Ceph showing the version numbers going from 18.2.1 to 18.2.4]

diff --git a/config/host/haskell.home.bsocat.net/ceph.nix b/config/host/haskell.home.bsocat.net/ceph.nix
index 2106139..f860060 100644
--- a/config/host/haskell.home.bsocat.net/ceph.nix
+++ b/config/host/haskell.home.bsocat.net/ceph.nix
@@ -99,6 +99,43 @@
             builtins.listToAttrs
           ];

+      nixpkgs.overlays = lib.mkAfter
+      [
+        (final: prev: let
+          new_patches = [
+            # Fixes mgr not being able to import `packaging` due to autotools >= 70.
+            # Remove once https://github.com/ceph/ceph/pull/58624 is merged, see
+            # https://github.com/NixOS/nixpkgs/pull/330226#issuecomment-2268421031
+            (final.fetchpatch {
+              url = "https://github.com/ceph/ceph/commit/8da2d857fa8fdfedd7aad0ca90e1780a3ed085c9.patch";
+              name = "ceph-mgr-python-fix-packaging-import.patch";
+              hash = "sha256-3Yl1X6UfTf0XCXJxgRnM/Js9sz8tS+hsqViY6gDExoI=";
+            })
+
+            # Fixes cryptesetup version parsing regex, see
+            # * https://github.com/NixOS/nixpkgs/issues/334227
+            # * https://www.mail-archive.com/ceph-users@ceph.io/msg26309.html
+            # * https://github.com/ceph/ceph/pull/58997
+            # Remove once we're on the next version of Ceph 18, when this should be in:
+            # https://github.com/ceph/ceph/pull/58997
+            (final.fetchpatch {
+              url = "https://github.com/ceph/ceph/commit/6ae874902b63652fa199563b6e7950cd75151304.patch";
+              name = "ceph-reef-ceph-volume-fix-set_dmcrypt_no_workqueue.patch";
+              hash = "sha256-r+7hcCz2WF/rJfgKwTatKY9unJlE8Uw3fmOyaY5jVH0=";
+            })
+            (final.fetchpatch {
+              url = "https://github.com/ceph/ceph/commit/607eb34b2c278566c386efcbf3018629cf08ccfd.patch";
+              name = "ceph-reef-ceph-volume-fix-set_dmcrypt_no_workqueue-regex.patch";
+              hash = "sha256-q28Q7OIyFoMyMBCPXGA+AdNqp+9/6J/XwD4ODjx+JXY=";
+            })
+          ];
+        in
+          {
+            ceph = prev.ceph.overrideAttrs ({ patches ? [], ... }: { patches = patches ++ new_patches; });
+          }
+        )
+      ];
+
       benaryorg.prometheus.client.mocks.ceph =
       {
         port = 9283;

Looks like everything works with those three patches!

nh2 commented 1 month ago

Thanks for testing, I merged the fixes in PR #334275!

benaryorg commented 1 month ago

@nh2 Looking at the patches currently on the master branch, I do not see the third patch, which is required to make this work, as mentioned in both of my prior comments. Am I missing something, or was the third patch omitted by mistake?

nh2 commented 1 month ago

@benaryorg You're right, I missed the second patch, because the commit messages look so similar.

I'll make a PR to fix it.