bazaah / aur-ceph

Public workspace for ceph packages in the Archlinux AUR
https://aur.archlinux.org/pkgbase/ceph

Rebuild for v18.2.x #16

Closed bazaah closed 1 year ago

bazaah commented 1 year ago

This is a tracking issue for me to collect my thoughts / notes around the push for v18

Most notably, I likely will not push a v18 build to the AUR for at least a few patch versions, as Ceph typically finds and fixes serious issues shortly after the public release of a new version.


v18.1.x (TEST)


v18.2.x (RELEASE)


Experiments

Fixes

Tests

bazaah commented 1 year ago

Got my first successful build of v18.1.1 with the moral equivalent of https://github.com/ceph/ceph/pull/52119 + https://github.com/ceph/ceph/pull/51737, minus lots of in-tree patches that are no longer relevant

bazaah commented 1 year ago

Check failed, but not too worried about that for the moment

bazaah commented 1 year ago

Switching to ninja seems to trigger some sort of infinite loop in the build somewhere, continuously reading something. Not sure what is going on, but leaving that alone for now

bazaah commented 1 year ago

First stable version has released: https://github.com/ceph/ceph/tree/v18.2.0

bazaah commented 1 year ago

Ran into a fmt compile error, seems I need to implement the fmtlib specialization for ceph_le<T>. Need to investigate this some more, maybe find prior art I can use

bazaah commented 1 year ago

ran into a lot of fmt compile errors this weekend. Got to 81% in the build, but more work yet to be done.

bazaah commented 1 year ago

First ever successful build of v18.2 just completed. There are likely going to be lots of "fun" tests to fix, but I'm happy to say that a build with -DWITH_RBD_RWL=ON completed.

bazaah commented 1 year ago
The following tests FAILED:
         10 - run-tox-mgr-dashboard-lint (Failed)
         22 - run-tox-cephadm (Failed)
        142 - check-generated.sh (Failed)
        161 - unittest_erasure_code_shec_arguments (Failed)
        179 - unittest_bluefs (Subprocess aborted)

The last two are the most troubling. The first two look like pure lint failures (from the newer pylint), and I'm not sure about the third yet

bazaah commented 1 year ago

I have fixes for check-generated.sh, and "fixed" (re-disabled) the lints in the first two.

However, I think the last two are serious, and caused by some change in either gcc or boost. Need more time to investigate them.

bazaah commented 1 year ago

As a side note, if anyone else is interested I've pushed a cleaned up patch for the fmtlib fixes in https://github.com/bazaah/aur-ceph/commit/aa4476ac3a7ba726972c7ec0258e032655355de7, so you can now build v18.2.0 from the feature/v18.2.0-1 branch yourself.

bazaah commented 1 year ago

So, either https://github.com/ceph/ceph/commit/844260f3a2a065298c94ceee8c1d9774fdbf825d or https://github.com/ceph/ceph/commit/25951434666c339e310df8fe2d1b0dd651d28fff causes the regression in unittest_erasure_code_shec_arguments. Unsure which; maybe it's both somehow. Confirmed to be the second; not sure what the issue is yet


EDIT:

diff --git a/src/test/erasure-code/TestErasureCodeShec_arguments.cc b/src/test/erasure-code/TestErasureCodeShec_arguments.cc
index 075c6383eed..74403eaf6ed 100644
--- a/src/test/erasure-code/TestErasureCodeShec_arguments.cc
+++ b/src/test/erasure-code/TestErasureCodeShec_arguments.cc
@@ -86,12 +86,12 @@ void create_table_shec432() {
           continue;
         }
         if (std::popcount(avails) == 4) {
-         auto a = to_array<std::initializer_list<int>>({
+         std::vector<std::initializer_list<int>> a = {
              {0,1,2,3}, {0,1,2,4}, {0,1,2,6}, {0,1,3,4}, {0,1,3,6}, {0,1,4,6},
              {0,2,3,4}, {0,2,3,5}, {0,2,4,5}, {0,2,4,6}, {0,2,5,6}, {0,3,4,5},
              {0,3,4,6}, {0,3,5,6}, {0,4,5,6}, {1,2,3,4}, {1,2,3,5}, {1,2,4,5},
              {1,2,4,6}, {1,2,5,6}, {1,3,4,5}, {1,3,4,6}, {1,3,5,6}, {1,4,5,6},
-             {2,3,4,5}, {2,4,5,6}, {3,4,5,6}});
+             {2,3,4,5}, {2,4,5,6}, {3,4,5,6}};
           if (ranges::any_of(a, std::bind_front(cmp_equal<uint, int>, avails),
                             getint)) {
            vec.push_back(avails);

As it turns out, trying to cast a std::initializer_list to an array is undefined behavior. std::vector has a constructor that takes one, so use that instead.
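For anyone hitting something similar: a std::initializer_list is only a lightweight, non-owning handle onto a compiler-generated array, and copying one does not copy the elements. Below is a minimal sketch of a table that owns its rows instead, which sidesteps the lifetime question entirely. This is not the code from the Ceph test; the names to_mask, kTable and matches_any are made up for illustration, and the table is cut down to three rows.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: fold a list of chunk indices into a bitmask.
unsigned to_mask(const std::vector<int>& chunks) {
  unsigned mask = 0;
  for (int c : chunks) {
    mask |= 1u << c;
  }
  return mask;
}

// Owning table (illustrative subset): each inner std::vector copies its
// elements at construction, so nothing refers back to the braced lists
// once this statement has finished.
const std::vector<std::vector<int>> kTable = {
    {0, 1, 2, 3}, {0, 1, 2, 4}, {3, 4, 5, 6}};

// True when `avails` equals the bitmask of any row in the table.
bool matches_any(unsigned avails) {
  return std::any_of(kTable.begin(), kTable.end(),
                     [avails](const std::vector<int>& row) {
                       return to_mask(row) == avails;
                     });
}
```

With this layout, matches_any(to_mask({0, 1, 2, 3})) is true and matches_any(7u) is false. The cost is one heap allocation per row; a std::array of std::array<int, 4> would avoid that if every row has the same length.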

bazaah commented 1 year ago

Promising solution in https://tracker.ceph.com/issues/58759 for unittest_bluefs

bazaah commented 1 year ago

Right, I'm moving to integration testing (= upgrading from v17 + standing up a new v18 cluster).

bazaah commented 1 year ago

Found this issue https://github.com/pyca/cryptography/issues/9016, and it seems to be a problem beyond ceph: somehow python-cryptography (and other modules?) are attempting to initialize the rust bindings (?) multiple times which has been disallowed for soundness (?) reasons.

bazaah commented 1 year ago

I don't know if this is even fixable on my end, as it doesn't seem to be a ceph-specific issue. I'd have to completely isolate the python stack (e.g. build + somehow run in a venv)

bazaah commented 1 year ago

The NOTIFY_TYPES messages seem legit... the modules mostly don't define an attr like that in v18.2.0. It also doesn't seem to technically be an issue, as the code that checks for this has its error ignored... so not sure what's up.

bazaah commented 1 year ago

> Found this issue pyca/cryptography#9016, and it seems to be a problem beyond ceph: somehow python-cryptography (and other modules?) are attempting to initialize the rust bindings (?) multiple times which has been disallowed for soundness (?) reasons.

Following up on this, it seems the PyO3 maintainer has effectively decided to flat out restrict usage of PyO3 modules in embedded / multi-interpreter contexts, such as the ceph-mgr machinery, per https://github.com/PyO3/pyo3/discussions/2346#discussioncomment-3246505. This is somewhat irritating and effectively turns any module with PyO3 in its dep tree into a bomb.
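To make the failure mode concrete, here is a toy model of a "may only be initialized once per process" guard, assuming the restriction behaves roughly like a process-wide flag. It is not PyO3 code and not how ceph-mgr is implemented; OnceOnlyModule and try_import are hypothetical names, and each try_import call stands in for one (sub)interpreter importing the same extension module.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Toy stand-in for an extension module's init hook; not PyO3 code.
class OnceOnlyModule {
 public:
  // Succeeds for the first caller in the process, throws for every later one.
  static void init() {
    bool expected = false;
    if (!initialized_.compare_exchange_strong(expected, true)) {
      throw std::runtime_error(
          "module may only be initialized once per process");
    }
  }

 private:
  static std::atomic<bool> initialized_;  // process-wide, shared by all callers
};

std::atomic<bool> OnceOnlyModule::initialized_{false};

// Hypothetical: simulates one (sub)interpreter importing the module.
// Returns true if the import succeeded, false if the guard rejected it.
bool try_import() {
  try {
    OnceOnlyModule::init();
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

The first try_import() succeeds and every later one fails, no matter which interpreter asks. In this model the flag lives at process scope, so a host that creates several interpreters in one process trips it as soon as a second one imports the module.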

bazaah commented 1 year ago

So. I'm likely going to lift all of this context into its own issue, and move forward with the v18 release, as I do not see a realistic method for fixing this myself.

I'd need to (either):

  1. Remove python-cryptography from all mgr related python code
  2. Create and maintain an extension to the ceph build for a stable venv, and ensure we use old enough versions of the affected modules to avoid hitting https://github.com/PyO3/pyo3/commit/78ba70d2b4cdae1228561700bab62da793801d18
  3. Somehow work with the PyO3 maintainer (or python-cryptography) to fix this on their end

1 and 2 ultimately run into the same issue: eventually I will be forced to upgrade to some version of something that depends on >=0.17.0 of PyO3, and 3 seems untenable (https://github.com/PyO3/pyo3/discussions/2346#discussioncomment-2911159):

... The extensive redesign seems intractable ...