Closed bazaah closed 1 year ago
Got my first successful build of v18.1.1 with the moral equivalent of https://github.com/ceph/ceph/pull/52119 + https://github.com/ceph/ceph/pull/51737, minus lots of in-tree patches that are no longer relevant
Check failed, but not too worried about that for the moment
Switching to ninja seems to trigger some sort of infinite loop in the build somewhere, continuously reading something. Not sure what is going on, but leaving that alone for now
First stable version has been released: https://github.com/ceph/ceph/tree/v18.2.0
Ran into a fmt compile error; it seems I need to implement the fmtlib specialization for ceph_le<T>. Need to investigate this some more, maybe find prior art I can use
Ran into a lot of fmt compile errors this weekend. Got to 81% in the build, but there is more work yet to be done.
First ever successful build of v18.2 just completed. There are likely going to be lots of "fun" tests to fix, but I'm happy to say that a build with -DWITH_RBD_RWL=ON completed.
The following tests FAILED:
10 - run-tox-mgr-dashboard-lint (Failed)
22 - run-tox-cephadm (Failed)
142 - check-generated.sh (Failed)
161 - unittest_erasure_code_shec_arguments (Failed)
179 - unittest_bluefs (Subprocess aborted)
The last two are the most troubling. The first two seem like entirely failed lints (from the newer pylint), and the 3rd I'm not sure of yet
I have fixes for check-generated.sh, and "fixed" (re-disabled) the lints in the first two.
However, I think the last two are serious, and caused by some change either in gcc or boost. Need more time to investigate them.
As a side note, if anyone else is interested I've pushed a cleaned up patch for the fmtlib fixes in https://github.com/bazaah/aur-ceph/commit/aa4476ac3a7ba726972c7ec0258e032655355de7, so you can now build v18.2.0 from the feature/v18.2.0-1 branch yourself.
So, either https://github.com/ceph/ceph/commit/844260f3a2a065298c94ceee8c1d9774fdbf825d or https://github.com/ceph/ceph/commit/25951434666c339e310df8fe2d1b0dd651d28fff causes the regression in unittest_erasure_code_shec_arguments. Unsure which, and maybe it's both somehow. Confirmed to be the second; not sure what the issue is yet.
EDIT:
diff --git a/src/test/erasure-code/TestErasureCodeShec_arguments.cc b/src/test/erasure-code/TestErasureCodeShec_arguments.cc
index 075c6383eed..74403eaf6ed 100644
--- a/src/test/erasure-code/TestErasureCodeShec_arguments.cc
+++ b/src/test/erasure-code/TestErasureCodeShec_arguments.cc
@@ -86,12 +86,12 @@ void create_table_shec432() {
continue;
}
if (std::popcount(avails) == 4) {
- auto a = to_array<std::initializer_list<int>>({
+ std::vector<std::initializer_list<int>> a = {
{0,1,2,3}, {0,1,2,4}, {0,1,2,6}, {0,1,3,4}, {0,1,3,6}, {0,1,4,6},
{0,2,3,4}, {0,2,3,5}, {0,2,4,5}, {0,2,4,6}, {0,2,5,6}, {0,3,4,5},
{0,3,4,6}, {0,3,5,6}, {0,4,5,6}, {1,2,3,4}, {1,2,3,5}, {1,2,4,5},
{1,2,4,6}, {1,2,5,6}, {1,3,4,5}, {1,3,4,6}, {1,3,5,6}, {1,4,5,6},
- {2,3,4,5}, {2,4,5,6}, {3,4,5,6}});
+ {2,3,4,5}, {2,4,5,6}, {3,4,5,6}};
if (ranges::any_of(a, std::bind_front(cmp_equal<uint, int>, avails),
getint)) {
vec.push_back(avails);
As it turns out, trying to cast an std::initializer_list to an array is undefined behavior. std::vector actually has a constructor for this, so use it instead.
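For anyone hitting something similar: a std::initializer_list is only a lightweight view, and copying it does not copy the elements it refers to, so one way to sidestep lifetime questions altogether is to keep such tables in an owning container. The sketch below is not the upstream fix. It uses a nested std::array with a shortened table, and mask_of and is_known_combo are hypothetical helper names chosen for illustration.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: turn four chunk indices into a bitmask,
// e.g. {0,1,2,3} -> 0b1111.
constexpr unsigned mask_of(const std::array<int, 4>& idx) {
  unsigned m = 0;
  for (std::size_t i = 0; i < idx.size(); ++i) {
    m |= 1u << idx[i];
  }
  return m;
}

// Owning storage: the ints live inside the arrays themselves, so nothing
// refers to a temporary backing array. Shortened table, for illustration only.
constexpr std::array<std::array<int, 4>, 3> kCombos{{
    {0, 1, 2, 3}, {0, 1, 2, 4}, {3, 4, 5, 6}}};

// Returns true when `avails` matches the mask of one stored combination.
constexpr bool is_known_combo(unsigned avails) {
  for (std::size_t i = 0; i < kCombos.size(); ++i) {
    if (mask_of(kCombos[i]) == avails) {
      return true;
    }
  }
  return false;
}
```

Here is_known_combo(0b1111) holds because {0,1,2,3} is in the table. The same shape should work with std::vector<std::array<int, 4>> when the table size is not known at compile time.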
Promising solution in https://tracker.ceph.com/issues/58759 for unittest_bluefs
Right, I'm moving to integration testing (= upgrading from v17 + standing up a new v18 cluster).
Found this issue https://github.com/pyca/cryptography/issues/9016, and it seems to be a problem beyond ceph: somehow python-cryptography (and other modules?) are attempting to initialize the rust bindings (?) multiple times which has been disallowed for soundness (?) reasons.
I don't know if this is even fixable on my end, as it doesn't seem to be a ceph-specific issue. I'd have to completely isolate the python stack (e.g. build + somehow run in a venv)
The NOTIFY_TYPES messages seem legit... the modules don't define an attr like that in v18.2.0, mostly. It also doesn't seem to technically be an issue, as the code that checks for this has its error ignored... so not sure what's up.
Found this issue pyca/cryptography#9016, and it seems to be a problem beyond ceph: somehow python-cryptography (and other modules?) are attempting to initialize the rust bindings (?) multiple times which has been disallowed for soundness (?) reasons.
Following up on this, it seems the PyO3 maintainer has effectively decided to flat out restrict usage of PyO3 modules in embedded / multi-interpreter contexts, such as the one in the ceph-mgr machinery, per https://github.com/PyO3/pyo3/discussions/2346#discussioncomment-3246505. This is somewhat irritating and effectively turns any module with PyO3 in its dep tree into a bomb.
So. I'm likely going to lift all of this context into its own issue, and move forward with the v18 release, as I do not see a realistic method for fixing this myself.
I'd need to (either):
python-cryptography from all mgr related python code
1 and 2 ultimately run into the same issue: eventually I will be forced to upgrade to some version of something that depends on >=0.17.0 of PyO3, and 3 seems untenable (https://github.com/PyO3/pyo3/discussions/2346#discussioncomment-2911159):
... The extensive redesign seems intractable ...
This is a tracking issue for me to collect my thoughts / notes around the push for v18
Most notably, I likely will not be actually pushing a v18 build to AUR for at least a few patch versions, as typically Ceph finds+fixes serious issues shortly after a public release of the new version.
v18.1.x (TEST)
Check
v18.2.x (RELEASE)
Experiments
cmake -G (upstream has since v17.2.5)
-DWITH_RBD_RWL=ON for the writeback cache (#18)
Fixes
lvm2 in pkg.ceph-volume (1)
pkg.ceph-mgr (1, 2)
/etc/sudoers.d/ fakeroot perms (1)
Tests