Closed drew-parsons closed 6 months ago
I feel like this isn't properly solvable without effort upstream in ADIOS2 to test on big endian architectures. It's always going to be unreliable without testing there.
Perhaps we should disable building against ADIOS2 on big endian systems? Detecting big endianness is easy with C++20's https://en.cppreference.com/w/cpp/types/endian, or with CMake https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_BYTE_ORDER.html.
I disabled the affected tests that way with https://salsa.debian.org/science-team/fenics/fenics-dolfinx/-/blob/master/debian/patches/big_endian_skip_adios2_tests.patch?ref_type=heads
There's also sys.byteorder in Python. That's enough to get the test suite passing on s390x.
I'd suggest doing it that way, disabling at the level of the tests rather than the library itself, and documenting it with a comment somewhere to acknowledge it's not supported. That way anyone actually using big-endian can choose to run VTXWriter knowing that it's not expected to work, and can generate and take a closer look at backtraces. I figure anyone using big-endian will know they have a problematic architecture in general.
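For illustration, a test-level skip keyed on sys.byteorder could look roughly like the sketch below. It uses the standard-library unittest module so that it is self-contained; the actual dolfinx tests and the Debian patch may use a different mechanism (for example a pytest marker). The class name, test name and skip message are hypothetical.

```python
import sys
import unittest

# True on big-endian hosts such as s390x; False on little-endian hosts.
BIG_ENDIAN = sys.byteorder == "big"


class TestVTXOutput(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        BIG_ENDIAN,
        "ADIOS2 is not tested upstream on big-endian; VTXWriter is not expected to work",
    )
    def test_vtx_write(self):
        # Placeholder body: a real test would exercise VTXWriter here.
        self.assertTrue(True)
```

The skip reason doubles as the comment acknowledging that big-endian is unsupported, while leaving the library itself available to anyone who wants to investigate.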
Closing since it's an upstream issue.
Summarize the issue
Adios2 has a difficult relationship with big-endian systems (such as s390x, ppc64). It builds fine, but upstream notes that it is not really tested (though it does pass "most" tests). It builds fine with a default build configuration, but then adios4dolfinx tests cause adios2 to confess it should be built with -DADIOS2_USE_Endian_Reverse=ON, which is fair enough.
After building adios2 on s390x with -DADIOS2_USE_Endian_Reverse=ON, some of the dolfinx demos using VTXWriter fail. This is not surprising. As just noted, adios2 has not been tested upstream for big-endian support. What is surprising is that some demos do not fail.
The existence of failing and non-failing demos suggests it might be possible to identify (and fix) the trigger for failure.
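As background on why byte order matters here, the following is a purely conceptual sketch of what goes wrong when bytes written in one byte order are interpreted in the other, and how reversing them recovers the value. It uses only Python's struct module and is not ADIOS2 code; my understanding is that the Endian_Reverse option enables this kind of byte swapping inside ADIOS2, but that is an assumption, not something verified here.

```python
import struct

value = 1.5  # arbitrary sample value

# The same double serialised with explicit little- and big-endian layouts.
little = struct.pack("<d", value)
big = struct.pack(">d", value)

# The two encodings are byte-for-byte reverses of each other.
assert little == big[::-1]

# Interpreting little-endian bytes as big-endian yields a wrong value...
misread = struct.unpack(">d", little)[0]

# ...while reversing the bytes first recovers the original.
recovered = struct.unpack(">d", little[::-1])[0]

print(misread, recovered)
```

A missed or doubled swap on even one field (an offset, a count, a length) would be enough to produce out-of-range reads, which is one plausible but unconfirmed way for only some demos to segfault.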
In debian build 1:0.7.3-5 I've patched the demos to skip the failing demos on big-endian systems. I give sample output below documenting the segfault without that patch, together with the log of passing demos, extracted from https://ci.debian.net/data/autopkgtest/unstable/s390x/f/fenics-dolfinx/43107848/log.gz
The python demo that fails using VTXWriter is
The python demos that apparently pass using VTXWriter are
Another strange observation is that C++ demo_poisson fails in a similar way, yet the corresponding python demo demo_poisson.py passes without segfault.
The python unit tests io/test_adios2.py all pass.
How to reproduce the bug
-DADIOS2_USE_Endian_Reverse=ON
Minimal Example (Python)
No response
Output (Python)
Version
0.7.3
DOLFINx git commit
No response
Installation
Debian packages.
Additional information
Adios2 2.9.2+dfsg1-13 with patches supporting BP5.