Closed drew-parsons closed 1 year ago
A sample backtrace looks like
(experimental_i386-dchroot)barriere$ mpiexec -n 2 ./demo_poisson -start_in_debugger
PETSC: Attaching gdb to ./demo_poisson of pid 5638 on display :0.0 on machine barriere
PETSC: Attaching gdb to ./demo_poisson of pid 5639 on display :0.0 on machine barriere
Unable to start debugger in xterm: No such file or directory
Unable to start debugger in xterm: No such file or directory
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in unknown file (null)
To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05638] *** Process received signal ***
[barriere:05638] Signal: Aborted (6)
[barriere:05638] Signal code: (-6)
[barriere:05638] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f32090]
[barriere:05638] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f32069]
[barriere:05638] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f00f36]
[barriere:05638] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5ee9312]
[barriere:05638] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf653dd26]
[barriere:05638] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf653a3b0]
[barriere:05638] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf653e790]
[barriere:05638] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf653e979]
[barriere:05638] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f32080]
[barriere:05638] [ 9] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xd7f)[0xf7ead68f]
[barriere:05638] [10] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ebeb9d]
[barriere:05638] [11] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ebece9]
[barriere:05638] [12] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7dff8b5]
[barriere:05638] [13] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7e96d63]
[barriere:05638] [14] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fdab)[0xf7dfedab]
[barriere:05638] [15] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7dff7c8]
[barriere:05638] [16] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7dff83f]
[barriere:05638] [17] ./demo_poisson(+0x19953)[0x565ed953]
[barriere:05638] [18] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5eeafd6]
[barriere:05638] [19] ./demo_poisson(+0x18451)[0x565ec451]
[barriere:05638] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in unknown file (null)
To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05639] *** Process received signal ***
[barriere:05639] Signal: Aborted (6)
[barriere:05639] Signal code: (-6)
[barriere:05639] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f66090]
[barriere:05639] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f66069]
[barriere:05639] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f34f36]
[barriere:05639] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5f1d312]
[barriere:05639] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf6571d26]
[barriere:05639] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf656e3b0]
[barriere:05639] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf6572790]
[barriere:05639] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf6572979]
[barriere:05639] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f66080]
[barriere:05639] [ 9] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x4cd6)[0xf1354cd6]
[barriere:05639] [10] /usr/lib/i386-linux-gnu/libopen-pal.so.40(opal_progress+0x30)[0xf4dcde70]
[barriere:05639] [11] /usr/lib/i386-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0xf4dd4a5d]
[barriere:05639] [12] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x236)[0xf7a1b2c6]
[barriere:05639] [13] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbb)[0xf7a73b2b]
[barriere:05639] [14] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0xf7)[0xf7a77b67]
[barriere:05639] [15] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_do_this+0x11d)[0xf11b28ed]
[barriere:05639] [16] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x99)[0xf11adca9]
[barriere:05639] [17] /usr/lib/i386-linux-gnu/libmpi.so.40(MPI_Alltoall+0x182)[0xf7a2f6d2]
[barriere:05639] [18] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx3MPI10all_to_allIxEENS_5graph13AdjacencyListIT_EEP19ompi_communicator_tRKS5_+0x15d)[0xf7eb95ed]
[barriere:05639] [19] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xdd6)[0xf7ee16e6]
[barriere:05639] [20] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ef2b9d]
[barriere:05639] [21] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ef2ce9]
[barriere:05639] [22] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7e338b5]
[barriere:05639] [23] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7ecad63]
[barriere:05639] [24] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fbb2)[0xf7e32bb2]
[barriere:05639] [25] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7e337c8]
[barriere:05639] [26] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7e3383f]
[barriere:05639] [27] ./demo_poisson(+0x19953)[0x5658d953]
[barriere:05639] [28] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5f1efd6]
[barriere:05639] [29] ./demo_poisson(+0x18451)[0x5658c451]
[barriere:05639] *** End of error message ***
Further debugging shows the error is in compute_nonlocal_dual_graph() in graphbuild.cpp: https://github.com/FEniCS/dolfinx/blob/02f35afa956ee2fc26284d529591c2589bf4d35e/cpp/dolfinx/mesh/graphbuild.cpp#L138
In a two-process run on i386, the unmatched_facets loop here is skipped by one process and entered by the other. But the values are pos[dest]=0 and max_num_vertices_per_facet=-1, so of course it's crashing on buffer[-1].
max_num_vertices_per_facet=-1 does not sound correct. It's set at https://github.com/FEniCS/dolfinx/blob/02f35afa956ee2fc26284d529591c2589bf4d35e/cpp/dolfinx/mesh/graphbuild.cpp#L97
With the explicit minus sign there, was buffer_global_min expected to have a negative value? Evidently on i386 running 2 processes, it has buffer_global_min[0]=1.
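For what it's worth, an explicit minus sign in a reduction like this usually means a maximum is being computed through a minimum reduction, so that several min/max quantities can share one MPI_Allreduce call. The sketch below is pure Python with the ranks simulated as list entries. It is an assumption about the intent of that line, not a copy of graphbuild.cpp, and the function name is made up. It shows that the recovered maximum only makes sense when the reduced value is non-positive, so a reduced value of 1 gives exactly the -1 seen here.

```python
def max_via_min_reduction(per_rank_values):
    """Recover max(values) from a MIN reduction over negated values.

    per_rank_values stands in for one contribution per MPI rank; the
    built-in min() stands in for MPI_Allreduce(..., MPI_MIN, ...).
    """
    negated = [-v for v in per_rank_values]   # what each rank would send
    global_min = min(negated)                 # result of the MIN reduction
    return -global_min                        # undo the negation


# Two ranks that each see facets with 2 vertices (e.g. triangle edges).
print(max_via_min_reduction([2, 2]))   # 2

# If the reduced buffer somehow holds +1 (as observed on i386), undoing
# the negation gives a negative "maximum", which is unusable as a stride.
observed_global_min = 1
print(-observed_global_min)            # -1
```

So the question becomes why the reduced value is positive on i386 in the first place, rather than why the later indexing fails.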
There are other Python test failures on 32-bit machines; I'm not certain whether it's the same underlying problem. In C++ only demo_poisson_mpi is failing, while in Python demo_helmholtz_2d.py, static-condensation-elasticity.py and demo_poisson.py all fail. See for example https://ci.debian.net/data/autopkgtest/testing/i386/f/fenics-dolfinx/16183257/log.gz
Python unit tests give other errors:
______________________________ test_cffi_assembly ______________________________
    @skip_if_complex
    def test_cffi_assembly():
        mesh = UnitSquareMesh(MPI.COMM_WORLD, 13, 13)
        V = FunctionSpace(mesh, ("Lagrange", 1))
        ...
        ptrA = ffi.cast("intptr_t", ffi.addressof(lib, "tabulate_tensor_poissonA"))
        integrals = {IntegralType.cell: ([(-1, ptrA)], None)}
>       a = cpp.fem.Form([V._cpp_object, V._cpp_object], integrals, [], [], False)
E       RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)
and
______________________ test_compute_closest_entity_2d[0] _______________________
dim = 0

    @pytest.mark.parametrize("dim", [0, 1, 2])
    def test_compute_closest_entity_2d(dim):
        p = numpy.array([-1.0, -0.01, 0.0])
        mesh = UnitSquareMesh(MPI.COMM_WORLD, 15, 15)
        tree = BoundingBoxTree(mesh, dim)
        entity, distance = compute_closest_entity(tree, p, mesh)
        ...
        entities = compute_collisions_point(tree, p_c)
        ...
        if len(entities) > 0:
>           assert numpy.isin(entity, entities)
E           assert array(False)
E            +  where array(False) = <function isin at 0xed23d0b8>(0, array([134]))
E            +    where <function isin at 0xed23d0b8> = numpy.isin

python/test/unit/geometry/test_bounding_box_tree.py:295: AssertionError
The latter problem can be tested by hand (running the command manually). entities contains the same element array([134]) found on amd64, but entity gets set to 0 instead of 134. It's possibly relevant that on i386 entities is set without dtype, array([134]), while on amd64 it gets a specific dtype, array([134], dtype=int32).
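The missing dtype is probably a display detail rather than a real type difference. As far as I know, NumPy leaves the dtype out of an integer array's repr when it matches the platform default integer, which on Linux has followed the C long: 4 bytes on i386, 8 bytes on amd64. Under that assumption an int32 array prints as array([134]) on i386 and as array([134], dtype=int32) on amd64. A stdlib-only sketch of the rule follows. The helper name is hypothetical, and the rule itself is my reading of NumPy's behaviour rather than something checked in the i386 chroot.

```python
import struct


def int32_repr_shows_dtype(default_int_bytes):
    """True if an int32 array would print with an explicit dtype.

    Assumes the repr omits the dtype only when the array's integer type
    has the same width as the platform default integer.
    """
    return default_int_bytes != 4


# Width of a C long on the machine running this script.
native_long_bytes = struct.calcsize("l")

print("C long here:", native_long_bytes, "bytes")
print("i386  (long = 4 bytes):", int32_repr_shows_dtype(4))
print("amd64 (long = 8 bytes):", int32_repr_shows_dtype(8))
```

If that is right, the dtype is a red herring and the real difference is entity being 0 instead of 134.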
@drew-parsons is this still happening for 0.7.0?
I'm waiting for Debian to process dolfinx 0.7.0; I will be able to say after that.
I guess the problem has cleared now. demo_poisson_mpi_2 and demo_poisson_mpi_3 are passing now. I'll run the next builds without skipping them.
There was an armel error in gjk similar to the one reported in https://github.com/FEniCS/dolfinx/issues/1104, but the tests mentioned here seem to be passing.
Great, thanks. We can close this one too then!
dolfinx 0.3.0 is failing to build on 32-bit architectures (i386, armhf, armel), see https://buildd.debian.org/status/package.php?p=fenics-dolfinx&suite=experimental e.g. i386 https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-3&stamp=1633022115&raw=0
There is a segfault, apparently triggered in openmpi's mca_btl_vader.so (backtrace reported at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995599 ).
The point of handover from dolfinx to MPI before the segfault is mesh/graphbuild.cpp, lines 143-144:
https://github.com/FEniCS/dolfinx/blob/afbd8bd6361acdc226ef6f81b8d814e5914f504c/cpp/dolfinx/mesh/graphbuild.cpp#L143
Noting that the segfault is happening on 32-bit arches, and the dolfinx code is using int64_t to index the MPI buffers, could this be the origin of the segfault? Or is it more likely to be some other bug in the OpenMPI implementation (in vader)?
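One thing that can be said independently of the architecture: the count and displacement arguments of the classic MPI calls are plain C int, which is 32 bits on both i386 and amd64. So 64-bit offsets only become a problem at the point where they are narrowed. The sketch below (pure Python; the helper names and the sample offsets are hypothetical, not taken from dolfinx) shows which values survive that narrowing and what the wrapped value looks like when they don't.

```python
INT_MAX = 2**31 - 1  # C int is 32 bits on both i386 and amd64 Linux


def fits_in_c_int(value):
    """True if value can be passed as an MPI count/displacement unchanged."""
    return -INT_MAX - 1 <= value <= INT_MAX


def narrow_to_c_int(value):
    """Wrap value the way a 32-bit two's-complement conversion would."""
    wrapped = value & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped > INT_MAX else wrapped


# Hypothetical 64-bit offsets: small ones pass through, 2**31 wraps negative.
for offset in [0, 4, 2**31]:
    print(offset, fits_in_c_int(offset), narrow_to_c_int(offset))
```

For a mesh as small as the one in demo_poisson the offsets are nowhere near that limit, which makes a pure overflow of the indices look unlikely as the cause.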