FEniCS / dolfinx

Next generation FEniCS problem solving environment
https://fenicsproject.org
GNU Lesser General Public License v3.0

build fails on 32-bit architectures: compute_nonlocal_dual_graph: max_num_vertices_per_facet=-1 #1735

Closed drew-parsons closed 1 year ago

drew-parsons commented 3 years ago

dolfinx 0.3.0 is failing to build on 32-bit architectures (i386, armhf, armel); see https://buildd.debian.org/status/package.php?p=fenics-dolfinx&suite=experimental and, for example, the i386 log at https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-3&stamp=1633022115&raw=0

There is a segfault, apparently triggered in openmpi's mca_btl_vader.so (backtrace reported at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995599 ).

The point of handover from dolfinx to MPI before the segfault is mesh/graphbuild.cpp l.143-144

  graph::AdjacencyList<std::int64_t> recvd_buffer
      = dolfinx::MPI::all_to_all(comm, send_buffer);

https://github.com/FEniCS/dolfinx/blob/afbd8bd6361acdc226ef6f81b8d814e5914f504c/cpp/dolfinx/mesh/graphbuild.cpp#L143

Given that the segfault happens on 32-bit arches and the dolfinx code uses int64_t to index the MPI buffers, could this be the origin of the segfault? Or is it more likely some other bug in the OpenMPI implementation (in vader)?

drew-parsons commented 3 years ago

A sample backtrace looks like

(experimental_i386-dchroot)barriere$ mpiexec -n 2 ./demo_poisson -start_in_debugger 
PETSC: Attaching gdb to ./demo_poisson of pid 5638 on display :0.0 on machine barriere
PETSC: Attaching gdb to ./demo_poisson of pid 5639 on display :0.0 on machine barriere
Unable to start debugger in xterm: No such file or directory
Unable to start debugger in xterm: No such file or directory
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in  unknown file (null)
  To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05638] *** Process received signal ***
[barriere:05638] Signal: Aborted (6)
[barriere:05638] Signal code:  (-6)
[barriere:05638] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f32090]
[barriere:05638] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f32069]
[barriere:05638] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f00f36]
[barriere:05638] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5ee9312]
[barriere:05638] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf653dd26]
[barriere:05638] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf653a3b0]
[barriere:05638] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf653e790]
[barriere:05638] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf653e979]
[barriere:05638] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f32080]
[barriere:05638] [ 9] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xd7f)[0xf7ead68f]
[barriere:05638] [10] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ebeb9d]
[barriere:05638] [11] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ebece9]
[barriere:05638] [12] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7dff8b5]
[barriere:05638] [13] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7e96d63]
[barriere:05638] [14] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fdab)[0xf7dfedab]
[barriere:05638] [15] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7dff7c8]
[barriere:05638] [16] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7dff83f]
[barriere:05638] [17] ./demo_poisson(+0x19953)[0x565ed953]
[barriere:05638] [18] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5eeafd6]
[barriere:05638] [19] ./demo_poisson(+0x18451)[0x565ec451]
[barriere:05638] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in  unknown file (null)
  To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05639] *** Process received signal ***
[barriere:05639] Signal: Aborted (6)
[barriere:05639] Signal code:  (-6)
[barriere:05639] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f66090]
[barriere:05639] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f66069]
[barriere:05639] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f34f36]
[barriere:05639] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5f1d312]
[barriere:05639] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf6571d26]
[barriere:05639] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf656e3b0]
[barriere:05639] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf6572790]
[barriere:05639] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf6572979]
[barriere:05639] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f66080]
[barriere:05639] [ 9] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x4cd6)[0xf1354cd6]
[barriere:05639] [10] /usr/lib/i386-linux-gnu/libopen-pal.so.40(opal_progress+0x30)[0xf4dcde70]
[barriere:05639] [11] /usr/lib/i386-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0xf4dd4a5d]
[barriere:05639] [12] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x236)[0xf7a1b2c6]
[barriere:05639] [13] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbb)[0xf7a73b2b]
[barriere:05639] [14] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0xf7)[0xf7a77b67]
[barriere:05639] [15] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_do_this+0x11d)[0xf11b28ed]
[barriere:05639] [16] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x99)[0xf11adca9]
[barriere:05639] [17] /usr/lib/i386-linux-gnu/libmpi.so.40(MPI_Alltoall+0x182)[0xf7a2f6d2]
[barriere:05639] [18] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx3MPI10all_to_allIxEENS_5graph13AdjacencyListIT_EEP19ompi_communicator_tRKS5_+0x15d)[0xf7eb95ed]
[barriere:05639] [19] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xdd6)[0xf7ee16e6]
[barriere:05639] [20] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ef2b9d]
[barriere:05639] [21] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ef2ce9]
[barriere:05639] [22] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7e338b5]
[barriere:05639] [23] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7ecad63]
[barriere:05639] [24] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fbb2)[0xf7e32bb2]
[barriere:05639] [25] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7e337c8]
[barriere:05639] [26] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7e3383f]
[barriere:05639] [27] ./demo_poisson(+0x19953)[0x5658d953]
[barriere:05639] [28] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5f1efd6]
[barriere:05639] [29] ./demo_poisson(+0x18451)[0x5658c451]
[barriere:05639] *** End of error message ***

drew-parsons commented 3 years ago

Further debugging shows the error is in https://github.com/FEniCS/dolfinx/blob/02f35afa956ee2fc26284d529591c2589bf4d35e/cpp/dolfinx/mesh/graphbuild.cpp#L138 compute_nonlocal_dual_graph() in graphbuild.cpp.

In a two-process run on i386, the unmatched_facets loop here is skipped by one process and entered by the other. But in the process that enters it, the values are

pos[dest]=0
max_num_vertices_per_facet=-1

so of course it's crashing on buffer[-1].

max_num_vertices_per_facet=-1 does not sound correct. It's set at https://github.com/FEniCS/dolfinx/blob/02f35afa956ee2fc26284d529591c2589bf4d35e/cpp/dolfinx/mesh/graphbuild.cpp#L97. With the explicit minus sign there, was buffer_global_min expected to have a negative value? Evidently on i386 running 2 processes, it has buffer_global_min[0]=1.
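One common reason for a minus sign like this is the "negate, reduce with MIN, negate back" idiom, which gets a global maximum out of a minimum reduction. Whether graphbuild.cpp does exactly this at that line is an assumption, not something confirmed in this thread. The sketch below is plain Python with no MPI; allreduce_min and global_max_via_min are hypothetical names used only for illustration.

```python
# Hypothetical illustration only: these helpers are not taken from graphbuild.cpp.

def allreduce_min(contributions):
    """Element-wise minimum over ranks, standing in for an MPI MIN all-reduce."""
    return [min(column) for column in zip(*contributions)]


def global_max_via_min(local_maxima):
    """Global maximum from a minimum reduction: negate, reduce, negate back."""
    reduced = allreduce_min([[-value] for value in local_maxima])
    return -reduced[0]


# Two ranks whose facets have at most 2 and 3 vertices respectively.
max_num_vertices_per_facet = global_max_via_min([2, 3])
print(max_num_vertices_per_facet)

# If the reduced slot holds +1 rather than a negated count, the final
# negation produces the -1 reported above.
buffer_global_min = [1]
print(-buffer_global_min[0])
```

Under that assumption, a reduced value of +1 could only come out of a MIN reduction if no rank contributed a negated positive count, so the per-rank values fed into the reduction would be the next thing to inspect.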

drew-parsons commented 2 years ago

There are other Python test failures on 32-bit machines; I'm not certain whether they share the same underlying problem. In C++ only demo_poisson_mpi is failing, while in Python demo_helmholtz_2d.py, static-condensation-elasticity.py and demo_poisson.py all fail. See for example https://ci.debian.net/data/autopkgtest/testing/i386/f/fenics-dolfinx/16183257/log.gz

Python unit tests give other errors:

______________________________ test_cffi_assembly ______________________________

    @skip_if_complex
    def test_cffi_assembly():
        mesh = UnitSquareMesh(MPI.COMM_WORLD, 13, 13)
        V = FunctionSpace(mesh, ("Lagrange", 1))    
...
        ptrA = ffi.cast("intptr_t", ffi.addressof(lib, "tabulate_tensor_poissonA"))
        integrals = {IntegralType.cell: ([(-1, ptrA)], None)}
>       a = cpp.fem.Form([V._cpp_object, V._cpp_object], integrals, [], [], False)
E       RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)

and

______________________ test_compute_closest_entity_2d[0] _______________________

dim = 0

    @pytest.mark.parametrize("dim", [0, 1, 2])
    def test_compute_closest_entity_2d(dim):
        p = numpy.array([-1.0, -0.01, 0.0])
        mesh = UnitSquareMesh(MPI.COMM_WORLD, 15, 15)
        tree = BoundingBoxTree(mesh, dim)
        entity, distance = compute_closest_entity(tree, p, mesh)
...
        entities = compute_collisions_point(tree, p_c)
...
        if len(entities) > 0:
>           assert numpy.isin(entity, entities)
E           assert array(False)
E            +  where array(False) = <function isin at 0xed23d0b8>(0, array([134]))
E            +    where <function isin at 0xed23d0b8> = numpy.isin

python/test/unit/geometry/test_bounding_box_tree.py:295: AssertionError

The latter problem can be tested by hand (running the command manually). Here entities contains the same element array([134]) found on amd64, but entity gets set to 0 instead of 134. It's possibly relevant that on i386 entities is set without dtype,

array([134])

while on amd64 it gets a specific dtype,

array([134], dtype=int32)
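The dtype difference may be cosmetic. NumPy leaves dtype= out of an array's repr when the dtype is the platform's default integer, and on 32-bit Linux that default is probably 32 bits wide, so an int32 array would print bare there. The sketch below is a simplified model of that rule using only the standard library. It does not call NumPy, repr_shows_dtype is a hypothetical helper, and the idea that the default integer follows the width of C long is an assumption about the NumPy build in use.

```python
import ctypes

# Width in bits of C long on this machine. Assumed here to match NumPy's
# default integer on Linux; that mapping is not verified in this thread.
C_LONG_BITS = 8 * ctypes.sizeof(ctypes.c_long)


def repr_shows_dtype(array_bits, default_bits):
    """Simplified model: repr prints dtype= only for non-default integer widths."""
    return array_bits != default_bits


print(C_LONG_BITS)
print(repr_shows_dtype(32, 32))  # i386 case: a 32-bit array with a 32-bit default
print(repr_shows_dtype(32, 64))  # amd64 case: a 32-bit array with a 64-bit default
```

If this model holds, both platforms return the same int32 data in entities, and the wrong entity value of 0 would have to come from somewhere else.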

francesco-ballarin commented 1 year ago

@drew-parsons is this still happening for 0.7.0?

drew-parsons commented 1 year ago

I'm waiting for Debian to process dolfinx 0.7.0; I will be able to say after that.

drew-parsons commented 1 year ago

I guess the problem has cleared now. demo_poisson_mpi_2 and demo_poisson_mpi_3 are passing now. I'll run the next builds without skipping them.

drew-parsons commented 1 year ago

There was an armel error in gjk similar to the one reported in https://github.com/FEniCS/dolfinx/issues/1104, but the tests mentioned here seem to be passing.

francesco-ballarin commented 1 year ago

Great, thanks. We can close this one too then!