STEllAR-GROUP / octotiger

Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
http://octotiger.stellar-group.org/
Boost Software License 1.0

Octotiger segfault/floating point exception when running close_to_merger level 10 on Perlmutter #473

JiakunYan closed this issue 4 months ago

JiakunYan commented 6 months ago
{stack-trace}:
11 frames:
0x7f53cffee6c8  : hpx::termination_handler(int) [0x160] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/hpx-master-ysftyzq77diiduvnnq534apzokyzlhc5/lib64/libhpx_cored.so
0x7f53c39f0910  : /lib64/libpthread.so.0(+0x16910) [0x7f53c39f0910] in /lib64/libpthread.so.0
0x7f53d2ab1816  : octotiger::fmm::monopole_interactions::cuda_monopole_interaction_interface::compute_interactions(std::vector<double, std::allocator<double> >&, std::vector<std::shared_ptr<std::vector<space_vector_gen<double>, std::allocator<space_vector_gen<double> > > >, std::allocator<std::shared_ptr<std::vector<space_vector_gen<double>, std::allocator<space_vector_gen<double> > > > > >&, std::vector<neighbor_gravity_type, std::allocator<neighbor_gravity_type> >&, gsolve_type, double, std::array<bool, 26ul>&, std::shared_ptr<grid>&, bool) [0x326] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/octotiger-master-uvdy5v5p56wgishtglci3imtsi2oi62c/lib64/libhpx_octolib.so
0x7f53d2acd07a  : octotiger::fmm::monopole_interactions::monopole_kernel_interface(std::vector<double, std::allocator<double> >&, std::vector<std::shared_ptr<std::vector<space_vector_gen<double>, std::allocator<space_vector_gen<double> > > >, std::allocator<std::shared_ptr<std::vector<space_vector_gen<double>, std::allocator<space_vector_gen<double> > > > > >&, std::vector<neighbor_gravity_type, std::allocator<neighbor_gravity_type> >&, gsolve_type, double, std::array<bool, 26ul>&, std::shared_ptr<grid>&, bool) [0xeba] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/octotiger-master-uvdy5v5p56wgishtglci3imtsi2oi62c/lib64/libhpx_octolib.so
0x7f53d289b6f0  : node_server::compute_fmm(gsolve_type, bool, bool) [0x2250] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/octotiger-master-uvdy5v5p56wgishtglci3imtsi2oi62c/lib64/libhpx_octolib.so
0x7f53d28b2beb  : node_server::solve_gravity(bool, bool) [0x13b] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/octotiger-master-uvdy5v5p56wgishtglci3imtsi2oi62c/lib64/libhpx_octolib.so
0x7f53d28dbb90  : std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id> hpx::actions::detail::continuation_thread_function<node_server::solve_gravity_action>::operator()<hpx::threads::thread_restart_state, void>(hpx::threads::thread_restart_state) [0x210] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/octotiger-master-uvdy5v5p56wgishtglci3imtsi2oi62c/lib64/libhpx_octolib.so
0x7f53d28b849d  : std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id> hpx::util::detail::callable_vtable<std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id> (hpx::threads::thread_restart_state)>::_invoke<hpx::actions::detail::continuation_thread_function<node_server::solve_gravity_action> >(void*, hpx::threads::thread_restart_state&&) [0xd] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/octotiger-master-uvdy5v5p56wgishtglci3imtsi2oi62c/lib64/libhpx_octolib.so
0x7f53cff0c47c  : hpx::threads::coroutines::detail::coroutine_impl::operator()() [0xe0] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/hpx-master-ysftyzq77diiduvnnq534apzokyzlhc5/lib64/libhpx_cored.so
0x7f53cff0af38  : /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/hpx-master-ysftyzq77diiduvnnq534apzokyzlhc5/lib64/libhpx_cored.so(+0x374f38) [0x7f53cff0af38] in /global/u1/j/jackyan/workspace/spack/opt/spack/linux-sles15-zen3/gcc-11.2.0/hpx-master-ysftyzq77diiduvnnq534apzokyzlhc5/lib64/libhpx_cored.so
{what}: Floating point exception

So far I have only encountered this with the LCI parcelport on 32 nodes with 4 localities per node. The MPI parcelport runs fine on 32 nodes, but I haven't tested it at a larger scale.

I'm not sure whether this is an issue in Octotiger or in the parcelport layer, but I have implemented a parcel-layer checksum, and all parcels appear to be received correctly.
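For readers unfamiliar with the idea, a parcel-layer checksum can be sketched roughly as below. This is only an illustration and not the code used in this run: the helper names `fnv1a64`, `seal`, and `verify` are hypothetical, the choice of FNV-1a is an assumption, and nothing here uses the HPX or LCI APIs. The sender appends a hash of the serialized bytes, and the receiver recomputes it before handing the parcel to the upper layer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: 64-bit FNV-1a hash over a byte range.
std::uint64_t fnv1a64(const unsigned char* data, std::size_t n) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hypothetical sender side: copy the payload and append its checksum.
std::vector<unsigned char> seal(const std::vector<unsigned char>& payload) {
    std::vector<unsigned char> out(payload);
    const std::uint64_t h = fnv1a64(payload.data(), payload.size());
    unsigned char tail[sizeof h];
    std::memcpy(tail, &h, sizeof h);
    out.insert(out.end(), tail, tail + sizeof h);
    return out;
}

// Hypothetical receiver side: recompute the checksum over the payload
// bytes and compare it with the trailing 8 bytes of the message.
bool verify(const std::vector<unsigned char>& msg) {
    if (msg.size() < sizeof(std::uint64_t)) return false;
    const std::size_t n = msg.size() - sizeof(std::uint64_t);
    std::uint64_t expected;
    std::memcpy(&expected, msg.data() + n, sizeof expected);
    return fnv1a64(msg.data(), n) == expected;
}
```

A check like this only shows that the bytes of each parcel arrived unmodified; it says nothing about how the data is used after deserialization.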

Not sure whether this is related to #471.

JiakunYan commented 6 months ago

Full config

{config}:
Core library:
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=tcmalloc
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_PREFIX (configured)=
  HPX_PREFIX=

  HPX_FILESYSTEM_WITH_BOOST_FILESYSTEM_COMPATIBILITY=OFF
  HPX_ITERATOR_SUPPORT_WITH_BOOST_ITERATOR_TRAVERSAL_TAG_COMPATIBILITY=OFF
  HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
  HPX_WITH_APEX=OFF
  HPX_WITH_ASYNC_MPI=OFF
  HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
  HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON

  HPX_WITH_COROUTINE_COUNTERS=OFF
  HPX_WITH_CUDA=ON
  HPX_WITH_DISTRIBUTED_RUNTIME=ON
  HPX_WITH_DYNAMIC_HPX_MAIN=ON
  HPX_WITH_IO_COUNTERS=ON
  HPX_WITH_IO_POOL=ON
  HPX_WITH_ITTNOTIFY=OFF
  HPX_WITH_LOGGING=ON
  HPX_WITH_NETWORKING=ON
  HPX_WITH_PAPI=OFF
  HPX_WITH_PARALLEL_TESTS_BIND_NONE=OFF
  HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
  HPX_WITH_PARCELPORT_COUNTERS=OFF
  HPX_WITH_PARCELPORT_LCI=ON
  HPX_WITH_PARCELPORT_LCI_LOG=ON
  HPX_WITH_PARCELPORT_LIBFABRIC=OFF
  HPX_WITH_PARCELPORT_MPI=ON
  HPX_WITH_PARCELPORT_MPI_MULTITHREADED=ON
  HPX_WITH_PARCELPORT_TCP=OFF
  HPX_WITH_PARCEL_PROFILING=OFF
  HPX_WITH_SANITIZERS=OFF
  HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
  HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
  HPX_WITH_STACKOVERFLOW_DETECTION=ON
  HPX_WITH_STACKTRACES=ON
  HPX_WITH_STACKTRACES_DEMANGLE_SYMBOLS=ON
  HPX_WITH_STACKTRACES_STATIC_SYMBOLS=OFF
  HPX_WITH_TESTS_DEBUG_LOG=OFF
  HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF

  HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
  HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
  HPX_WITH_THREAD_DEBUG_INFO=OFF
  HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
  HPX_WITH_THREAD_GUARD_PAGE=ON
  HPX_WITH_THREAD_IDLE_RATES=OFF
  HPX_WITH_THREAD_LOCAL_STORAGE=OFF
  HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
  HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
  HPX_WITH_THREAD_STACK_MMAP=ON
  HPX_WITH_THREAD_STEALING_COUNTS=OFF
  HPX_WITH_THREAD_TARGET_ADDRESS=OFF
  HPX_WITH_TIMER_POOL=ON
  HPX_WITH_TUPLE_RVALUE_SWAP=ON
  HPX_WITH_VALGRIND=OFF
  HPX_WITH_VERIFY_LOCKS=ON
  HPX_WITH_VERIFY_LOCKS_BACKTRACE=OFF

Module command_line_handling_local:
  HPX_COMMAND_LINE_HANDLING_WITH_JSON_CONFIGURATION_FILES=OFF

Module coroutines:
  HPX_COROUTINES_WITH_SWAP_CONTEXT_EMULATION=OFF

Module datastructures:
  HPX_DATASTRUCTURES_WITH_ADAPT_STD_TUPLE=OFF
  HPX_DATASTRUCTURES_WITH_ADAPT_STD_VARIANT=OFF

Module logging:
  HPX_LOGGING_WITH_SEPARATE_DESTINATIONS=ON

Module serialization:
  HPX_SERIALIZATION_WITH_ALLOW_CONST_TUPLE_MEMBERS=OFF
  HPX_SERIALIZATION_WITH_ALLOW_RAW_POINTER_SERIALIZATION=OFF

  HPX_SERIALIZATION_WITH_ALL_TYPES_ARE_BITWISE_SERIALIZABLE=OFF
  HPX_SERIALIZATION_WITH_BOOST_TYPES=OFF
  HPX_SERIALIZATION_WITH_SUPPORTS_ENDIANESS=OFF

Module topology:
  HPX_TOPOLOGY_WITH_ADDITIONAL_HWLOC_TESTING=OFF

{version}: V1.10.0-trunk (AGAS: V3.0), Git: unknown
{boost}: V1.82.0
{build-type}: debug

{date}: Dec 25 2023 09:47:47
{platform}: linux
{compiler}: GNU C++ version 11.2.0 20210728 (Cray Inc.)
{stdlib}: GNU libstdc++ version 20210728