STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

SEGFAULT in 1d_stencil_8 on Stampede #1418

Closed: parsa closed this issue 5 years ago

parsa commented 9 years ago

Changeset: 817963e1c74b10c3ee459c4f8455d0d6f470822e
Arguments: --nx 100000 --np 20000 -t 16
Configuration: Debug, Release, RelWithDebInfo; Boost 1.55.0; 2 nodes, 1 locality per node; Stampede
Location: hpx/lcos/promise.hpp:294
Message:

{stack-trace}: 8 frames:
0x2b8a7a13beea  : hpx::termination_handler(int) + 0x1ca in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x3d9ea0f710    : ??? + 0x3d9ea0f710 in /lib64/libpthread.so.0
0x2b8a7a430e36  : ??? + 0x2b8a7a430e36 in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x2b8a76864817  : ??? + 0x2b8a76864817 in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libiostreams.so.0
0x2b8a7a50660a  : hpx::components::server::runtime_support::call_shutdown_functions(bool) + 0xaa in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x2b8a7a538065  : ??? + 0x2b8a7a538065 in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x2b8a7a10a30b  : ??? + 0x2b8a7a10a30b in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
{what}: Segmentation fault

Stack trace:

#23 ?? () (at 0x0000000000000000)
#22 hpx::util::coroutines::detail::lx::trampoline (fun=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/util/detail/basic_function.hpp:223 (at 0x00002aaaae41330b)
#21 hpx::util::invoke_r(hpx::util::decay<hpx::actions::detail::continuation_thread_function<boost::mpl::if_<boost::is_same<hpx::components::server::runtime_support::call_shutdown_functions_action, hpx::actions::detail::this_type>, hpx::actions::action<struct <unnamed>, {(void (*)())hpx::components::server::runtime_support::call_shutdown_functions, 0L}, hpx::components::server::runtime_support::call_shutdown_functions_action>, hpx::components::server::runtime_support::call_shutdown_functions_action>::type, hpx::actions::basic_action<hpx::components::server::runtime_support, void (bool), boost::mpl::if_<boost::is_same<hpx::components::server::runtime_support::call_shutdown_functions_action, hpx::actions::detail::this_type>, hpx::actions::action<struct <unnamed>, {(void (*)())hpx::components::server::runtime_support::call_shutdown_functions, 0L}, hpx::components::server::runtime_support::call_shutdown_functions_action>, hpx::components::server::runtime_support::call_shutdown_functions_action>::type>::invoker, hpx::naming::address::address_type, hpx::util::tuple_element<0UL, hpx::actions::basic_action<hpx::components::server::runtime_support, void (bool), boost::mpl::if_<boost::is_same<hpx::components::server::runtime_support::call_shutdown_functions_action, hpx::actions::detail::this_type>, hpx::actions::action<struct <unnamed>, {(void (*)())hpx::components::server::runtime_support::call_shutdown_functions, 0L}, hpx::components::server::runtime_support::call_shutdown_functions_action>, hpx::components::server::runtime_support::call_shutdown_functions_action>::type>::arguments_type>::type> >::type &, enum hpx::threads::thread_state_ex_enum &) (f=..., vs=@0x1) at /scratch/03115/tg824139/hpx/repo/hpx/runtime/actions/component_action.hpp:66 (at 0x00002aaaae841065)
#20 hpx::components::server::runtime_support::call_shutdown_functions (this=0x2aabfb323860, pre_shutdown=true) at /scratch/03115/tg824139/hpx/repo/hpx/util/detail/basic_function.hpp:223 (at 0x00002aaaae80f60a)
#19 hpx::iostreams::detail::unregister_ostreams () at /scratch/03115/tg824139/hpx/repo/src/components/iostreams/component_module.cpp:54
#18 uninitialize () at /scratch/03115/tg824139/hpx/repo/hpx/components/iostreams/ostream.hpp:248
#17 free () at /scratch/03115/tg824139/hpx/repo/hpx/runtime/components/client_base.hpp:258
#16 operator= () at /scratch/03115/tg824139/hpx/repo/hpx/lcos/future.hpp:1086
#15 operator= () at /scratch/03115/tg824139/hpx/repo/hpx/lcos/future.hpp:503
#14 operator= () at /opt/apps/intel14/boost/1.55.0/x86_64/include/boost/smart_ptr/intrusive_ptr.hpp:121
#13 ~intrusive_ptr () at /opt/apps/intel14/boost/1.55.0/x86_64/include/boost/smart_ptr/intrusive_ptr.hpp:97
#12 intrusive_ptr_release () at /scratch/03115/tg824139/hpx/repo/hpx/lcos/detail/future_data.hpp:83 (at 0x00002aaaaab70817)
#11 hpx::lcos::detail::continuation<hpx::lcos::future<bool>, boost::disable_if_c<false, hpx::util::detail::bound<void (*)(hpx::lcos::future<hpx::naming::id_type> *, hpx::lcos::future<bool> *, hpx::lcos::promise<hpx::naming::id_type, hpx::naming::gid_type> *), hpx::util::tuple<hpx::util::detail::placeholder<1UL>, hpx::lcos::promise<hpx::naming::id_type, hpx::naming::gid_type> > > >::type, boost::detail::cpp0x_result_of_impl<void (boost::disable_if_c<false, hpx::util::detail::bound<void (*)(hpx::lcos::future<hpx::naming::id_type> *, hpx::lcos::future<bool> *, hpx::lcos::promise<hpx::naming::id_type, hpx::naming::gid_type> *), hpx::util::tuple<hpx::util::detail::placeholder<1UL>, hpx::lcos::promise<hpx::naming::id_type, hpx::naming::gid_type> > > >::type *, hpx::lcos::future<bool> *), void>::type>::~continuation(void) (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/lcos/local/packaged_continuation.hpp:150
#10 ~continuation (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/lcos/local/packaged_continuation.hpp:150
#9  ~bound (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/util/bind.hpp:357
#8  ~tuple (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/util/tuple.hpp:328
#7  ~tuple_impl (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/util/tuple.hpp:179
#6  ~tuple_member (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/util/tuple.hpp:69
#5  ~tuple_member (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/util/tuple.hpp:69
#4  ~promise (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/lcos/promise.hpp:580
#3  ~intrusive_ptr (this=0x2aabfb323860) at /opt/apps/intel14/boost/1.55.0/x86_64/include/boost/smart_ptr/intrusive_ptr.hpp:97
#2  intrusive_ptr_release (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/runtime/components/server/managed_component_base.hpp:346
#1  release (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/runtime/components/server/managed_component_base.hpp:147
#0 intrusive_ptr_release (this=0x2aabfb323860) at /scratch/03115/tg824139/hpx/repo/hpx/lcos/promise.hpp:294 (at 0x00002aaaae739e36)
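
Frames #3 through #0 show an intrusive_ptr destructor releasing a promise component that appears to have been torn down already while the shutdown functions run. For context, here is a minimal self-contained sketch of the boost::intrusive_ptr protocol those frames go through (simplified types, not HPX's actual classes):

#include <atomic>
#include <boost/intrusive_ptr.hpp>

struct ref_counted
{
    std::atomic<long> count{0};
    virtual ~ref_counted() = default;
};

// boost::intrusive_ptr finds these free functions via argument-dependent lookup.
void intrusive_ptr_add_ref(ref_counted* p) { p->count.fetch_add(1); }
void intrusive_ptr_release(ref_counted* p)
{
    // Frame #0 above is a call of this shape; if the pointee has already
    // been destroyed, the counter access itself is a use-after-free.
    if (p->count.fetch_sub(1) == 1)
        delete p;
}

int main()
{
    boost::intrusive_ptr<ref_counted> p(new ref_counted);
}   // ~intrusive_ptr() calls intrusive_ptr_release(p.get()) here

A double release or a use-after-free at that point would be consistent with the segfault reported at promise.hpp:294.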

Log:

TACC: Starting up job 4999616
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
Localities,OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
2,     32,    33.002648771, 100000,               20000,                45                   
/agas{locality#0/total}/count/allocate,1,33.638671,[s],23
/agas{locality#1/total}/count/allocate,1,33.619783,[s],43
/agas{locality#0/total}/count/bind,1,33.639056,[s],80
/agas{locality#1/total}/count/bind,1,33.616230,[s],58
/agas{locality#0/total}/count/bind_gid,1,33.638627,[s],460099
/agas{locality#1/total}/count/bind_gid,1,33.616509,[s],460125
/agas{locality#0/total}/count/cache-evictions,1,33.638765,[s],13640
/agas{locality#1/total}/count/cache-evictions,1,33.616597,[s],0
/agas{locality#0/total}/count/cache-hits,1,33.639094,[s],16392
/agas{locality#1/total}/count/cache-hits,1,33.616293,[s],10387
/agas{locality#0/total}/count/cache-insertions,1,33.639068,[s],14664
/agas{locality#1/total}/count/cache-insertions,1,33.616459,[s],121
/agas{locality#0/total}/count/cache-misses,1,33.628842,[s],3.69028e+06
/agas{locality#1/total}/count/cache-misses,1,33.616522,[s],4.07135e+06
/agas{locality#0/total}/count/cache_erase_entry,1,33.628923,[s],60
/agas{locality#1/total}/count/cache_erase_entry,1,33.616397,[s],0
/agas{locality#0/total}/count/cache_get_entry,1,33.629187,[s],3.69205e+06
/agas{locality#1/total}/count/cache_get_entry,1,33.616510,[s],4.0816e+06
/agas{locality#0/total}/count/cache_insert_entry,1,33.629239,[s],14664
/agas{locality#1/total}/count/cache_insert_entry,1,33.616471,[s],121
/agas{locality#0/total}/count/cache_update_entry,1,33.638717,[s],14684
/agas{locality#1/total}/count/cache_update_entry,1,33.616421,[s],133
/agas{locality#0/total}/count/decrement_credit,1,33.639000,[s],2.29133e+06
/agas{locality#1/total}/count/decrement_credit,1,33.616351,[s],2.26067e+06
/agas{locality#0/total}/count/increment_credit,1,33.638689,[s],2
/agas{locality#1/total}/count/increment_credit,1,33.619656,[s],3
/agas{locality#0/total}/count/resolve,1,33.638921,[s],129
/agas{locality#1/total}/count/resolve,1,33.619784,[s],57
/agas{locality#0/total}/count/resolve_gid,1,33.629078,[s],2.31556e+06
/agas{locality#1/total}/count/resolve_gid,1,33.616559,[s],2.71142e+06
/agas{locality#0/total}/count/route,1,33.629294,[s],133
/agas{locality#1/total}/count/route,1,33.616300,[s],10154
/agas{locality#0/total}/count/unbind,1,33.639331,[s],1
/agas{locality#1/total}/count/unbind,1,33.619393,[s],0
/agas{locality#0/total}/count/unbind_gid,1,33.639330,[s],440069
/agas{locality#1/total}/count/unbind_gid,1,33.622753,[s],440104
/agas{locality#0/total}/primary/count,1,33.639437,[s],5.50728e+06
/agas{locality#1/total}/primary/count,1,33.619530,[s],5.88259e+06
/agas{locality#0/total}/primary/time,1,33.639676,[s],4.1616e+12,[ns]
/agas{locality#1/total}/primary/time,1,33.622782,[s],4.64811e+12,[ns]
/agas{locality#0/total}/symbol/count,1,33.639722,[s],217
/agas{locality#1/total}/symbol/count,1,33.622933,[s],116
/agas{locality#0/total}/symbol/time,1,33.639618,[s],1.852e+06,[ns]
/agas{locality#1/total}/symbol/time,1,33.619347,[s],1.55541e+06,[ns]
/agas{locality#0/total}/time/allocate,1,33.639850,[s],250527,[ns]
/agas{locality#1/total}/time/allocate,1,33.622829,[s],1.00435e+06,[ns]
/agas{locality#0/total}/time/bind,1,33.639685,[s],765384,[ns]
/agas{locality#1/total}/time/bind,1,33.619444,[s],505066,[ns]
/agas{locality#0/total}/time/bind_gid,1,33.629848,[s],6.86842e+11,[ns]
/agas{locality#1/total}/time/bind_gid,1,33.619913,[s],6.93509e+11,[ns]
/agas{locality#0/total}/time/cache_erase_entry,1,33.629912,[s],60,[ns]
/agas{locality#1/total}/time/cache_erase_entry,1,33.619595,[s],0,[ns]
/agas{locality#0/total}/time/cache_get_entry,1,33.639655,[s],3.69214e+06,[ns]
/agas{locality#1/total}/time/cache_get_entry,1,33.619834,[s],4.08167e+06,[ns]
/agas{locality#0/total}/time/cache_insert_entry,1,33.640118,[s],14664,[ns]
/agas{locality#1/total}/time/cache_insert_entry,1,33.619607,[s],121,[ns]
/agas{locality#0/total}/time/cache_update_entry,1,33.639703,[s],14684,[ns]
/agas{locality#1/total}/time/cache_update_entry,1,33.619703,[s],133,[ns]
/agas{locality#0/total}/time/decrement_credit,1,33.629987,[s],4.3547e+11,[ns]
/agas{locality#1/total}/time/decrement_credit,1,33.619666,[s],4.52338e+11,[ns]
/agas{locality#0/total}/time/increment_credit,1,33.639972,[s],613763,[ns]
/agas{locality#1/total}/time/increment_credit,1,33.623362,[s],4.58974e+06,[ns]
/agas{locality#0/total}/time/resolve,1,33.640357,[s],376618,[ns]
/agas{locality#1/total}/time/resolve,1,33.623243,[s],155852,[ns]
/agas{locality#0/total}/time/resolve_gid,1,33.630270,[s],2.96886e+12,[ns]
/agas{locality#1/total}/time/resolve_gid,1,33.619876,[s],3.42928e+12,[ns]
/agas{locality#0/total}/time/route,1,33.630229,[s],2.51027e+08,[ns]
/agas{locality#1/total}/time/route,1,33.623122,[s],2.78685e+08,[ns]
/agas{locality#0/total}/time/unbind,1,33.640068,[s],43751,[ns]
/agas{locality#1/total}/time/unbind,1,33.622953,[s],0,[ns]
/agas{locality#0/total}/time/unbind_gid,1,33.640451,[s],7.0174e+10,[ns]
/agas{locality#1/total}/time/unbind_gid,1,33.619697,[s],7.27063e+10,[ns]
/agas{locality#0/total}/count/bind_name,1,33.630561,[s],26
/agas{locality#0/total}/count/bind_prefix,1,33.640549,[s],64
/agas{locality#0/total}/component/count,1,33.640224,[s],90
/agas{locality#0/total}/component/time,1,33.640719,[s],561129,[ns]
/agas{locality#0/total}/count/free,1,33.640479,[s],0
/agas{locality#0/total}/count/localities,1,33.640193,[s],6
/agas{locality#0/total}/count/num_localities,1,33.630573,[s],180
/agas{locality#0/total}/count/num_localities_type,1,33.640378,[s],0
/agas{locality#0/total}/count/num_threads,1,33.640354,[s],3
/agas{locality#0/total}/count/resolve_id,1,33.630494,[s],0
/agas{locality#0/total}/count/resolve_locality,1,33.630481,[s],0
/agas{locality#0/total}/count/resolved_localities,1,33.640353,[s],0
/agas{locality#0/total}/count/unbind_name,1,33.630541,[s],0
/data{locality#0/total}/count/mpi/received,1,33.640568,[s],8.01062e+09,[bytes]
/data{locality#1/total}/count/mpi/received,1,33.620212,[s],8.0627e+06,[bytes]
/data{locality#0/total}/count/mpi/sent,1,33.640634,[s],8.05422e+06,[bytes]
/data{locality#1/total}/count/mpi/sent,1,33.620486,[s],1.06512e+07,[bytes]
/data{locality#0/total}/time/mpi/received,1,33.640787,[s],5.99811e+09,[ns]
/data{locality#1/total}/time/mpi/received,1,33.623443,[s],5.91665e+07,[ns]
/data{locality#0/total}/time/mpi/sent,1,33.640397,[s],2.33462e+08,[ns]
/data{locality#1/total}/time/mpi/sent,1,33.620100,[s],7.55896e+09,[ns]
/messages{locality#0/total}/count/mpi/received,1,33.640444,[s],31055
/messages{locality#1/total}/count/mpi/received,1,33.620321,[s],21024
/messages{locality#0/total}/count/mpi/sent,1,33.630929,[s],21013
/messages{locality#1/total}/count/mpi/sent,1,33.619794,[s],31119
/parcelport{locality#0/total}/count/mpi/cache-evictions,1,33.640566,[s],0
/parcelport{locality#1/total}/count/mpi/cache-evictions,1,33.620712,[s],0
/parcelport{locality#0/total}/count/mpi/cache-hits,1,33.640931,[s],0
/parcelport{locality#1/total}/count/mpi/cache-hits,1,33.620637,[s],0
/parcelport{locality#0/total}/count/mpi/cache-insertions,1,33.640837,[s],0
/parcelport{locality#1/total}/count/mpi/cache-insertions,1,33.624040,[s],0
/parcelport{locality#0/total}/count/mpi/cache-misses,1,33.641273,[s],0
/parcelport{locality#1/total}/count/mpi/cache-misses,1,33.620688,[s],0
/parcelport{locality#0/total}/count/mpi/cache-reclaims,1,33.640756,[s],0
/parcelport{locality#1/total}/count/mpi/cache-reclaims,1,33.620774,[s],0
/parcels{locality#0/total}/count/mpi/received,1,33.640807,[s],31055
/parcels{locality#1/total}/count/mpi/received,1,33.620343,[s],21024
/parcels{locality#0/total}/count/mpi/sent,1,33.641177,[s],21019
/parcels{locality#1/total}/count/mpi/sent,1,33.620318,[s],31128
/parcels{locality#0/total}/time/mpi/buffer_allocate/received,1,33.641484,[s],4.79685e+13,[ns]
/parcels{locality#1/total}/time/mpi/buffer_allocate/received,1,33.620210,[s],4.78738e+13,[ns]
/parcels{locality#0/total}/time/mpi/buffer_allocate/sent,1,33.641513,[s],4.79685e+13,[ns]
/parcels{locality#1/total}/time/mpi/buffer_allocate/sent,1,33.623670,[s],4.78738e+13,[ns]
/serialize{locality#0/total}/count/mpi/received,1,33.641573,[s],0,[bytes]
/serialize{locality#1/total}/count/mpi/received,1,33.620191,[s],0,[bytes]
/serialize{locality#0/total}/count/mpi/sent,1,33.641634,[s],8.0612e+06,[bytes]
/serialize{locality#1/total}/count/mpi/sent,1,33.620377,[s],8.01065e+09,[bytes]
/serialize{locality#0/total}/time/mpi/received,1,33.641440,[s],1.27835e+09,[ns]
/serialize{locality#1/total}/time/mpi/received,1,33.620252,[s],3.16496e+08,[ns]
/serialize{locality#0/total}/time/mpi/sent,1,33.641724,[s],5.43213e+08,[ns]
/serialize{locality#1/total}/time/mpi/sent,1,33.623601,[s],8.19503e+08,[ns]
{stack-trace}: 8 frames:
0x2b8a7a13beea  : hpx::termination_handler(int) + 0x1ca in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x3d9ea0f710    : ??? + 0x3d9ea0f710 in /lib64/libpthread.so.0
0x2b8a7a430e36  : ??? + 0x2b8a7a430e36 in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x2b8a76864817  : ??? + 0x2b8a76864817 in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libiostreams.so.0
0x2b8a7a50660a  : hpx::components::server::runtime_support::call_shutdown_functions(bool) + 0xaa in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x2b8a7a538065  : ??? + 0x2b8a7a538065 in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
0x2b8a7a10a30b  : ??? + 0x2b8a7a10a30b in /scratch/03115/tg824139/hpx/build/relwithdebinfo-intel/lib/libhpx.so.0
{what}: Segmentation fault
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=OFF
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_PARCELPORT_TCP=OFF
  HPX_PARCELPORT_MPI=ON (MPICH V3.1b1, MPI V3.0)
  HPX_PARCELPORT_IPC=OFF
  HPX_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_RUN_MAIN_EVERYWHERE=OFF
  HPX_LIMIT=5
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
  HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
  HPX_PREFIX (configured)=/scratch/03115/tg824139/hpx/install
  HPX_PREFIX=/scratch/03115/tg824139/hpx/build/relwithdebinfo-intel
{version}: V0.9.10-rc1 (AGAS: V3.0), Git: 817963e1c74b10c3ee459c4f8455d0d6f470822e
{boost}: V1.55.0
{build-type}: release
{date}: Mar 18 2015 12:15:33
{platform}: linux
{compiler}: Intel C++ C++0x mode version 1400
{stdlib}: GNU libstdc++ version 20120313
[c424-102.stampede.tacc.utexas.edu:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)
[c424-102.stampede.tacc.utexas.edu:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[c424-102.stampede.tacc.utexas.edu:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[c424-102.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process (rank: 1, pid: 45957) terminated with signal 6 -> abort job
[c423-504.stampede.tacc.utexas.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node c424-102 aborted: Error while reading a PMI socket (4)
[c423-504.stampede.tacc.utexas.edu:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 7. MPI process died?
[c423-504.stampede.tacc.utexas.edu:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 7. MPI process died?
[c423-504.stampede.tacc.utexas.edu:mpispawn_0][handle_mt_peer] Error while reading PMI socket. MPI process died?
TACC: MPI job exited with code: 1

TACC: Shutdown complete. Exiting.
[c423-504.stampede.tacc.utexas.edu:mpispawn_0][report_error] connect() failed: Connection refused (111)
hkaiser commented 9 years ago

Please give us your command line as well.

sithhell commented 9 years ago

This could be related to that warning:

WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.

Could you try configuring with HPX_MALLOC=custom and running again, please?
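
For reference, that suggestion would amount to a configure line along these lines (a sketch: the other options should match the original build, and the source path is the one from the trace above):

cmake -DHPX_MALLOC=custom -DCMAKE_BUILD_TYPE=RelWithDebInfo <other original options> /scratch/03115/tg824139/hpx/repo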

parsa commented 9 years ago

@hkaiser The full command line is: ibrun $SCRATCH/hpx/build/relwithdebinfo-intel/bin/1d_stencil_8 --nx 100000 --np 20000 -t $SLURM_CPUS_ON_NODE --hpx:options-file cfg.ini, with cfg.ini being:

--hpx:print-counter=/agas{locality#*/total}/count/allocate
--hpx:print-counter=/agas{locality#*/total}/count/bind
--hpx:print-counter=/agas{locality#*/total}/count/bind_gid
--hpx:print-counter=/agas{locality#*/total}/count/cache-evictions
--hpx:print-counter=/agas{locality#*/total}/count/cache-hits
--hpx:print-counter=/agas{locality#*/total}/count/cache-insertions
--hpx:print-counter=/agas{locality#*/total}/count/cache-misses
--hpx:print-counter=/agas{locality#*/total}/count/cache_erase_entry
--hpx:print-counter=/agas{locality#*/total}/count/cache_get_entry
--hpx:print-counter=/agas{locality#*/total}/count/cache_insert_entry
--hpx:print-counter=/agas{locality#*/total}/count/cache_update_entry
--hpx:print-counter=/agas{locality#*/total}/count/decrement_credit
--hpx:print-counter=/agas{locality#*/total}/count/increment_credit
--hpx:print-counter=/agas{locality#*/total}/count/resolve
--hpx:print-counter=/agas{locality#*/total}/count/resolve_gid
--hpx:print-counter=/agas{locality#*/total}/count/route
--hpx:print-counter=/agas{locality#*/total}/count/unbind
--hpx:print-counter=/agas{locality#*/total}/count/unbind_gid
--hpx:print-counter=/agas{locality#*/total}/primary/count
--hpx:print-counter=/agas{locality#*/total}/primary/time
--hpx:print-counter=/agas{locality#*/total}/symbol/count
--hpx:print-counter=/agas{locality#*/total}/symbol/time
--hpx:print-counter=/agas{locality#*/total}/time/allocate
--hpx:print-counter=/agas{locality#*/total}/time/bind
--hpx:print-counter=/agas{locality#*/total}/time/bind_gid
--hpx:print-counter=/agas{locality#*/total}/time/cache_erase_entry
--hpx:print-counter=/agas{locality#*/total}/time/cache_get_entry
--hpx:print-counter=/agas{locality#*/total}/time/cache_insert_entry
--hpx:print-counter=/agas{locality#*/total}/time/cache_update_entry
--hpx:print-counter=/agas{locality#*/total}/time/decrement_credit
--hpx:print-counter=/agas{locality#*/total}/time/increment_credit
--hpx:print-counter=/agas{locality#*/total}/time/resolve
--hpx:print-counter=/agas{locality#*/total}/time/resolve_gid
--hpx:print-counter=/agas{locality#*/total}/time/route
--hpx:print-counter=/agas{locality#*/total}/time/unbind
--hpx:print-counter=/agas{locality#*/total}/time/unbind_gid
--hpx:print-counter=/agas{locality#0/total}/count/bind_name
--hpx:print-counter=/agas{locality#0/total}/count/bind_prefix
--hpx:print-counter=/agas{locality#0/total}/component/count
--hpx:print-counter=/agas{locality#0/total}/component/time
--hpx:print-counter=/agas{locality#0/total}/count/free
--hpx:print-counter=/agas{locality#0/total}/count/localities
--hpx:print-counter=/agas{locality#0/total}/count/num_localities
--hpx:print-counter=/agas{locality#0/total}/count/num_localities_type
--hpx:print-counter=/agas{locality#0/total}/count/num_threads
--hpx:print-counter=/agas{locality#0/total}/count/resolve_id
--hpx:print-counter=/agas{locality#0/total}/count/resolve_locality
--hpx:print-counter=/agas{locality#0/total}/count/resolved_localities
--hpx:print-counter=/agas{locality#0/total}/count/unbind_name
--hpx:print-counter=/data{locality#*/total}/count/mpi/received
--hpx:print-counter=/data{locality#*/total}/count/mpi/sent
--hpx:print-counter=/data{locality#*/total}/time/mpi/received
--hpx:print-counter=/data{locality#*/total}/time/mpi/sent
--hpx:print-counter=/messages{locality#*/total}/count/mpi/received
--hpx:print-counter=/messages{locality#*/total}/count/mpi/sent
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-evictions
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-hits
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-insertions
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-misses
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-reclaims
--hpx:print-counter=/parcels{locality#*/total}/count/mpi/received
--hpx:print-counter=/parcels{locality#*/total}/count/mpi/sent
--hpx:print-counter=/parcels{locality#*/total}/time/mpi/buffer_allocate/received
--hpx:print-counter=/parcels{locality#*/total}/time/mpi/buffer_allocate/sent
--hpx:print-counter=/serialize{locality#*/total}/count/mpi/received
--hpx:print-counter=/serialize{locality#*/total}/count/mpi/sent
--hpx:print-counter=/serialize{locality#*/total}/time/mpi/received
--hpx:print-counter=/serialize{locality#*/total}/time/mpi/sent
sithhell commented 9 years ago

The reason for this might be running out of memory: the grids alone take up 32 GB of RAM, i.e. 16 GB per node. Depending on how much memory MPI itself needs, that might easily eat up all available memory. Could you please check that?
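
For reference, a back-of-the-envelope version of that estimate; the double-precision grid points and the two time-step copies are assumptions for illustration, not taken from the example's source:

#include <cstdio>

int main()
{
    // Parameters from the reported command line: --nx 100000 --np 20000
    long long const points_per_partition = 100000;
    long long const partitions = 20000;
    long long const bytes_per_point = 8;  // assuming double precision
    int const timestep_copies = 2;        // assuming current + next grid
    int const nodes = 2;

    long long const total =
        points_per_partition * partitions * bytes_per_point * timestep_copies;
    std::printf("grids total   : %.1f GB\n", total / 1e9);          // 32.0 GB
    std::printf("grids per node: %.1f GB\n", total / 1e9 / nodes);  // 16.0 GB
    return 0;
}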

parsa commented 9 years ago

Verified: it's not running out of memory. I've added the stack trace above.

sithhell commented 8 years ago

Is this still a problem?

parsa commented 8 years ago

I don't have access to Stampede anymore, so I can't say.

msimberg commented 6 years ago

Should this be closed? Is anyone able to verify whether this is still a problem?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed. Please re-open if necessary.