parsa closed this issue 5 years ago
Please give us your command line as well
Could be related to that warning:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
Could you try configuring with HPX_MALLOC=custom and try again please?
@hkaiser Full command line is: ibrun $SCRATCH/hpx/build/relwithdebinfo-intel/bin/1d_stencil_8 --nx 100000 --np 20000 -t $SLURM_CPUS_ON_NODE --hpx:options-file cfg.ini, with cfg.ini being:
--hpx:print-counter=/agas{locality#*/total}/count/allocate
--hpx:print-counter=/agas{locality#*/total}/count/bind
--hpx:print-counter=/agas{locality#*/total}/count/bind_gid
--hpx:print-counter=/agas{locality#*/total}/count/cache-evictions
--hpx:print-counter=/agas{locality#*/total}/count/cache-hits
--hpx:print-counter=/agas{locality#*/total}/count/cache-insertions
--hpx:print-counter=/agas{locality#*/total}/count/cache-misses
--hpx:print-counter=/agas{locality#*/total}/count/cache_erase_entry
--hpx:print-counter=/agas{locality#*/total}/count/cache_get_entry
--hpx:print-counter=/agas{locality#*/total}/count/cache_insert_entry
--hpx:print-counter=/agas{locality#*/total}/count/cache_update_entry
--hpx:print-counter=/agas{locality#*/total}/count/decrement_credit
--hpx:print-counter=/agas{locality#*/total}/count/increment_credit
--hpx:print-counter=/agas{locality#*/total}/count/resolve
--hpx:print-counter=/agas{locality#*/total}/count/resolve_gid
--hpx:print-counter=/agas{locality#*/total}/count/route
--hpx:print-counter=/agas{locality#*/total}/count/unbind
--hpx:print-counter=/agas{locality#*/total}/count/unbind_gid
--hpx:print-counter=/agas{locality#*/total}/primary/count
--hpx:print-counter=/agas{locality#*/total}/primary/time
--hpx:print-counter=/agas{locality#*/total}/symbol/count
--hpx:print-counter=/agas{locality#*/total}/symbol/time
--hpx:print-counter=/agas{locality#*/total}/time/allocate
--hpx:print-counter=/agas{locality#*/total}/time/bind
--hpx:print-counter=/agas{locality#*/total}/time/bind_gid
--hpx:print-counter=/agas{locality#*/total}/time/cache_erase_entry
--hpx:print-counter=/agas{locality#*/total}/time/cache_get_entry
--hpx:print-counter=/agas{locality#*/total}/time/cache_insert_entry
--hpx:print-counter=/agas{locality#*/total}/time/cache_update_entry
--hpx:print-counter=/agas{locality#*/total}/time/decrement_credit
--hpx:print-counter=/agas{locality#*/total}/time/increment_credit
--hpx:print-counter=/agas{locality#*/total}/time/resolve
--hpx:print-counter=/agas{locality#*/total}/time/resolve_gid
--hpx:print-counter=/agas{locality#*/total}/time/route
--hpx:print-counter=/agas{locality#*/total}/time/unbind
--hpx:print-counter=/agas{locality#*/total}/time/unbind_gid
--hpx:print-counter=/agas{locality#0/total}/count/bind_name
--hpx:print-counter=/agas{locality#0/total}/count/bind_prefix
--hpx:print-counter=/agas{locality#0/total}/component/count
--hpx:print-counter=/agas{locality#0/total}/component/time
--hpx:print-counter=/agas{locality#0/total}/count/free
--hpx:print-counter=/agas{locality#0/total}/count/localities
--hpx:print-counter=/agas{locality#0/total}/count/num_localities
--hpx:print-counter=/agas{locality#0/total}/count/num_localities_type
--hpx:print-counter=/agas{locality#0/total}/count/num_threads
--hpx:print-counter=/agas{locality#0/total}/count/resolve_id
--hpx:print-counter=/agas{locality#0/total}/count/resolve_locality
--hpx:print-counter=/agas{locality#0/total}/count/resolved_localities
--hpx:print-counter=/agas{locality#0/total}/count/unbind_name
--hpx:print-counter=/data{locality#*/total}/count/mpi/received
--hpx:print-counter=/data{locality#*/total}/count/mpi/sent
--hpx:print-counter=/data{locality#*/total}/time/mpi/received
--hpx:print-counter=/data{locality#*/total}/time/mpi/sent
--hpx:print-counter=/messages{locality#*/total}/count/mpi/received
--hpx:print-counter=/messages{locality#*/total}/count/mpi/sent
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-evictions
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-hits
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-insertions
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-misses
--hpx:print-counter=/parcelport{locality#*/total}/count/mpi/cache-reclaims
--hpx:print-counter=/parcels{locality#*/total}/count/mpi/received
--hpx:print-counter=/parcels{locality#*/total}/count/mpi/sent
--hpx:print-counter=/parcels{locality#*/total}/time/mpi/buffer_allocate/received
--hpx:print-counter=/parcels{locality#*/total}/time/mpi/buffer_allocate/sent
--hpx:print-counter=/serialize{locality#*/total}/count/mpi/received
--hpx:print-counter=/serialize{locality#*/total}/count/mpi/sent
--hpx:print-counter=/serialize{locality#*/total}/time/mpi/received
--hpx:print-counter=/serialize{locality#*/total}/time/mpi/sent
The reason for this might be running out of memory. The grids alone take up 32 GB of RAM, which makes 16 GB per node on 2 nodes. Depending on how much memory MPI and everything else needs, the run might easily eat up all available memory. Could you please check that?
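For reference, here is a rough sketch of where the 32 GB figure could come from. It assumes that --nx is the number of grid points per partition, that --np is the number of partitions, that each point is an 8-byte double, and that two copies of the grid (current and next time step) are held at once. None of these assumptions are stated in this thread, and the function name is made up for illustration.

```python
def grid_memory_bytes(nx, num_partitions, bytes_per_point=8, copies=2):
    """Estimate the bytes needed for the stencil grids.

    Assumptions (not confirmed in this thread): each point is an
    8-byte double and two full copies of the grid are kept in memory.
    """
    return nx * num_partitions * bytes_per_point * copies


nodes = 2  # from the issue description: 2 nodes, 1 locality per node
total_bytes = grid_memory_bytes(nx=100_000, num_partitions=20_000)
per_node_bytes = total_bytes // nodes  # assumes an even split across nodes

print(f"total:    {total_bytes / 1e9:.0f} GB")
print(f"per node: {per_node_bytes / 1e9:.0f} GB")
```

Under those assumptions the estimate matches the 32 GB total and 16 GB per node quoted above. MPI buffers and runtime overhead come on top of that and are not modelled here.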
Verified. It's not running out of memory. Added the stack trace above.
Is this still a problem?
Don't have access to Stampede anymore... Cannot say anything
Should this be closed? Is someone able to verify if this is still a problem?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed. Please re-open if necessary.
Changeset: 817963e1c74b10c3ee459c4f8455d0d6f470822e
Arguments: --nx 100000 --np 20000 -t 16
Configuration: Debug, Release, RelWithDebInfo, Boost 1.55.0, 2 nodes, 1 locality per node, Stampede
Location: hpx/lcos/promise.hpp:294
Message:
Stack trace:
Log: