Sandia-OpenSHMEM / SOS

Sandia OpenSHMEM is an implementation of the OpenSHMEM specification over multiple networking APIs, including Portals 4, the OpenFabrics Interfaces (OFI), and UCX. See the project wiki for help with building and using SOS.

UCX + CMA shmem_ctx test fails: bus error (nonexistent physical address) #1148

Open davidozog opened 2 months ago

davidozog commented 2 months ago

https://github.com/Sandia-OpenSHMEM/SOS/actions/runs/10924530684/job/30323799111?pr=1146

The failure appears to be intermittent and rare.
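Because the failure is intermittent, one way to get a rough idea of how often it occurs is to rerun the test many times and tally the exit statuses. The following is a minimal sketch, not part of the SOS test suite: the helper names `rerun` and `failure_rate` are hypothetical, and the launch command shown in the comment is an assumption that would need to match the local build and launcher.

```python
import subprocess
import sys
from collections import Counter


def rerun(cmd, runs):
    """Run `cmd` `runs` times and tally the exit statuses observed."""
    tally = Counter()
    for _ in range(runs):
        # Output is discarded here; keep it instead if the logs are needed.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        tally[result.returncode] += 1
    return tally


def failure_rate(tally):
    """Fraction of runs that ended with a nonzero exit status."""
    total = sum(tally.values())
    failed = total - tally.get(0, 0)
    return failed / total if total else 0.0


# Hypothetical usage (launcher, PE count, and path are assumptions):
#   tally = rerun(["mpiexec", "-n", "2", "./shmem_ctx"], 500)
#   print(tally, failure_rate(tally))
```

Tallying by exit status rather than a plain pass/fail count keeps different failure modes distinguishable if more than one shows up across runs.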

FAIL: shmem_ctx
===============

[0001] DEBUG: ../../src/init.c:376: shmem_internal_heap_postinit
[0001]        Thread level=MULTIPLE, Num. PEs=2
[0001]        Sym. heap=0x5555c0000000 len=537919488 -- data=0x555555558000 len=104
[0001] DEBUG: ../../src/init.c:457: shmem_internal_heap_postinit
[0001]        Affinity to 4 processor cores: { 0 1 2 3 }
Sandia OpenSHMEM 1.5.3rc1
  SHMEM_INFO                 1 (type: bool, default: 0)
    Print library information message at startup
  SHMEM_VERSION              0 (type: bool, default: 0)
    Print library version at startup
  SHMEM_DEBUG                1 (type: bool, default: 0)
    Enable debugging messages
  SHMEM_SYMMETRIC_SIZE       536870912 (type: size, default: 536870912)
    Symmetric heap size

Additional options:
make[6]: *** [Makefile:1180: test-suite.log] Error 1
make[5]: *** [Makefile:1288: check-TESTS] Error 2
make[4]: *** [Makefile:1495: check-am] Error 2
make[3]: *** [Makefile:469: check-recursive] Error 1
  SHMEM_SYMMETRIC_HEAP_USE_HUGE_PAGES 0 (type: bool, default: 0)
    Use Linux huge pages for symmetric heap
  SHMEM_SYMMETRIC_HEAP_PAGE_SIZE 2097152 (type: size, default: 2097152)
    Page size to use for huge pages
  SHMEM_SYMMETRIC_HEAP_USE_MALLOC 0 (type: bool, default: 0)
    Allocate the symmetric heap using malloc
  SHMEM_BOUNCE_SIZE          0 (type: size, default: 2048)
    Maximum message size to bounce buffer
  SHMEM_MAX_BOUNCE_BUFFERS   128 (type: long, default: 128)
    Maximum number of bounce buffers per context
  SHMEM_TRAP_ON_ABORT        0 (type: bool, default: 0)
    Generate trap if the program aborts or calls shmem_global_exit
  SHMEM_TEAMS_MAX            10 (type: long, default: 10)
    Maximum number of teams per PE
  SHMEM_TEAM_SHARED_ONLY_SELF 0 (type: bool, default: 0)
    Include only the self PE in SHMEM_TEAM_SHARED
  SHMEM_BACKTRACE             (type: string, default: )
    Specify the mechanism to use for backtracing on failure

Collectives options:
  SHMEM_COLL_CROSSOVER       4 (type: long, default: 4)
    Crossover between linear and tree collectives (num. PEs)
  SHMEM_COLL_SIZE_CROSSOVER  16384 (type: size, default: 16384)
    Crossover between latency and bandwidth optimized collectives (msg. size)
  SHMEM_COLL_RADIX           4 (type: long, default: 4)
    Radix for tree-based collectives
  SHMEM_BARRIER_ALGORITHM    auto (type: string, default: auto)
    Algorithm for barrier.  Options are auto, linear, tree, dissem
  SHMEM_BCAST_ALGORITHM      auto (type: string, default: auto)
    Algorithm for broadcast.  Options are auto, linear, tree
  SHMEM_REDUCE_ALGORITHM     auto (type: string, default: auto)
    Algorithm for reductions.  Options are auto, linear, tree, recdbl
  SHMEM_COLLECT_ALGORITHM    auto (type: string, default: auto)
    Algorithm for collect.  Options are auto, linear
  SHMEM_FCOLLECT_ALGORITHM   auto (type: string, default: auto)
    Algorithm for fcollect.  Options are auto, linear, ring, recdbl
  SHMEM_BARRIERS_FLUSH       0 (type: bool, default: 0)
    Flush stdout and stderr on barrier

Network transport: UCX
  SHMEM_PROGRESS_INTERVAL    1000 (type: long, default: 1000)
    Polling interval for progress thread in microseconds (0 to disable)

On-node transport: Linux CMA
  SHMEM_CMA_PUT_MAX          8192 (type: size, default: 8192)
    Size below which to use CMA for puts
  SHMEM_CMA_GET_MAX          16384 (type: size, default: 16384)
    Size below which to use CMA for gets

Build information:
  Git Version           v1.5.3rc1-2-gb4c42c16 (HEAD)
  Configure Args        '--prefix=/home/runner/work/SOS/SOS/install/sos'
                        '--with-ucx=/home/runner/work/SOS/SOS/install/ucx'
                        '--with-cma' '--enable-error-checking' '--enable-profiling'
                        '--enable-pmi-simple' '--disable-fortran' '--with-hwloc=no'
  Build Date            Wed Sep 18 14:45:16 UTC 2024
  Build CC              gcc
  Build CFLAGS          -std=gnu11 -g -O2 -Wall -rdynamic -fvisibility=hidden

[0000] DEBUG: ../../src/init.c:376: shmem_internal_heap_postinit
[0000]        Thread level=MULTIPLE, Num. PEs=2
[0000]        Sym. heap=0x5555c0000000 len=537919488 -- data=0x555555558000 len=104
[0000] DEBUG: ../../src/init.c:457: shmem_internal_heap_postinit
[0000]        Affinity to 4 processor cores: { 0 1 2 3 }
[1726670878.702695] [fv-az565-923:63824:0]         parser.c:1626 UCX  WARN  unused env variable: UCX_INSTALL_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1726670878.702702] [fv-az565-923:63825:0]         parser.c:1626 UCX  WARN  unused env variable: UCX_INSTALL_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[0000] DEBUG: ../../src/transport_ucx.c:130: shmem_transport_init
[0000]        UCX thread mode 2, requested 2
[0001] DEBUG: ../../src/transport_ucx.c:130: shmem_transport_init
[0001]        UCX thread mode 2, requested 2
[0000] DEBUG: ../../src/init.c:483: shmem_internal_heap_postinit
[0000]        Local rank=0, Num. local=1, Shr. rank=0, Num. shr=2
[0001] DEBUG: ../../src/init.c:483: shmem_internal_heap_postinit
[0001]        Local rank=0, Num. local=1, Shr. rank=1, Num. shr=2
[0000] DEBUG: ../../src/shmem_team.c:139: shmem_internal_team_init
[0000]        SHMEM_TEAM_SHARED: start=0, stride=1, size=1
[0000] DEBUG: ../../src/shmem_team.c:167: shmem_internal_team_init
[0000]        SHMEMX_TEAM_NODE: start=0, stride=1, size=2
[0001] DEBUG: ../../src/shmem_team.c:139: shmem_internal_team_init
[0001]        SHMEM_TEAM_SHARED: start=1, stride=1, size=1
[0001] DEBUG: ../../src/shmem_team.c:167: shmem_internal_team_init
[0001]        SHMEMX_TEAM_NODE: start=0, stride=1, size=2
[fv-az565-923:63825:0:63825] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:  63825) ====
 0  /home/runner/work/SOS/SOS/install/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7ffff773e394]
 1  /home/runner/work/SOS/SOS/install/ucx/lib/libucs.so.0(+0x2a56f) [0x7ffff773e56f]
 2  /home/runner/work/SOS/SOS/install/ucx/lib/libucs.so.0(+0x2a856) [0x7ffff773e856]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7ffff76ff420]
 4  /home/runner/work/SOS/SOS/install/ucx/lib/libuct.so.0(uct_mm_ep_flush+0x11) [0x7ffff76bf8c1]
 5  /home/runner/work/SOS/SOS/install/ucx/lib/libucp.so.0(+0x39f61) [0x7ffff77a6f61]
 6  /home/runner/work/SOS/SOS/install/ucx/lib/libucp.so.0(ucp_ep_flush_internal+0x12d) [0x7ffff77a7f2d]
 7  /home/runner/work/SOS/SOS/install/ucx/lib/libucp.so.0(ucp_ep_close_nbx+0xde) [0x7ffff778b9be]
 8  /home/runner/work/SOS/SOS/install/ucx/lib/libucp.so.0(ucp_ep_close_nb+0x49) [0x7ffff778b8b9]
 9  /home/runner/work/SOS/SOS/build/src/.libs/libsma.so.0(+0x4b78f) [0x7ffff7ac578f]
10  /home/runner/work/SOS/SOS/build/src/.libs/libsma.so.0(+0x320d1) [0x7ffff7aac0d1]
11  /home/runner/work/SOS/SOS/build/modules/tests-sos/test/spec-example/.libs/shmem_ctx(+0x12f3) [0x5555555552f3]
12  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7ffff783e083]
13  /home/runner/work/SOS/SOS/build/modules/tests-sos/test/spec-example/.libs/shmem_ctx(_start+0x2e) [0x55555555532e]
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 63825 RUNNING AT fv-az565-923
=   EXIT CODE: 135
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
FAIL shmem_ctx (exit status: 135)

============================================================================
Testsuite summary for Sandia OpenSHMEM 1.5.3rc1
============================================================================
# TOTAL: 18
# PASS:  17
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See modules/tests-sos/test/spec-example/test-suite.log
Please report to https://github.com/Sandia-OpenSHMEM/SOS/issues
============================================================================
make[6]: Leaving directory '/home/runner/work/SOS/SOS/build/modules/tests-sos/test/spec-example'
make[5]: Leaving directory '/home/runner/work/SOS/SOS/build/modules/tests-sos/test/spec-example'
make[4]: Leaving directory '/home/runner/work/SOS/SOS/build/modules/tests-sos/test/spec-example'
make[3]: Leaving directory '/home/runner/work/SOS/SOS/build/modules/tests-sos/test'
make[2]: *** [Makefile:471: check-recursive] Error 1
make[1]: *** [Makefile:469: check-recursive] Error 1
make: *** [Makefile:562: check-recursive] Error 1
make[2]: Leaving directory '/home/runner/work/SOS/SOS/build/modules/tests-sos'
make[1]: Leaving directory '/home/runner/work/SOS/SOS/build/modules'
Error: Process completed with exit code 2.
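The log reports both "signal 7" and "exit status: 135". These agree under the common shell convention that a process killed by signal N is reported with exit status 128 + N, so 135 - 128 = 7. The sketch below decodes a status under that convention; the function names are hypothetical, and it assumes the launcher and test harness follow the 128 + N convention, which the log is consistent with. Signal numbering is platform-specific; on Linux x86-64, signal 7 is SIGBUS.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal number encoded in a shell-style exit status, or None."""
    if 128 < status < 256:
        return status - 128
    return None


def describe(status):
    """Describe an exit status, naming the signal if the platform knows it."""
    signum = signal_from_exit_status(status)
    if signum is None:
        return f"exit status {status}: no signal encoded"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The number is not a signal known to this platform.
        name = "unknown signal"
    return f"exit status {status}: signal {signum} ({name})"


# Example: describe(135) reports signal 7; the name depends on the platform.
```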

davidozog commented 1 month ago

UCX + XPMEM is also affected; see https://github.com/Sandia-OpenSHMEM/SOS/actions/runs/11333460511/job/31521960164?pr=1154

davidozog commented 1 month ago

UCX + pmi-simple (no CMA) can also be affected: https://github.com/Sandia-OpenSHMEM/SOS/actions/runs/11392002614/job/31697047000?pr=1156