chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

[Bug]: Segfault when running applications with LLVM 16 or 17 + Gasnet IBV #24779

Closed rdfriese closed 5 months ago

rdfriese commented 5 months ago

Summary of Problem

Executing a Chapel application (e.g. hello6-taskpar-dist.chpl) on more than 2 locales segfaults when Chapel is built with LLVM. The issue goes away when CHPL_LLVM=none.

Description: using the following chplconfig:

CHPL_TARGET_CPU=none
CHPL_COMM=gasnet
CHPL_COMM_SUBSTRATE=ibv
CHPL_LAUNCHER=slurm-gasnetrun_ibv
CHPL_LLVM=bundled

Chapel itself builds successfully and is able to compile Chapel applications. The issue arises when trying to execute in a distributed environment. In the case of Hello4 and Hello6 I can successfully execute on 2 locales, but increasing beyond that results in a segfault. I have attached the output from executing with GASNET_BACKTRACE=1 on 4 locales: hello6_nl4.txt

I'm currently experiencing this on version 2.0, but I also had it when I last tested on v1.31

Is this a blocking issue with no known work-arounds? No, I have no issues if I set CHPL_LLVM=none.

Steps to Reproduce

Starting from a clean clone (currently using 2.0), build Chapel with the above configuration.

Compile command: chpl hello6-taskpar-dist.chpl

Execution command:

./hello6-taskpar-dist -nl 4

Associated Test(s):

test/release/examples/hello6-taskpar-dist.chpl

Configuration Information

chpl --version
chpl version 2.0.0
  built with LLVM version 17.0.6
  available LLVM targets: amdgcn, r600, nvptx64, nvptx, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
$CHPL_HOME/util/printchplenv --anonymize
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none +
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet +
  CHPL_COMM_SUBSTRATE: ibv +
  CHPL_GASNET_SEGMENT: large
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-gasnetrun_ibv +
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled +
CHPL_AUX_FILESYS: none
 gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-16)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

clang = bundled version from third_party

bradcray commented 5 months ago

Thank you for taking the time to file this, and for all the detail, @rdfriese!

bradcray commented 5 months ago

@rdfriese: @stonea was able to get an account on a similar internal system and seems to be reproducing the behavior you're seeing (which is to say, seeing faults on small node counts with LLVM but not gcc). We're currently looking into how much debugging/checks we can turn on while preserving the breakage to try and narrow down where/why it's happening and will let you know what we find. Thanks again for reporting this.

rdfriese commented 5 months ago

@bradcray Great thanks for the update!

stonea commented 5 months ago

As Brad mentioned, I've been able to get a SEGV on one of our internal systems (so hopefully it's the same issue). Specifically, it seems to work fine if I use -nl 1 or -nl 2 but will fail with an error like this if I use -nl 3 or above:

*** Caught a fatal signal (proc 0): SIGSEGV(11)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** SSH-SPAWNER (o184i081:18913): read() returned 0 (EOF)
*** Caught a signal (proc 1): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)

Unfortunately, the cause of this bug is still a mystery. We'll continue to investigate but in the meantime I'll write down my observations:

bradcray commented 5 months ago

Yesterday, with Andy's data, the GASNet team was able to attribute this to #22055, an issue that they found last year but which had not been addressed since then, nor noticed in recent test runs. The specific issue is that using LLVM 16 or 17 with GASNet's everything or large segments results in the fault, though it's not clear whether the fault is in the GASNet code (which relies on type punning) or in LLVM (for being overly aggressive in its optimizations). [edit: And the fix/workaround is here: https://bitbucket.org/berkeleylab/gasnet/pull-requests/625 ]
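
For readers unfamiliar with why type punning and optimizer aggressiveness can interact badly, here is a minimal sketch in C. It is not GASNet's code and does not reproduce this bug; the function name float_bits is hypothetical and chosen only for illustration. The comment shows the general pattern that C's strict-aliasing rule leaves undefined, and the function shows a well-defined way to get the same bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustration only: assumes a 32-bit float, checked at compile time. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/*
 * Pattern to avoid (shown as a comment, not compiled):
 *
 *     uint32_t bits = *(const uint32_t *)f;   // f is a const float *
 *
 * Reading a float object through a uint32_t lvalue violates the
 * strict-aliasing rule, so the behavior is undefined. A compiler may
 * assume the two types never alias and reorder or drop accesses.
 */

/* Hypothetical helper: well-defined punning by copying the bytes.
 * Compilers typically turn this memcpy into a single register move. */
uint32_t float_bits(const float *f) {
    uint32_t u;
    memcpy(&u, f, sizeof u);
    return u;
}
```

On a platform with IEEE 754 single-precision floats, calling float_bits on a variable holding 1.0f returns 0x3f800000. Whether a given cast-based pun actually misbehaves depends on the compiler version and optimization level, which is consistent with a fault that only shows up with some LLVM releases.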

On our side, we've been discussing improvements to our testing procedure w.r.t. LLVM versions to avoid repeating mistakes like this (specifically, though LLVM 17 is our preferred version, we have not been great about updating most of our test configurations to use it).

bradcray commented 5 months ago

(I've started a slack thread with the GASNet team w.r.t. when their next release will be, to determine whether we should apply https://bitbucket.org/berkeleylab/gasnet/pull-requests/625 to our bundled version of GASNet or just wait for the next release)

lydia-duncan commented 4 months ago

Noting that with the help of the GASNet team and @jabraham17, this should now be resolved on main! Thanks for the report @rdfriese, please let us know if you encounter any further issues :)