rdfriese closed this issue 6 months ago
Thank you for taking the time to file this, and for all the detail, @rdfriese!
@rdfriese: @stonea was able to get an account on a similar internal system and seems to be reproducing the behavior you're seeing (which is to say, seeing faults on small node counts with LLVM but not gcc). We're currently looking into how much debugging/checks we can turn on while preserving the breakage to try and narrow down where/why it's happening and will let you know what we find. Thanks again for reporting this.
@bradcray Great thanks for the update!
As Brad mentioned, I've been able to get a SEGV on one of our internal systems (so hopefully it's the same issue). Specifically, it seems to work fine if I use -nl 1 or -nl 2 but will fail out with an error like this if I use -nl 3 or above:
*** Caught a fatal signal (proc 0): SIGSEGV(11)
*** NOTICE (proc 0): Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** SSH-SPAWNER (o184i081:18913): read() returned 0 (EOF)
*** Caught a signal (proc 1): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)
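To pin down where the failure threshold sits, the manual -nl runs above could be scripted. The following is a minimal sketch, not something from the Chapel or GASNet tooling: the helper name `first_failing_nl` and the per-run log file names are made up for illustration. It runs a binary at increasing locale counts with GASNET_BACKTRACE=1 set, as the GASNet notice suggests, and prints the first count that exits non-zero.

```shell
# Sketch only: `first_failing_nl` is a hypothetical helper, not part of Chapel or GASNet.
# Usage: first_failing_nl BINARY MAX_NL [LOG_DIR]
# Prints the smallest locale count (1..MAX_NL) at which BINARY exits non-zero.
first_failing_nl() {
  ffn_binary="$1"
  ffn_max="$2"
  ffn_logdir="${3:-.}"
  ffn_nl=1
  while [ "$ffn_nl" -le "$ffn_max" ]; do
    # GASNET_BACKTRACE=1 is the setting the GASNet notice asks for.
    if ! GASNET_BACKTRACE=1 "$ffn_binary" -nl "$ffn_nl" \
        > "$ffn_logdir/nl${ffn_nl}.log" 2>&1; then
      echo "$ffn_nl"
      return 0
    fi
    ffn_nl=$((ffn_nl + 1))
  done
  return 1  # no failure seen up to MAX_NL
}
```

For example, `first_failing_nl ./hello6-taskpar-dist 4` would be expected to print 3 on a system showing the behavior described here, with each run's output (including any backtrace) saved in nl1.log, nl2.log, and so on.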
Unfortunately, the cause of this bug is still a mystery. We'll continue to investigate but in the meantime I'll write down my observations:
I've found it necessary to run make clobber under third-party/gasnet in order to ensure I get a "clean" build. (I might want to rerun this one to confirm.) At one point I built Chapel using LLVM 15 and then rebuilt it to use LLVM 17, and despite this I was unable to reproduce the issue. Once I did make clobber for gasnet and rebuilt, I started reproing again.
I was unable to reproduce if I built GASNet with CHPL_GASNET_MORE_CFG_OPTIONS=--enable-debug (and Chapel with DEBUG=1 OPTIMIZE=0).
I also tried using the C backend with CHPL_TARGET_COMPILER=clang (so that GASNet would be built with our bundled clang but we'd still be using the C backend) and was unable to reproduce the issue.
Not sure how useful it is, but I can get more logging output if I build GASNet with CHPL_GASNET_MORE_CFG_OPTIONS set to one of --enable-{trace,stats,debug-malloc} and set GASNET_TRACEFILE, GASNET_STATSFILE, or GASNET_MALLOCFILE before running. (I can supply these logs if they'd be of interest.)
I'm no longer able to reproduce if I use CHPL_LLVM=none (C backend) or CHPL_GASNET_SEGMENT=fast.
Some other things I tried that still caused it to be repro'able: --no-cache-remote, CHPL_SEGMENT=everything
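Since stale GASNet build products masked the problem when switching LLVM versions, the clean-rebuild step from the first observation could be wrapped up roughly as follows. This is a sketch under assumptions: `rebuild_gasnet_clean` is a made-up helper name, and it assumes CHPL_HOME (or the first argument) points at the Chapel source tree and that a plain top-level make rebuilds Chapel together with the bundled GASNet.

```shell
# Sketch only: `rebuild_gasnet_clean` is a hypothetical helper name.
# Usage: rebuild_gasnet_clean [CHPL_HOME]
rebuild_gasnet_clean() {
  rgc_home="${1:-${CHPL_HOME:-}}"
  if [ ! -d "$rgc_home/third-party/gasnet" ]; then
    echo "no third-party/gasnet under '$rgc_home'" >&2
    return 1
  fi
  # Discard GASNet build products left over from the previous LLVM/compiler setup.
  (cd "$rgc_home/third-party/gasnet" && make clobber) || return 1
  # Rebuild Chapel (and with it the bundled GASNet) under the current CHPL_* settings.
  (cd "$rgc_home" && make)
}
```

Running `rebuild_gasnet_clean` after every change to the LLVM or GASNet-related settings should keep results from different configurations from bleeding into each other.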
Yesterday, with Andy's data, the GASNet team was able to attribute this to #22055, an issue that they found last year but which had not been addressed since then, nor noticed in recent test runs. The specific issue is that using LLVM 16 or 17 with GASNet's everything or large segments results in the fault, though it's not clear whether the fault is in the GASNet code (which relies on type punning) or in LLVM (for being overly aggressive in its optimizations). [edit: And the fix/workaround is here: https://bitbucket.org/berkeleylab/gasnet/pull-requests/625 ]
On our side, we've been discussing improvements to our testing procedure w.r.t. LLVM versions to avoid repeating mistakes like this (specifically, though LLVM 17 is our preferred version, we have not been great about updating most of our test configurations to use it).
(I've started a slack thread with the GASNet team w.r.t. when their next release will be, to determine whether we should apply https://bitbucket.org/berkeleylab/gasnet/pull-requests/625 to our bundled version of GASNet or just wait for the next release)
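For builds that predate the GASNet fix, the two configurations reported in this thread as not reproducing the fault can be written as environment settings. This is only a sketch of those two options as described above, not an officially recommended configuration; pick one rather than both.

```shell
# Option 1 (from the original report): skip the LLVM backend and use the C backend.
export CHPL_LLVM=none

# Option 2 (from the investigation): keep LLVM, but use GASNet's "fast" segment
# rather than "everything" or "large". Uncomment to use this one instead.
# export CHPL_GASNET_SEGMENT=fast
```

Either setting only takes effect after Chapel is rebuilt and the program is recompiled under it; $CHPL_HOME/util/printchplenv can be used to confirm which values the build picked up.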
Noting that with the help of the GASNet team and @jabraham17, this should now be resolved on main! Thanks for the report @rdfriese, please let us know if you encounter any further issues :)
Summary of Problem
Trying to execute a Chapel application (e.g. hello6-taskpar-dist.chpl) fails when using more than 2 locales: it segfaults if using LLVM. The issue goes away when CHPL_LLVM=none.
Description: using the following chplconfig:
Successfully compiles Chapel itself, and is able to successfully compile Chapel applications. The issue arises when trying to execute in a distributed environment. In the case of Hello4 and Hello6 I can successfully execute on 2 locales, but increasing beyond that results in a segfault. I have attached the output from executing with GASNET_BACKTRACE=1 and 4 locales: hello6_nl4.txt
I'm currently experiencing this on version 2.0, but I also had it when I last tested on v1.31.
Is this a blocking issue with no known work-arounds? I have no issues if I set CHPL_LLVM=none.
Steps to Reproduce
Starting from a clean clone (currently using 2.0), build Chapel with the above configuration.
Compile command:
chpl hello6-taskpar-dist.chpl
Execution command:
./hello6-taskpar-dist -nl 4
Associated Test(s):
test/release/examples/hello6-taskpar-dist.chpl
Configuration Information
clang = bundled version from third_party