chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

ra test crashes with CHPL_LLVM=system on macOS/aarch64 #17825

Closed PHHargrove closed 2 years ago

PHHargrove commented 3 years ago

I am working from the master branch at 7e877bf8f9. This is on an Apple M1 (aarch64) Mac Mini running macOS 11.4 (Build version 20F71) and Xcode 12.5 (Build version 12E262) from which clang reports Apple clang version 12.0.5 (clang-1205.0.22.9)

My environment has been set up with PATH=$PATH:/opt/homebrew/opt/llvm@11/bin to support CHPL_LLVM=system via Homebrew's llvm@11.

Running configure reports the following, reflecting my manual environment settings for CHPL_{COMM,COMM_SUBSTRATE,TASKS,LLVM}:

  Currently selected Chapel configuration:

CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: clang
CHPL_TARGET_ARCH: arm64
CHPL_TARGET_CPU: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: smp *
  CHPL_GASNET_SEGMENT: fast
CHPL_TASKS: fifo *
CHPL_LAUNCHER: smp
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: none
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_AUX_FILESYS: none

With this build I get a SEGV running ra and ra-atomics tests.
I am running with CHPL_CORES_PER_LOCALE=1 at runtime (no clue if this could be related or not). When I add CHPL_COMM_DEBUG=1 to my environment and start over, I get the following when running those tests:

Assertion failed: (raddr != 0), function make_entry, file chpl-cache.c, line 2174.

I've also tried the same system with CHPL_COMM_SUBSTRATE=udp and/or CHPL_LLVM=none (all 4 combinations).
The failures do not occur for either CHPL_LLVM=none case, but do occur for both CHPL_LLVM=system cases.

I've run tests on a similarly configured x86_64 system (older macOS 11.3, but same Apple clang version and also Homebrew's llvm@11), where I am not setting CHPL_TASKS=fifo. Here I do NOT see the errors. So, it appears likely to be specific to LLVM-based code generation on aarch64. I do not currently have any Linux/aarch64 testing of Chapel (let alone with CHPL_LLVM=system).

The following are backtraces for the failing CHPL_COMM_DEBUG=1 builds of ra and ra-atomics, respectively:

[0] (lldb) bt all
[0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[0]   * frame #0: 0x00000001977eab94 libsystem_kernel.dylib`__ulock_wait + 8
[0]     frame #1: 0x00000001978258ec libsystem_pthread.dylib`_pthread_join + 456
[0]     frame #2: 0x0000000100bbe2dc ra_real`chpl_task_callMain(chpl_main=(ra_real`chpl_executable_init at chpl-init.c:300)) at tasks-fifo.c:454:8
[0]     frame #3: 0x0000000100ba77b8 ra_real`main(argc=132, argv=0x000000016f3aeb78) at main.c:33:3
[0]     frame #4: 0x0000000197841450 libdyld.dylib`start + 4
[0]   thread #2
[0]     frame #0: 0x00000001977e8edc libsystem_kernel.dylib`swtch_pri + 8
[0]     frame #1: 0x000000019782068c libsystem_pthread.dylib`cthread_yield + 20
[0]     frame #2: 0x0000000100bc0690 ra_real`chpl_thread_yield at threads-pthreads.c:317:3
[0]     frame #3: 0x0000000100bbf23c ra_real`chpl_task_yield at tasks-fifo.c:802:3
[0]     frame #4: 0x0000000100bcc0e8 ra_real`polling(x=0x0000000000000000) at comm-gasnet.c:752:5
[0]     frame #5: 0x0000000100bbe5c8 ra_real`comm_task_wrapper(arg=0x0000000000000000) at tasks-fifo.c:532:3
[0]     frame #6: 0x0000000197823878 libsystem_pthread.dylib`_pthread_start + 320
[0]   thread #3
[0] (lldb) bt all
[0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[0]   * frame #0: 0x00000001977eab94 libsystem_kernel.dylib`__ulock_wait + 8
[0]     frame #1: 0x00000001978258ec libsystem_pthread.dylib`_pthread_join + 456
[0]     frame #2: 0x000000010020c05c ra-atomics_real`chpl_task_callMain(chpl_main=(ra-atomics_real`chpl_executable_init at chpl-init.c:300)) at tasks-fifo.c:454:8
[0]     frame #3: 0x00000001001f5538 ra-atomics_real`main(argc=133, argv=0x000000016fd1eb50) at main.c:33:3
[0]     frame #4: 0x0000000197841450 libdyld.dylib`start + 4
[0]   thread #2
[0]     frame #0: 0x00000001977f31ec libsystem_kernel.dylib`__select + 8
[0]     frame #1: 0x00000001003a6ed0 ra-atomics_real`::myselect(n=4, readfds=0x0000000101603974, writefds=0x0000000000000000, exceptfds=0x0000000000000000, timeout=0x0000000101603960) at sockutil.cpp:589:16
[0]     frame #2: 0x00000001003a6dd8 ra-atomics_real`inputWaiting(s=3, dothrow=false) at sockutil.cpp:435:16
[0]     frame #3: 0x00000001003a2bb8 ra-atomics_real`::AMUDP_SPMDHandleControlTraffic(controlMessagesServiced=0x0000000000000000) at amudp_spmd.cpp:1251:5
[0]     frame #4: 0x0000000100396058 ra-atomics_real`::AM_Poll(eb=0x000000014e8040b0) at amudp_reqrep.cpp:882:18
[0]     frame #5: 0x0000000100245dfc ra-atomics_real`gasnetc_AMPoll(_gasneti_threadinfo_farg=0x000000014e904400) at gasnet_core.c:619:5
[0]     frame #6: 0x0000000100212bb4 ra-atomics_real`_gasneti_AMPoll(_gasneti_threadinfo_farg=0x000000014e904400) at gasnet_help.h:1290:18
[0]     frame #7: 0x00000001002121e0 ra-atomics_real`_gasnet_AMPoll(_gasneti_threadinfo_farg=0x000000014e904400) at gasnet_help.h:1423:12
[0]     frame #8: 0x0000000100219e90 ra-atomics_real`am_poll_try at comm-gasnet.c:743:12
[0]     frame #9: 0x0000000100219e64 ra-atomics_real`polling(x=0x0000000000000000) at comm-gasnet.c:751:5
[0]     frame #10: 0x000000010020c348 ra-atomics_real`comm_task_wrapper(arg=0x0000000000000000) at tasks-fifo.c:532:3
[0]     frame #11: 0x0000000197823878 libsystem_pthread.dylib`_pthread_start + 320
[0]   thread #3
[0]     frame #0: 0x00000001977e8edc libsystem_kernel.dylib`swtch_pri + 8
[0]     frame #1: 0x000000019782068c libsystem_pthread.dylib`cthread_yield + 20

Neither looks terribly informative to me as the asserting thread's info appears to be missing.

bradcray commented 3 years ago

Tagging @mppf on this due to the assertion coming from the remote cache.

mppf commented 3 years ago

@PHHargrove - I've tried a few different variations on the configuration but so far I haven't been able to reproduce on an ARM linux64 system. It is possible that there is indeed a problem with LLVM code generation on ARM but if that's the case then I'm not seeing it. We have in the past occasionally run into MacOS specific issues with communication (because the ASLR strategy is pretty different) and so I am worried that this issue is specific to the M1 and/or MacOS, and I don't currently have one of those I can debug on. (Please do email me if it's possible for me to log in to the one you are testing on).

bradcray commented 3 years ago

@mppf: Given the nature of the assertion, would compiling without --cache-remote be interesting to have Paul try to see if it passes? (And then, if so, @PHHargrove, would you be interested in trying to turn that off by compiling with --no-cache-remote, or would you rather we just sort through it ourselves? That would be completely reasonable.)

w.r.t. CHPL_CORES_PER_LOCALE=1 — is that the right variable name? It isn't looking familiar to me, and I think you mentioned CHPL_THREADS_PER_LOCALE when we were chatting last night?

And then (once we get that variable name correct): Michael, did you set that variable in your ARM linux64 tests as well? I worry that limiting to one task/thread could potentially be a factor here (mostly coming from a superstitious "Chapel doesn't do very well with just 1 thread" perspective).

PHHargrove commented 3 years ago

The system in question sits in my home and so is not available for outside access (even to myself).
If any member of your team has an account with the GCC CFarm, they have a comparable system that might be used.

I am fine testing --no-cache-remote at some point, but have some important work to be completed prior to the Perlmutter dedication on Thu. That has top priority right now.

I did mention CHPL_THREADS_PER_LOCALE in Slack, and CHPL_CORES_PER_LOCALE here. Neither is a typo. The former was causing problems (either super slow or maybe hung), while the latter is the one I find I must set on Spock to prevent your slurm-based launcher from asking for too many cores in the srun command (I hope to file an issue for that eventually, but Shasta is a moving target as you know). The CORES variable might not be doing anything at all here, if it is specific to launch logic not in use here.

PHHargrove commented 3 years ago

I have retested with master now at 7bb0e8cf.

Removing any *_PER_LOCALE setting from my environment (or substituting CHPL_RT_OVERSUBSCRIBED=yes), I see ra and ra-atomics run for a looong time before our CI kills them at 12 minutes. So, broken in a different manner than previously reported.

I tried building ra with --no-cache-remote and ra-atomics without, in the same run of our CI.
In this case ra completed nearly instantaneously, and ra-atomics ran for 12min before it was killed.

PHHargrove commented 3 years ago

Having identified lack of linux/aarch64 in my own coverage of Chapel, I am in the midst of getting such added.
Along the way I have confirmed that I cannot reproduce this issue on that system.

PHHargrove commented 3 years ago

I now have what looks like a useful backtrace from ./ra -nl 3 --n=10.
While this one lacks arguments, it does at least include the right thread.
Note that the top 4 frames are the GASNet logic to obtain the backtrace.

Assertion failed: (raddr != 0), function make_entry, file chpl-cache.c, line 2174.
*** Caught a fatal signal (proc 0): SIGABRT(6)
[0] Invoking EXECINFO for backtrace...
[0] 0: 0   ra_real                             0x0000000102dcacdc gasneti_bt_execinfo + 44 
[0] 1: 1   ra_real                             0x0000000102dc5d58 gasneti_print_backtrace + 972 
[0] 2: 2   ra_real                             0x0000000102dc6748 _gasneti_print_backtrace_ifenabled + 148 
[0] 3: 3   ra_real                             0x0000000102ef20e8 gasneti_defaultSignalHandler + 304 
[0] 4: 4   libsystem_platform.dylib            0x000000019786ec44 _sigtramp + 56 
[0] 5: 5   libsystem_pthread.dylib             0x000000019782343c pthread_kill + 292 
[0] 6: 6   libsystem_c.dylib                   0x000000019776b460 abort + 104 
[0] 7: 7   libsystem_c.dylib                   0x000000019776a8f4 err + 0 
[0] 8: 8   ra_real                             0x0000000102d6f060 make_entry + 80 
[0] 9: 9   ra_real                             0x0000000102d6e938 get_reserved_entry + 644 
[0] 10: 10  ra_real                             0x0000000102d6e36c cache_put_in_page + 208 
[0] 11: 11  ra_real                             0x0000000102d6a874 cache_put + 444 
[0] 12: 12  ra_real                             0x0000000102d6a5a0 chpl_cache_comm_put + 568 
[0] 13: 13  ra_real                             0x0000000102d21138 on_fn_chpl207 + 1744 
[0] 14: 14  ra_real                             0x0000000102d739d0 chpl_std_module_init + 88 
[0] 15: 15  ra_real                             0x0000000102d73a08 chpl_executable_init + 28 
[0] 16: 16  ra_real                             0x0000000102d794c4 do_callMain + 56 
[0] 17: 17  libsystem_pthread.dylib             0x0000000197823878 _pthread_start + 320 
[0] 18: 18  libsystem_pthread.dylib             0x000000019781e5e0 thread_start + 8 

bradcray commented 3 years ago

The make_entry call in ra_real at frame 8 (as well as other nearby frames) is from the code supporting --cache-remote, which corroborates the previous guesses and your --no-cache-remote run (though I have no good guess as to why ra-atomics ran off the rails...).

mppf commented 3 years ago

Yeah. And this failure seems weird to me because I think it is basically saying it was about to do a PUT to address 0 (i.e. NULL). Of course something could have gotten corrupted somewhere to cause that.

aconsroe-hpe commented 2 years ago

Three things I'm noting that may be related:

1) We recently identified issues in the comm cache during startup (See #18800 for pt1 of that fix) and the frame chpl_std_module_init catches my eye b/c it is the same bad code path that we were seeing becoming an issue (and is before we call chpl_gen_main)
2) In trying to reproduce this, I used CHPL_TASKS=fifo with CHPL_RT_NUM_THREADS_PER_LOCALE=1 and got a hang. This is expected (see #18302) and while I'm not sure I've followed where CHPL_CORES_PER_LOCALE comes from it might be worth double checking
3) And as a combination between the above, I had the most success reproducing our failure in 1) above by setting CHPL_RT_NUM_THREADS_PER_LOCALE=1

ronawho commented 2 years ago

[edit -- whoops, in my original comment I ran without GASNET_ROUTE_OUTPUT/GASNET_*IP, which led to some extra failures, updated my test results to include those now]

We now have an m1 mac we can run on and I can reproduce this.

I can reproduce the failures for ra.chpl and ra-atomics.chpl. They do not go away for me with --no-cache-remote, but they do with --target-compiler=clang. Some of the cache-remote specific tests do fail, so I suspect we may have both llvm and cache-remote bugs here.

I'm just using a standard local gasnet-udp with GASNET_SPAWNFN=L to run. Here's my config:

./util/printchplenv --all --anonymize
CHPL_HOST_PLATFORM: darwin
CHPL_HOST_COMPILER: clang
  CHPL_HOST_CC: clang
  CHPL_HOST_CXX: clang++
CHPL_HOST_ARCH: arm64
CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: llvm
  CHPL_TARGET_CC: /opt/homebrew/Cellar/llvm@11/11.1.0_4/bin/clang
  CHPL_TARGET_CXX: /opt/homebrew/Cellar/llvm@11/11.1.0_4/bin/clang++
CHPL_TARGET_ARCH: arm64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: flat
  CHPL_GPU_CODEGEN: none
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: fifo
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_HOST_MEM: cstdlib
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: none
CHPL_RE2: bundled
CHPL_LLVM: system
  CHPL_LLVM_CONFIG: /opt/homebrew/opt/llvm@11/bin/llvm-config
CHPL_AUX_FILESYS: none
CHPL_LIB_PIC: none
CHPL_SANITIZE: none
CHPL_SANITIZE_EXE: none
printenv | grep GASNET_
GASNET_SPAWNFN=L
GASNET_ROUTE_OUTPUT=0
GASNET_MASTERIP=127.0.0.1
GASNET_WORKERIP=127.0.0.0

Here's the full list of failures for $CHPL_HOME/util/test/paratest.server -nodepara 4 -dirs release/examples -dirs runtime/configMatters -compopts --no-cache-remote:

```
[Test Summary - 220208.101057]
[Error matching program output for release/examples/benchmarks/hpcc/variants/ra-cleanloop]
[Error matching program output for release/examples/benchmarks/hpcc/ra-atomics]
[Error matching program output for release/examples/benchmarks/hpcc/ra]
[Error matching program output for release/examples/primers/fileIO]
[Error matching program output for runtime/configMatters/comm/cache-remote/acqrel-sync-subtask]
[Error matching program output for runtime/configMatters/comm/cache-remote/acqrel-sync]
[Error matching program output for runtime/configMatters/comm/cache-remote/acqrel]
[Error matching program output for runtime/configMatters/comm/cache-remote/atomic]
[Error matching program output for runtime/configMatters/comm/cache-remote/atomic3]
[Error matching program output for runtime/configMatters/comm/cache-remote/bigparread]
[Error matching program output for runtime/configMatters/comm/cache-remote/cobegin]
[Error matching program output for runtime/configMatters/comm/cache-remote/cobegin3]
[Error matching program output for runtime/configMatters/comm/cache-remote/cobegin4]
[Error matching program output for runtime/configMatters/comm/cache-remote/coforall]
[Error matching program output for runtime/configMatters/comm/cache-remote/coforall3]
[Error matching program output for runtime/configMatters/comm/cache-remote/coforall4]
[Error matching program output for runtime/configMatters/comm/cache-remote/forall]
[Error matching program output for runtime/configMatters/comm/cache-remote/forall2]
[Error matching program output for runtime/configMatters/comm/cache-remote/forall3]
[Error matching program output for runtime/configMatters/comm/cache-remote/miniUnorderedCopyStress]
[Error matching program output for runtime/configMatters/comm/unordered/unorderedAtomicsStress (execopts: 5)]
[Error matching program output for runtime/configMatters/comm/oversubscribedArrayAlloc]
[Error matching program output for runtime/configMatters/comm/remote-taskspawn-stress]
[Summary: #Successes = 251 | #Failures = 23 | #Futures = 0]
[END]
```

cc @mppf and @daviditen I don't know how to dig into this myself, but am happy to help. Ping me if you need instructions for getting onto our m1 mac.

ronawho commented 2 years ago

For reasons I don't understand yet, this seems to go away when using CHPL_MEM=cstdlib instead of jemalloc. I'm going to change our default and open a followup issue to dig into what's going wrong.

ronawho commented 2 years ago

I believe this is "resolved" (worked around) in https://github.com/chapel-lang/chapel/pull/20160 by switching which allocator we use. I've opened https://github.com/Cray/chapel-private/issues/3532 to track using the more optimized allocator in the future, but assuming this resolves the issue for @PHHargrove I think we can/should close this issue.

PHHargrove commented 2 years ago

@ronawho I will watch our impacted nightly testers and hopefully have feedback for you in a day or two.

PHHargrove commented 2 years ago

@ronawho Bad news:

Now all Chapel applications are failing to run on our M1 mac, with the following message:

error: Your CHPL_MEM setting doesn't support the registered heap required by your CHPL_COMM setting. You'll need to change one of these configurations.
error: Your CHPL_MEM setting doesn't support the registered heap required by your CHPL_COMM setting. You'll need to change one of these configurations.
error: Your CHPL_MEM setting doesn't support the registered heap required by your CHPL_COMM setting. You'll need to change one of these configurations.

The relevant settings are CHPL_MEM=cstdlib and CHPL_COMM=gasnet, plus CHPL_COMM_SUBSTRATE=smp and CHPL_GASNET_SEGMENT=fast. I suspect the issue this message is reporting is that CHPL_MEM=cstdlib doesn't confine its allocation to the GASNet segment. So, this change of memory allocator is not a solution for CHPL_COMM_SUBSTRATE=smp (which requires CHPL_GASNET_SEGMENT=fast) but might be tenable for udp and mpi (which support CHPL_GASNET_SEGMENT=everything) if this error message is sensitive to the segment setting.

ronawho commented 2 years ago

Ah, of course. My testing was with gasnet-udp segment everything using local spawning (sorry, I should have actually looked at your config, instead of using udp out of habit.) And yeah, CHPL_MEM=cstdlib is a direct passthrough to the system allocator and doesn't restrict to the gasnet segment, which is why it doesn't work in this configuration.

For the moment I would say segment fast/large are not supported until we get an allocator that supports them working.
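
To make the allocator constraint discussed above concrete, here is a minimal sketch in C. It is not Chapel or GASNet code: `segment_t`, `segment_init`, `segment_alloc`, and `segment_contains` are hypothetical names that stand in for a fixed, registered communication segment and an allocator confined to it. The point it illustrates is only that memory from a segment-aware allocator lies inside the segment, while memory from the system `malloc` (what a pass-through allocator would hand back) generally does not.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANON on glibc under strict -std modes */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical stand-in for a fixed, registered segment. */
typedef struct {
    char  *base;  /* start of the segment */
    size_t size;  /* total bytes in the segment */
    size_t used;  /* bump-allocator offset */
} segment_t;

/* Reserve `size` bytes of anonymous memory to act as the segment. */
int segment_init(segment_t *seg, size_t size) {
    assert(seg != NULL && size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED) return -1;
    seg->base = p;
    seg->size = size;
    seg->used = 0;
    return 0;
}

/* Bump-allocate n bytes (rounded up to 16) from inside the segment.
 * Returns NULL when the request does not fit. */
void *segment_alloc(segment_t *seg, size_t n) {
    size_t need = (n + 15u) & ~(size_t)15u;
    if (need < n || need > seg->size - seg->used) return NULL;
    void *p = seg->base + seg->used;
    seg->used += need;
    return p;
}

/* True if ptr points into the segment. */
bool segment_contains(const segment_t *seg, const void *ptr) {
    uintptr_t a  = (uintptr_t)ptr;
    uintptr_t lo = (uintptr_t)seg->base;
    return a >= lo && a - lo < seg->size;
}
```

With a segment set up by `segment_init`, `segment_contains` returns true for a pointer obtained from `segment_alloc` and false for one obtained from `malloc`. That second case is, presumably, what the "registered heap" error is guarding against when the segment setting is fast or large.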

PHHargrove commented 2 years ago

> For the moment I would say segment fast/large are not supported until we get an allocator that supports them working.

That is an improvement over the previous state.

My initial comment in this issue reported that the problem was also present with udp. However, that was based on manual testing because our automated testing on the Apple M1 hardware does not include Chapel over udp. So, I am going to need more time to confirm that case is actually fixed.

ronawho commented 2 years ago

FYI I created some interim "documentation" about the status of Chapel on arm based macs in https://github.com/chapel-lang/chapel/issues/20183. Part of that includes saying that gasnet segment large/fast are not supported. Assuming you find that gasnet-udp segment everything works for you, I'd be inclined to close this and we'll track progress in our other issues (and if easy I'd suggest changing your testing from the smp substrate to udp)

PHHargrove commented 2 years ago

@ronawho Your plan sounds fine to me.

Hopefully I'll have udp/everything results with Chapel on Sunday morning, though there is a chance I won't check on the outcome until Monday.

PHHargrove commented 2 years ago

Unfortunately, the run for the M1+udp on Sunday didn't pick up my change to add Chapel, due to an error on my part. I've fixed that problem, and the next shot for this configuration will be Tue morning.

PHHargrove commented 2 years ago

🤦‍♂️ I still managed to botch the settings. Next shot is on Thu morning.

PHHargrove commented 2 years ago

I am afraid I still see this problem with UDP/everything. The following configuration

Warning: CHPL_LLVM=system is not compatible with this platform
CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: arm64
CHPL_TARGET_CPU: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: udp *
  CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: fifo *
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: cstdlib
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: none
CHPL_HWLOC: none
CHPL_RE2: none
CHPL_LLVM: system *
CHPL_AUX_FILESYS: none

in a debug build yields the original failure:

Assertion failed: (raddr != 0), function make_entry, file chpl-cache.c, line 2236.
*** Caught a fatal signal (proc 0): SIGABRT(6)
[0] Invoking LLDB for backtrace...
[0] /usr/bin/lldb -p 73527 -o 'bt all' -o quit
[0] (lldb) process attach --pid 73527
[0] Process 73527 stopped
[0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[0]     frame #0: 0x000000019a9aeb94 libsystem_kernel.dylib`__ulock_wait + 8
[0] libsystem_kernel.dylib`__ulock_wait:
[0] ->  0x19a9aeb94 <+8>:  b.lo   0x19a9aebb4               ; <+40>
[0]     0x19a9aeb98 <+12>: pacibsp 
[0]     0x19a9aeb9c <+16>: stp    x29, x30, [sp, #-0x10]!
[0]     0x19a9aeba0 <+20>: mov    x29, sp
[0] Target 0: (ra_real) stopped.
[0] 
[0] Executable module set to "/[...]/ra_real".
[0] Architecture set to: arm64e-apple-macosx-.
[0] (lldb) bt all
[0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[0]   * frame #0: 0x000000019a9aeb94 libsystem_kernel.dylib`__ulock_wait + 8
[0]     frame #1: 0x000000019a9e98ec libsystem_pthread.dylib`_pthread_join + 456
[0]     frame #2: 0x0000000100cc1060 ra_real`chpl_task_callMain(chpl_main=(ra_real`chpl_executable_init at chpl-init.c:356)) at tasks-fifo.c:375:8
[0]     frame #3: 0x0000000100ca9190 ra_real`main(argc=134, argv=0x000000016f2a6b78) at main.c:33:3
[0]     frame #4: 0x000000019aa05450 libdyld.dylib`start + 4
[0]   thread #2
[0]     frame #0: 0x000000019a9acedc libsystem_kernel.dylib`swtch_pri + 8
[0]     frame #1: 0x000000019a9e468c libsystem_pthread.dylib`cthread_yield + 20
[0]     frame #2: 0x0000000100cc2a7c ra_real`chpl_thread_yield at threads-pthreads.c:317:3
[0]     frame #3: 0x0000000100cc19a8 ra_real`chpl_task_yield at tasks-fifo.c:586:3
[0]     frame #4: 0x0000000100ccf1b8 ra_real`polling(x=0x0000000000000000) at comm-gasnet.c:752:5
[0]     frame #5: 0x0000000100cc1380 ra_real`comm_task_wrapper(arg=0x0000000000000000) at tasks-fifo.c:446:3
[0]     frame #6: 0x000000019a9e7878 libsystem_pthread.dylib`_pthread_start + 320
[0]   thread #3

And a non-debug build just dies with a SEGV.

ronawho commented 2 years ago

Can you try with the c-backend instead of llvm? Along with changing the default allocator, we changed the backend for m1 in https://github.com/chapel-lang/chapel/pull/20155. (Sorry, didn't realize you were explicitly setting the backend)

PHHargrove commented 2 years ago

I will retry with the c-backend, but that is a case that had not failed before (which is why "with CHPL_LLVM=system" appears in this issue's title).

From my initial report:

> I've also tried the same system with CHPL_COMM_SUBSTRATE=udp and/or CHPL_LLVM=none (all 4 combinations). The failures do not occur for either CHPL_LLVM=none case, but do occur for both CHPL_LLVM=system cases.

So, it now sounds like instead of having 2 of the 4 combinations (the two using CHPL_LLVM=none) available, your recent change has actually reduced the set of working configurations to just 1 (having invalidated use of smp/fast w/ the c-backend due to the allocator choice).

ronawho commented 2 years ago

> I will retry with the c-backend, but that is a case that had not failed before (which is why "with CHPL_LLVM=system" appears in this issue's title).

Sorry, I did a poor job digesting the original issue and was commenting mostly on my experience where I was still seeing similar failures with the c-backend prior to switching the memory allocator to cstdlib.

> your recent change has actually reduced the set of working configurations to just 1.

Yeah, that's right

PHHargrove commented 2 years ago

c-backend results with udp/everything should appear on Sunday morning

PHHargrove commented 2 years ago

As hoped, {c-backend, udp, everything} passed the ra and ra-atomics tests on our Apple M1 systems.

ronawho commented 2 years ago

Closing, since the original issue is resolved. I'm hoping to expand the supported configurations in the near-ish future to enable support for gasnet-smp, but don't have an exact timeframe yet.

ronawho commented 1 year ago

Finally getting back around to look for a real solution to this. It looks like it's actually from guard pages on fifo and we disable guard pages on macs when using cstdlib memory, so switching to cstdlib just happened to work around the issue.

Guard pages under fifo use mprotect with PROT_NONE to enable guard pages and PROT_EXEC | PROT_WRITE | PROT_READ to disable them, but it turns out that on arm macs since macOS 11.2 you can't request PROT_EXEC and PROT_WRITE together.

From newer mac mmap man pages:

> When the hardened runtime is enabled (See the links in the SEE ALSO section), the protections cannot be both PROT_WRITE and PROT_EXEC without also having the flag MAP_JIT and the process possessing the com.apple.security.cs.allow-jit entitlement

On top of that, the mprotect call we use to disable guard pages ignored errors: https://github.com/chapel-lang/chapel/blob/7ec625578f5fff29b8782e32b4c4b4daa5aef4e8/runtime/src/threads/pthreads/threads-pthreads.c#L182-L197

So we were just freeing memory with guard pages still in place, and later use of that memory would sporadically hit the guard page and segfault. Disabling the cache or changing other things would change when we might hit that, but was ultimately a red herring.

We only saw this for gasnet, because non-0 locales have chpl_task_callMain finish before user startup so it was that guard page causing issues. For comm=none (or locale 0 in multi-locale) no threads are destroyed before program termination so there were no invalid guard pages to hit.

For fixing, I'm not actually sure why we're setting PROT_EXEC. We're not creating pages with it originally set, so I think we can just avoid setting it.
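
The fix proposed above can be sketched roughly as follows. This is a minimal illustration in C under stated assumptions, not the Chapel runtime's actual code: the helper names `map_pages`, `guard_enable`, and `guard_disable` are hypothetical. It restores only PROT_READ | PROT_WRITE when removing a guard page, and it reports an mprotect failure to the caller instead of ignoring it.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANON on glibc under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helpers; names are illustrative only. */

/* Map npages of anonymous read/write memory; NULL on failure. */
void *map_pages(size_t npages) {
    assert(npages > 0);
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make the page inaccessible so that touching it faults immediately. */
int guard_enable(void *page, size_t len) {
    if (mprotect(page, len, PROT_NONE) != 0) {
        fprintf(stderr, "guard_enable: mprotect: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

/* Restore read/write access before the memory is freed or reused.
 * PROT_EXEC is deliberately not requested: the page was never mapped
 * executable, and asking for write plus exec is what fails on arm macs. */
int guard_disable(void *page, size_t len) {
    if (mprotect(page, len, PROT_READ | PROT_WRITE) != 0) {
        fprintf(stderr, "guard_disable: mprotect: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller would map a region with `map_pages`, call `guard_enable` on its first page, and check the return value of `guard_disable` before releasing the memory, so a stale guard page surfaces as an explicit error rather than a later sporadic segfault.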

ronawho commented 1 year ago

Here's a standalone C program that demonstrates the underlying issues:

#include <stdio.h>
#include <sys/mman.h>
#include <assert.h>

int main() {
    void *addr = mmap(0, 1024, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
    assert(addr != MAP_FAILED);

    // enable guard
    assert(mprotect(addr, 1024, PROT_NONE) == 0);

    // disable guard (this write+exec request is the call that fails on arm macs)
    assert(mprotect(addr, 1024, PROT_EXEC | PROT_WRITE | PROT_READ) == 0);
}

Assertion failed: (mprotect(addr, 1024, PROT_EXEC | PROT_WRITE | PROT_READ) == 0), function main, file mprotectTest.c, line 13.

ronawho commented 1 year ago

Fixed in https://github.com/chapel-lang/chapel/pull/23232

ronawho commented 1 year ago

@PHHargrove no need to test or change automated configs for my sake, but just a heads up that you should be able to use the smp conduit (or any config that requires a shared heap / fixed segment) now

PHHargrove commented 1 year ago

@ronawho FYI: I've reenabled Chapel testing on one Apple M1 system where I'd previously disabled it due to this issue. Results show the handful of tests which we run are all passing.