chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Multilocale interop tests hang on darwin #16816

Open daviditen opened 3 years ago

daviditen commented 3 years ago

Multilocale interop tests deadlock/hang with CHPL_COMM=gasnet on darwin.

cd test/interop/C/multilocale/exportString
start_test noArgsRetString.ml-test.c
...
timedexec Alarm Clock
timedexec sending SIGTERM
*** Caught a signal (proc 0): SIGTERM(15)
[Elapsed execution time for "interop/C/multilocale/exportString/noArgsRetString" - 300.186 seconds]
[Elapsed time to compile and execute all versions of "interop/C/multilocale/exportString/noArgsRetString" - 322.771 seconds]
[Finished subtest "interop/C/multilocale/exportString/noArgsRetString.ml-test" - 323.302 seconds]
...
[Test Summary - 201204.142249]
[Error: Timed out executing program interop/C/multilocale/exportString/noArgsRetString]
[Summary: #Successes = 0 | #Failures = 1 | #Futures = 0 | #Warnings = 0 ]
[Summary: #Passing Suppressions = 0 | #Passing Futures = 0 ]
[END]
% chpl --version
chpl version 1.24.0 pre-release (8a6d3bd241)
  built with LLVM version 10.0.1
Copyright 2020 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
% printchplenv --anonymize
CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: clang
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: qthreads
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_UNWIND: system *
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_REGEXP: re2
CHPL_LLVM: bundled
CHPL_AUX_FILESYS: none

lydia-duncan commented 3 years ago

Do you have CHPL_RT_MASTERIP and CHPL_RT_WORKERIP set?

lydia-duncan commented 3 years ago

It could be that the server and client are failing to connect. We've seen this happen on Macs before and added those variables to help with it. (I'm realizing the documentation is not very clear about the situations where that could occur: https://chapel-lang.org/docs/latest/technotes/libraries.html#hostnames-and-connection-issues)

bradcray commented 3 years ago

Could / should the tests set those rather than requiring developers to do so? (I realize this doesn't help end-users...).

lydia-duncan commented 3 years ago

Doing so would need to happen on a per-system basis. The settings that are appropriate for Macs are decidedly not appropriate on supercomputers, for instance.

daviditen commented 3 years ago

I did not have them set. What should I set them to? Setting them both to 127.0.0.1 didn't fix the hang. I also tried setting both to my computer's $HOSTNAME with the same result.

lydia-duncan commented 3 years ago

I have mine set like this:

    export CHPL_RT_MASTERIP=127.0.0.1
    export CHPL_RT_WORKERIP=127.0.0.0

bradcray commented 3 years ago

If this is a matter of settings (though David's response makes me worried that it's not), it seems like we could special-case the settings for CHPL_TARGET_PLATFORM=darwin to better support our developers?

(My impression was that the defaults worked most of the time on typical platforms and that Macs were special / different in some way... though I can't recall why or what the problem was...)

daviditen commented 3 years ago

OK, setting

export CHPL_RT_WORKERIP=127.0.0.0
export CHPL_RT_MASTERIP=127.0.0.1

Worked. Setting them both to the same value like I guessed didn't.
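
A pre-flight check along the following lines could catch this misconfiguration before the test runs into its timeout. It is a minimal sketch based only on what was observed in this thread (both variables set, and set to different values); the function name `check_rt_ips` is hypothetical and not part of Chapel's tooling.

```shell
#!/bin/bash
# Hypothetical helper, not part of Chapel. It encodes only the behavior
# reported in this thread: on darwin with CHPL_COMM=gasnet, the interop
# tests hung unless both variables were set to *different* addresses.
check_rt_ips() {
  master="${CHPL_RT_MASTERIP:-}"
  worker="${CHPL_RT_WORKERIP:-}"

  # Both variables must be present and non-empty.
  if [ -z "$master" ] || [ -z "$worker" ]; then
    echo "CHPL_RT_MASTERIP and CHPL_RT_WORKERIP must both be set" >&2
    return 1
  fi

  # Identical values reproduced the hang in this thread.
  if [ "$master" = "$worker" ]; then
    echo "CHPL_RT_MASTERIP and CHPL_RT_WORKERIP are both '$master'" >&2
    return 1
  fi

  return 0
}
```

Calling `check_rt_ips || exit 1` before `start_test` would stop early with a message instead of waiting out the timeout.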

bradcray commented 3 years ago

So should/could we put an EXECENV file in all multilocale interop directories that says something like:

#!/bin/bash

unamestr=`uname`

if [[ $unamestr == "Darwin" ]]; then
  echo CHPL_RT_WORKERIP=127.0.0.0
  echo CHPL_RT_MASTERIP=127.0.0.1
fi

lydia-duncan commented 3 years ago

> Setting them both to the same value like I guessed didn't.

Yeah, my memory is that they very specifically needed to not point at the same place for Macs.

Brad, your suggestion seems reasonable, though I'd lean on our CHPL_TARGET_PLATFORM instead of uname (especially once Elliot's fix for EXECENVs goes in).
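
For illustration, a CHPL_TARGET_PLATFORM-based variant might look roughly like the sketch below. It assumes the test harness exports CHPL_TARGET_PLATFORM into the EXECENV script's environment, which this thread does not confirm. The fallback to lowercased `uname` output and the function name `emit_interop_env` are additions made for this sketch only.

```shell
#!/bin/bash
# Hypothetical EXECENV sketch; emit_interop_env is an illustrative name.
# Prints the settings that worked on darwin in this thread, and nothing
# on any other platform.
emit_interop_env() {
  # $1: platform name, e.g. the value of CHPL_TARGET_PLATFORM
  if [ "$1" = "darwin" ]; then
    echo CHPL_RT_WORKERIP=127.0.0.0
    echo CHPL_RT_MASTERIP=127.0.0.1
  fi
}

# Assumption: CHPL_TARGET_PLATFORM is visible to this script. If it is
# not set, fall back to lowercased uname output ("darwin" on macOS).
platform="${CHPL_TARGET_PLATFORM:-$(uname | tr '[:upper:]' '[:lower:]')}"
emit_interop_env "$platform"
```

Keeping the platform test in one function means only the line that computes `platform` would need to change if the detection strategy changes.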

bradcray commented 3 years ago

> though I'd lean on our CHPL_TARGET_PLATFORM instead of uname

Why is that?

lydia-duncan commented 3 years ago

Code reuse and maintainability. I don't see value in reimplementing how to determine the platform for this particular case; I'd much rather rely on the strategy that already makes that determination. If Macs decide to indicate their platform in a different way in the next operating system release, I'd rather not have to update a bunch of test scripts in addition to how we determine CHPL_TARGET_PLATFORM.

bradcray commented 3 years ago

OK. I think it's unlikely that Macs will change their uname string before we change our CHPL_TARGET_PLATFORM string and that even if they did, we'd be in trouble since we use uname to determine darwin as its value. But whoever implements the fix can make the choice as far as I'm concerned (not it!). I grabbed this pattern from other EXECENVs in the interop/multilocale directory.

lydia-duncan commented 3 years ago

Fair. I also don't see the point in changing the strategy that's already in use to suit my personal preference, so I'd be fine with extending it in that way