Bears-R-Us / arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

`'(nil)'` file created upon launching arkouda server #2816

Closed alvaradoo closed 1 year ago

alvaradoo commented 1 year ago

Describe the bug Ever since updating to Arkouda v2023.10.06, launching the Arkouda server with `/path/to/arkouda-2023.10.06/arkouda_server -nl 4` creates a file named '(nil)' in the launch directory. The file contains the head node's connection information: `n14.cluster.local 5555 tcp://n14.cluster.local:5555`. This happens on Kruskal (a cluster at the UMD Physical Sciences Lab); I have not tested it on other systems or locally.

To Reproduce Steps to reproduce the behavior:

  1. Launch the Arkouda server.
  2. Run `ls` in the directory you launched the server from.

Expected behavior Expected the file to not be created.

Error Message None.

Is this a Blocking Issue Nope.

ak.get_config() Output

{'arkoudaVersion': 'v2023.10.06', 'chplVersion': '1.32.0', 'ZMQVersion': '4.3.4', 'HDF5Version': '1.12.2', 'serverHostname': 'n14.cluster.local', 'ServerPort': 5555, 'numLocales': 4, 'numPUs': 128, 'maxTaskPar': 128, 'physicalMemory': 549739035728, 'distributionType': 'BlockDom(1,int(64),one,unmanaged DefaultDist)', 'LocaleConfigs': [{'id': 0, 'name': 'n14.cluster.local', 'numPUs': 128, 'maxTaskPar': 128, 'physicalMemory': 549739035728}, {'id': 1, 'name': 'n15.cluster.local', 'numPUs': 128, 'maxTaskPar': 128, 'physicalMemory': 549739035728}, {'id': 2, 'name': 'n16.cluster.local', 'numPUs': 128, 'maxTaskPar': 128, 'physicalMemory': 549739035728}, {'id': 3, 'name': 'n17.cluster.local', 'numPUs': 128, 'maxTaskPar': 128, 'physicalMemory': 549739035728}], 'authenticate': False, 'logLevel': 'INFO', 'logChannel': 'CONSOLE', 'regexMaxCaptures': 20, 'byteorder': 'little', 'autoShutdown': False, 'serverInfoNoSplash': False}

Additional context Grepped environment variables. Note: the problem still happens if ARKOUDA_QUICK_COMPILE is not set.

(arkouda-dev) [oaa9@kruskal-login1 arkouda-njit]$ env | grep -E 'CHPL|ARK'
CHPL_TARGET_CPU=native
CHPL_DIR=/scratch/shared/apps/chapel-1.32.0
CHPL_BIN_SUBDIR=/scratch/shared/apps/chapel-1.32.0/util/chplenv/chpl_bin_subdir.py
CHPL_LAUNCHER=slurm-srun
CHPL_GMP=bundled
CHPL_GASNET_MORE_CFG_OPTIONS=--with-pmi-home=/opt/scyld/slurm/
ARKOUDA_QUICK_COMPILE=true
CHPL_COMM_SUBSTRATE=ibv
CHPL_COMM=gasnet
CHPL_MEM=jemalloc
CHPL_HWLOC=bundled
CHPL_TASKS=qthreads
CHPL_LLVM=bundled
CHPL_HOME=/scratch/shared/apps/chapel-1.32.0
CHPL_GASNET_SEGMENT=fast

Chapel environment.

(arkouda-dev) [oaa9@kruskal-login1 arkouda-njit]$ printchplenv --all --anonymize
CHPL_HOST_PLATFORM: linux64
CHPL_HOST_COMPILER: gnu
  CHPL_HOST_CC: gcc
  CHPL_HOST_CXX: g++
CHPL_HOST_ARCH: x86_64
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
  CHPL_TARGET_CC: /scratch/shared/apps/chapel-1.32.0/third-party/llvm/install/linux64-x86_64/bin/clang --gcc-toolchain=/scratch/shared/apps/gcc/gcc-13.1.0
  CHPL_TARGET_CXX: /scratch/shared/apps/chapel-1.32.0/third-party/llvm/install/linux64-x86_64/bin/clang++ --gcc-toolchain=/scratch/shared/apps/gcc/gcc-13.1.0
  CHPL_TARGET_LD: /scratch/shared/apps/chapel-1.32.0/third-party/llvm/install/linux64-x86_64/bin/clang++ --gcc-toolchain=/scratch/shared/apps/gcc/gcc-13.1.0
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: ibv *
  CHPL_GASNET_SEGMENT: fast *
  CHPL_GASNET_VERSION: 1
CHPL_TASKS: qthreads *
CHPL_LAUNCHER: slurm-srun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_HOST_MEM: jemalloc
CHPL_MEM: jemalloc *
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled *
CHPL_HWLOC: bundled *
CHPL_RE2: bundled
CHPL_LLVM: bundled *
  CHPL_LLVM_SUPPORT: bundled
  CHPL_LLVM_CONFIG: /scratch/shared/apps/chapel-1.32.0/third-party/llvm/install/linux64-x86_64/bin/llvm-config
  CHPL_LLVM_VERSION: 15
CHPL_AUX_FILESYS: none
CHPL_LIB_PIC: none
CHPL_SANITIZE: none
CHPL_SANITIZE_EXE: none

stress-tess commented 1 year ago

I am able to reproduce this on my machine when CHPL_COMM is gasnet. For me it writes a file named 0x0 containing my connection info:

cat 0x0
Pierces-MacBook-Pro.local 5555 tcp://Pierces-MacBook-Pro.local:5555

This happens as soon as you launch the server; you don't even need to connect from the client. I was able to trace this back to this proc: https://github.com/Bears-R-Us/arkouda/blob/f3f1de8930a536fbb8b40d1538a14a16aab0856b/src/ServerDaemon.chpl#L269-L279

That proc's output matches what we see in the file, and with the logger level set to DEBUG, the log statement matches the file name I'm seeing:

[ServerDaemon] createServerConnectionInfo Line 272 DEBUG [Chapel] writing serverConnectionInfo to 0x0

I'm hoping this is just due to this proc not using writefCompat from ArkoudaIOCompat. I'll try updating that and see if it fixes the problem. If not, I'll lean on @bmcdonald3 and @hokiegeek2, since they both know aspects of this block of code better than I do.

If this is the result of a missing compat call, I wonder if we should look for places in the code where we import a compat module within the scope of a proc, because that might be what kept us from catching this one.

stress-tess commented 1 year ago

All the IO and compat stuff seemed fine, so I dug a little deeper. It seems the result we get from getEnv for environment variables that aren't set has changed in chpl 1.32. I put together a little standalone Chapel program that has all the relevant pieces from arkouda:

use CTypes;

// for chpl 1.32, use:
type c_string_ptr = c_ptrConst(c_char);
// for chpl 1.31, use:
// type c_string_ptr = c_string;

proc getEnv(name: string, default=""): string {
    extern proc getenv(name : c_string_ptr) : c_string_ptr;
    var val = getenv(name.localize().c_str()): string;
    if val.isEmpty() { val = default; }
    return val;
}

config const emptyStr: string = getEnv("ARKOUDA_SERVER_CONNECTION_INFO", "");

writeln("emptyStr: ", emptyStr);
writeln("emptyStr.isEmpty(): ", emptyStr.isEmpty());

In both of the runs below, ARKOUDA_SERVER_CONNECTION_INFO is unset and CHPL_COMM is gasnet. With 1.32:

chpl --version
chpl version 1.32.0
  built with LLVM version 15.0.7

chpl PlayingAround.chpl -o playing

./playing -nl 1
emptyStr: 0x0
emptyStr.isEmpty(): false

With 1.31:

chpl --version
chpl version 1.31.0
  built with LLVM version 15.0.7

chpl PlayingAround.chpl -o playing

./playing -nl 1
emptyStr:
emptyStr.isEmpty(): true

EDIT: It's worth noting that the proc getEnv was added to arkouda by @ronawho 4 years ago, and there is now a getenv function in OS.POSIX. I was hoping that replacing the extern proc with the one from OS.POSIX would fix it, but it just caused 1.31 to give the same answer as 1.32.

stress-tess commented 1 year ago

It seems that switching to the getenv from OS.POSIX and doing an equality comparison against nil will fix things. I'm far from convinced this is the most elegant solution.
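
A minimal sketch of that nil check, assuming Chapel 1.32, where `OS.POSIX.getenv` returns a `c_ptr(c_char)` and `string.createCopyingBuffer` accepts that pointer. The name `getEnvOrDefault` is hypothetical, and this is not necessarily the change that was merged into Arkouda:

```chapel
use CTypes;
use OS.POSIX;  // provides getenv, which returns a C pointer (nil if unset)

// Hypothetical helper: returns `default` when the variable is unset
// (nil pointer) or set to the empty string, and its value otherwise.
proc getEnvOrDefault(name: string, default = ""): string {
  const ptr = getenv(name.localize().c_str());

  // Check the raw pointer before building a string from it, rather than
  // casting the pointer to string and testing isEmpty() afterwards.
  if ptr == nil then return default;

  // Copy the C string into a Chapel string; try! halts on invalid UTF-8.
  const val = try! string.createCopyingBuffer(ptr);
  return if val.isEmpty() then default else val;
}

const info = getEnvOrDefault("ARKOUDA_SERVER_CONNECTION_INFO");
writeln("info.isEmpty(): ", info.isEmpty());
```

The point of the sketch is the ordering: the nil comparison happens on the pointer itself, so an unset variable never reaches the string conversion that produced `0x0` and `(nil)` above. With ARKOUDA_SERVER_CONNECTION_INFO unset, `info` should come back empty, so no connection-info file name would be derived from it.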

ronawho commented 1 year ago

Hmm, nice investigating. Seems like the core issue is that a c_nil:string is no longer considered an empty string. That's kinda surprising, but maybe it's now considered a single-element null-terminated string? Checking against nil is certainly an easy fix and may be worth doing in the short term. When I get a chance I'll check with others on the Chapel team to see if that change in behavior is expected.