chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.75k stars 413 forks source link

Out of Memory Issues Caused by `in` intents #25005

Open ShreyasKhandekar opened 1 month ago

ShreyasKhandekar commented 1 month ago

Summary of Problem

in intents on functions can cause out of memory issues that crash chapel programs.

Description: This issue is prompted by a bug reported by one of the Arkouda devs here: https://github.com/Bears-R-Us/arkouda/issues/3089.

In the case where it was reported, we had the following line in the Arkouda source code:

var temp = makeDistArray(a);

The makeDistArray function is part of the tryCreateArray interface which throws an ArrayOomError error when the allocation fails because we're out of memory.

However, the Arkouda issue above reported that this interface was not working as expected and was causing a server crash even when setting CHPL_GASNET_SEGMENT=fast (which is the configuration in which this interface for throwing on OOM works best.)

Here the makeDistArray(a) has the following function header:

proc makeDistArray(in a: [?D] ?etype) throws where [irrelevant clause]

Now because of this in intent, the array is copied when the function is called, this invokes our chpl__initCopy function in the ChapelArray.chpl module. https://github.com/chapel-lang/chapel/blob/e0d0e9db11d04725cb10396bd00f0acf8eac24b9/modules/internal/ChapelArray.chpl#L3013-L3018

This function does NOT use the tryCreateArray interface and therefore the Arkouda server crashes when the program runs out of memory here.

We should have a discussion about whether this initCopy should be made to use the throwing interface so that the problem can be caught instead of just crashing.

Is this a blocking issue with no known work-arounds? The original issue was resolved by using a different version of makeDistArray which avoided the array being copied using the in intent. Thanks a lot to @jeremiah-corrado for suggesting the workaround and helping resolve this issue!

Steps to Reproduce

I haven't found time to come up with a small reproducer which does not involve Arkouda. But with Arkouda we can do the following in gasnet config

Launch arkouda server:

./arkouda_server -nl 6 --logLevel=LogLevel.DEBUG --ServerDaemon.daemonTypes=ServerDaemonType.DEFAULT,ServerDaemonType.METRICS

(TODO, investigate single locale reproducer)

Run the following commands in the python client:

import arkouda as ak
ak.connect()
pd = ak.randint(131, 131, 1393000000)
ak.argsort(pd)

Now the array size is a tricky argument. In order to get the failure here the array size needs to be:

  1. Large enough that the machine runs out of memory
  2. Small enough such that the overMemLimit function in Arkouda estimates that it will take less memory that it actually does i. This function is how we checked for OOM errors before the tryCreateArray interface and it's estimates aren't always accurate.
  3. Large enough such that the machine runs out of memory at the chpl__initCopy step (i.e. before the makeDistArray function body is actually entered)

On my mac this size was 493000000 but on ChapDL it was 1393000000.

Configuration Information

Reveal - Output of `chpl --version`: ``` chpl version 2.1.0 pre-release (7e181b1d05) built with LLVM version 15.0.4 available LLVM targets: nvptx64, nvptx, x86-64, x86, amdgcn, r600 Copyright 2020-2024 Hewlett Packard Enterprise Development LP Copyright 2004-2019 Cray Inc. (See LICENSE file for more details) ``` - Output of `$CHPL_HOME/util/printchplenv --anonymize`: ``` CHPL_TARGET_PLATFORM: linux64 CHPL_TARGET_COMPILER: llvm CHPL_TARGET_ARCH: x86_64 CHPL_TARGET_CPU: none * CHPL_LOCALE_MODEL: flat CHPL_COMM: gasnet * CHPL_COMM_SUBSTRATE: udp CHPL_GASNET_SEGMENT: fast * CHPL_TASKS: qthreads CHPL_LAUNCHER: amudprun CHPL_TIMERS: generic CHPL_UNWIND: none CHPL_MEM: jemalloc CHPL_ATOMICS: cstdlib CHPL_NETWORK_ATOMICS: none CHPL_GMP: bundled CHPL_HWLOC: bundled CHPL_RE2: bundled CHPL_LLVM: system CHPL_AUX_FILESYS: none ``` -- `gcc --version` ``` gcc (SUSE Linux) 7.5.0 Copyright (C) 2017 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ```
mppf commented 1 month ago

We should have a discussion about whether this initCopy should be made to use the throwing interface so that the problem can be caught instead of just crashing.

Is this the main point of this issue? I see the bug and compiler labels but this feels more like a feature request to me. Anyway, if this is the main point, maybe we can give it a different title to focus on that?

The language has a bunch of features that can request a copy of an array. We try to optimize these to some extent, so an in formal won't necessarily cause a copy. (Of course, it did in the problematic case here). There are quite a few more of these than just in intent (var x = y; and returning by value come to mind). I can't imagine removing these (besides, they are stable), and at the end of the day, any language that allows allocating memory will also allow one to write programs that run out of memory. But, we can make tools and warnings and linters etc.

ShreyasKhandekar commented 1 month ago

Ah, Github added the Bug label on its own when I filed it. I changed it to feature request. I think I agree with you that, in general, people can always write programs that run out of memory. I think this issue is just some fodder to think about it in terms of 'should we allow programs to recover from such errors instead of just crash?' and 'What are the cases where we would want to support this?'