chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

GPU memory error when accessing memory dynamically allocated on GPU from CPU #20081

Open ShreyasKhandekar opened 2 years ago

ShreyasKhandekar commented 2 years ago

Summary of Problem

When we access memory that was dynamically allocated inside an on block targeting a GPU, after leaving that block, we hit the following error:

GPUMemError.chpl:4: error: halt reached - Trying to free a GPU pointer outside a GPU sublocale

Steps to Reproduce

Source Code:

use List;
class C { var x: string; }

var l = new list(owned C?);

on here.gpus[0] {
  l.pushBack(new owned C("10"));
}

writeln(l[0]);

Compile command: chpl GPUMemError.chpl

Execution command: ./GPUMemError

This happens because in LocaleModel.chpl we halt() when we try to free this memory after exiting the GPU block.

We could probably do something more intelligent here in order to free the memory without causing this issue.

### Configuration Information

- Output of `chpl --version`:

```bash
❯ chpl --version
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
chpl version 1.27.0 pre-release (3d0de33be1)
  built with LLVM version 13.0.0
Copyright 2020-2022 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
```

- Output of `$CHPL_HOME/util/printchplenv --anonymize`:

```bash
CHPL_TARGET_PLATFORM: cray-xc
CHPL_TARGET_COMPILER: llvm *
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: gpu *
CHPL_COMM: none *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-srun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_AUX_FILESYS: none
```

- Back-end compiler and version, e.g. `gcc --version` or `clang --version`:

```bash
gcc (GCC) 11.2.0 20210728 (Cray Inc.)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

- (For Cray systems only) Output of `module list`:

```bash
Currently Loaded Modulefiles:
  1) modules/3.2.11.4   8) dmapp/7.1.1-7.0.4.0_40.8__gcec52bc.ari   15) cray-mpich/7.7.20   22) atp/3.14.11
  2) craype-network-aries   9) xpmem/2.2.29-7.0.4.0_50.1__g35859a4.ari   16) dws/3.0.37-7.0.4.0_67.26__g5d83a9b.ari   23) rca/2.2.21-7.0.4.0_26.1__gb0ce89b.ari
  3) nodestat/2.3.89-7.0.4.0_34.11__g8645157.ari   10) llm/21.4.635-7.0.4.0_46.3__g33a55bc.ari   17) cudatoolkit/21.5_11.3   24) perftools-base/22.04.0
  4) sdb/3.3.821-7.0.4.0_28.24__g8c59c9d.ari   11) nodehealth/5.6.32-7.0.4.0_81.1__g66010cb.ari   18) gcc/11.2.0   25) PrgEnv-gnu/6.0.11
  5) udreg/2.3.2-7.0.4.0_37.17__g5f0d670.ari   12) system-config/3.6.3187-7.0.4.0_53.8__gd6312243.ari   19) craype/2.7.16.5
  6) ugni/6.0.14.0-7.0.4.0_28.4__ge0d449e.ari   13) slurm/20.11.5-1   20) cray-libsci/20.09.1
  7) gni-headers/5.0.12.0-7.0.4.0_38.10__gd0d73fe.ari   14) Base-opts/2.4.142-7.0.4.0_43.7__g8f27585.ari   21) pmi/5.0.17
```

ShreyasKhandekar commented 2 years ago

This issue was first encountered while working with the SHOC benchmarks. Specifically, the Chapel versions of the Triad benchmark ran into this issue when we tried to access the objects of the ResultDatabase record from outside of the GPU block.

For now we moved those accesses to be inside the GPU block to work around this issue, but whoever fixes this might also want to change the triad.chpl and its alternate versions to have cleaner code. [Insert link to triad tests here]

e-kayrakli commented 2 years ago

On the surface, I feel like we can just remove the halt and it should work. Though, when we were pair-programming, that didn't help in Shreyas's code.

Also, I am curious what would happen in a multilocale, non-GPU setup for the same code, with the on statement targeting Locales[1] instead. Assuming that it works, does the data structure go to the item's locale when calling its deinit? If so, why doesn't it work for a GPU sublocale? Is there some other mechanism making it work that we don't have with the GPU locale model?
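
For concreteness, the multilocale experiment asked about above could be sketched as follows. This is the reproducer from the issue with only the target of the on statement changed. It assumes a build with the default (non-GPU) locale model and CHPL_COMM set to something other than none, run on at least two locales (e.g. with `-nl 2`). Whether it runs cleanly is exactly the open question, so no expected output is given.

```chapel
use List;
class C { var x: string; }

// The list itself lives on the locale where the program starts (locale 0).
var l = new list(owned C?);

// Same as the reproducer, but targeting a second CPU locale
// instead of here.gpus[0]. The C object is allocated on Locales[1].
on Locales[1] {
  l.pushBack(new owned C("10"));
}

// Access the element from locale 0; the element is freed
// later, when l goes out of scope at the end of the program.
writeln(l[0]);
```

If this version runs without error, comparing how the element's deinit is executed here against the GPU sublocale case may show which mechanism is missing from the GPU locale model.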

I am also curious whether this is specific to string (and bytes, most likely). It might have to do with the relatively simple check in string.localize. Though I would expect an issue with that to cause an issue in insertion, not deallocation.

A final note: in Shreyas's code, the data structure in question was sortedMap.