chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Large GPU Memory Usage in Idle Device #23466

Open xianghao-wang opened 1 year ago

xianghao-wang commented 1 year ago

Environment

chpl --version

chpl version 1.32.0 pre-release (5a93ea08f9)
  built with LLVM version 15.0.7
  available LLVM targets: amdgcn, r600, nvptx64, nvptx, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2023 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

Summary of Problem

Idle devices show a large GPU memory usage when running the following code, as shown in the picture below.

module Test1 {
  proc hang() {
    while true {}; // busy-wait so GPU memory usage can be inspected while the program idles
  }

  proc main() {
    hang();
  }
}
(screenshot: GPU memory usage across devices, 2023-09-20 13:51)

Steps to Reproduce

I tried setting CHPL_RT_NUM_THREADS_PER_LOCALE to indicate that I will only use some of these GPUs. Even with CHPL_RT_NUM_THREADS_PER_LOCALE=0, I still get the extra memory usage on the idle devices.

(screenshot: GPU memory usage with CHPL_RT_NUM_THREADS_PER_LOCALE=0, 2023-09-20 13:59)
e-kayrakli commented 1 year ago

Thanks for the issue @xianghao-wang! I can reproduce locally.

I think this has to do with us setting up the CUDA context per device and loading the GPU binary at application startup. Changing our driver initialization to call only cuInit(0), I get 3MB of memory usage on my system, whereas by default I get slightly over 100MB. I guess the exact amount is a function of the device characteristics, the number of devices, and the GPU kernels the Chapel compiler generates. I am aware that your code doesn't have any GPU kernels, but we do have forall/foreach loops in standard libraries that we generate kernels for.
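
To make the difference concrete, here is a minimal standalone sketch (plain CUDA Driver API, built with -lcuda; not Chapel's runtime code) that mimics the eager setup described above: after cuInit(0), it retains each device's primary context, which is where the bulk of the per-device memory shows up. The module-load step is only indicated in a comment because it needs a compiled GPU binary image.

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

static void check(CUresult res, const char *what) {
  if (res != CUDA_SUCCESS) {
    fprintf(stderr, "%s failed with error %d\n", what, (int)res);
    exit(1);
  }
}

int main(void) {
  check(cuInit(0), "cuInit");               // driver init only: a few MB
  int ndev = 0;
  check(cuDeviceGetCount(&ndev), "cuDeviceGetCount");

  // Eager per-device setup, roughly analogous to what happens at startup
  // today: each retained primary context reserves memory on its device even
  // if the program never launches a kernel there.
  for (int i = 0; i < ndev; i++) {
    CUdevice dev;
    CUcontext ctx;
    check(cuDeviceGet(&dev, i), "cuDeviceGet");
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");
    // The runtime would also load the generated GPU binary here, e.g. with
    // cuModuleLoadData(); omitted because it requires a module image.
  }

  printf("initialized %d device(s); inspect memory usage with nvidia-smi\n", ndev);
  getchar();  // keep the process alive while memory usage is inspected
  return 0;
}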

Has this been an issue for you, or were you just surprised to see the memory usage? In the past, we've considered initializing the GPU driver/module lazily after the first operation on the GPU. We didn't have any motivation to pursue that, so we shelved the idea. Moreover, we were (and still are) a bit afraid of the potential performance hit at the first GPU operation because of that initialization.

In terms of initialization performance, @stonea recently found out that cuInit calls can be really costly (a couple of seconds) if the system is not running the NVIDIA persistence daemon. However, the actual memory cost here comes from what we do after cuInit, so we could still call cuInit at application startup and defer the rest of the initialization. I still expect that to cost some time during the first GPU op. Looking at our internal notes, it is roughly 0.3 seconds, which can be considered a high cost; our empty-kernel launch time is on the order of microseconds.
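
For anyone wanting to check the persistence-daemon effect on their own system, a small timing sketch like the following (an illustration, not part of Chapel) isolates the cost of the cuInit call itself:

#include <cuda.h>
#include <stdio.h>
#include <time.h>

int main(void) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  CUresult res = cuInit(0);               // the call whose cost varies
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  printf("cuInit(0) returned %d after %.3f s\n", (int)res, secs);
  return 0;
}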

So, even if we take the lazy initialization path, I think we should make that an optional behavior rather than the default. But going back to my very first question: if this has more noticeable impact in real benchmarks/applications, we'd be very interested to learn more about it. Note that even if we do lazy initialization, I'd expect to see the same amount of memory persistently kept allocated after the first GPU operation.

xianghao-wang commented 1 year ago

Thanks, Engin. I agree that lazy initialisation may not be the preferred default behaviour.

Setting aside whether the devices are initialised lazily or eagerly, there is an inconsistency between the meaning of CHPL_RT_NUM_THREADS_PER_LOCALE and its actual runtime behaviour.

I tested on my machine: cuInit takes only a few megabytes on each device, and that small amount of memory does not matter. However, during the initialisation phase the runtime traverses every device and calls cuDevicePrimaryCtxRetain, which brings a large memory usage of around 256 MB. This can be disruptive, especially when two users share a few GPU devices. When CHPL_RT_NUM_THREADS_PER_LOCALE is set, I would expect only the subset of GPUs permitted by the environment variable to be traversed, rather than all devices.
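
As a rough illustration of that expectation (not the actual patch), the per-device setup could be bounded by the environment variable so that untouched devices keep no primary context. The sketch below uses the CHPL_RT_NUM_GPUS_PER_LOCALE name that Engin confirms in the next comment:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  cuInit(0);                                    // cheap: a few MB per device

  int ndev = 0;
  cuDeviceGetCount(&ndev);

  const char *s = getenv("CHPL_RT_NUM_GPUS_PER_LOCALE");
  int limit = (s != NULL) ? atoi(s) : ndev;     // unset: use all devices
  if (limit > ndev) limit = ndev;

  // Only the first `limit` devices are touched; the rest retain no primary
  // context and therefore show no extra memory usage.
  for (int i = 0; i < limit; i++) {
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, i);
    cuDevicePrimaryCtxRetain(&ctx, dev);        // the expensive per-device step
  }

  printf("initialized %d of %d device(s)\n", limit, ndev);
  return 0;
}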

I was wondering whether you could assign this issue to me, and I will show my code, which may make more sense. Thanks for your help.

e-kayrakli commented 1 year ago

there is an inconsistency between the meaning of CHPL_RT_NUM_THREADS_PER_LOCALE and its actual runtime behaviour.

First, to be clear, you mean CHPL_RT_NUM_GPUS_PER_LOCALE, correct?

For that, yeah, I agree. To give some bigger context: we started our runtime implementation hardwired for NVIDIA, then modularized it to cover AMD and Intel down the road. As part of that, some of the logic was moved out of the driver wrapper layer (where cuDevicePrimaryCtxRetain is called) into the higher layer in the runtime (runtime/src/chpl-gpu.c). However, initialization and kernel launching have been left in the driver wrapper layer. The latter for a relatively good reason: things may change with Intel, so we'll refactor after we add Intel runtime support. Initialization doesn't have a good reason for being there anymore; it is just something we haven't gotten around to.

Why does that matter? Because our higher layer first initializes the implementation layer (where drivers are wrapped), which goes on to initialize all devices. Only after that do we check the value of CHPL_RT_NUM_GPUS_PER_LOCALE to mask the GPUs that we don't want to expose to the user. See how chpl_gpu_init is structured.

I think the ideal solution here is to divide chpl_gpu_impl_init into things like chpl_gpu_impl_init_driver and chpl_gpu_impl_init_device. Then, based on the lazy initialization choice and CHPL_RT_NUM_GPUS_PER_LOCALE, we can call chpl_gpu_impl_init_device from chpl_gpu_init. Or we could (1) have chpl_gpu_impl_use_device call it, or (2) create a chpl_gpu_use_device that calls chpl_gpu_impl_init_device and chpl_gpu_impl_use_device.
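
A rough sketch of how that split could look, using the function names suggested above; the bodies are illustrative and not Chapel's actual runtime code:

#include <cuda.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_DEVICES 64          // arbitrary bound for this sketch

static bool dev_initialized[MAX_DEVICES];

// Driver-wide setup only: cheap, fine to run at application startup.
static void chpl_gpu_impl_init_driver(void) {
  cuInit(0);
}

// Per-device setup: retain the primary context (and, in the real runtime,
// load the generated GPU binary) only for devices that will actually be used.
static void chpl_gpu_impl_init_device(int dev_idx) {
  if (dev_initialized[dev_idx]) return;
  CUdevice dev;
  CUcontext ctx;
  cuDeviceGet(&dev, dev_idx);
  cuDevicePrimaryCtxRetain(&ctx, dev);
  dev_initialized[dev_idx] = true;
}

// chpl_gpu_init would call the driver init once and then either eagerly
// initialize only the devices allowed by CHPL_RT_NUM_GPUS_PER_LOCALE, or
// defer to a chpl_gpu_use_device-style hook that calls init_device lazily.
static void chpl_gpu_init(void) {
  chpl_gpu_impl_init_driver();

  int ndev = 0;
  cuDeviceGetCount(&ndev);
  const char *s = getenv("CHPL_RT_NUM_GPUS_PER_LOCALE");
  int limit = (s != NULL && atoi(s) < ndev) ? atoi(s) : ndev;

  for (int i = 0; i < limit && i < MAX_DEVICES; i++)
    chpl_gpu_impl_init_device(i);
}

int main(void) { chpl_gpu_init(); return 0; }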

This can be disruptive, especially when two users share a few GPU devices.

I can see this being problematic, yes. However, CHPL_RT_NUM_GPUS_PER_LOCALE will use the first N GPUs, so it wouldn't suffice to solve the problem. You might want to try CUDA_VISIBLE_DEVICES for better control over which devices you use versus which others use, if you want to coordinate on a shared system, for example. Note that we haven't tried using it, as far as I am aware. I am not sure whether it affects how the CUDA Driver API (which we use) acts, versus just masking things at the CUDA Runtime API level (which a typical CUDA user uses).

I was wondering whether you could assign this issue to me, and I will show my code, which may make more sense. Thanks for your help.

I am not sure what you are suggesting here. Do you want to work on a solution?

xianghao-wang commented 11 months ago

Hello, Engin. I am curious what the actual meaning of CHPL_RT_NUM_GPUS_PER_LOCALE is. When CHPL_RT_NUM_GPUS_PER_LOCALE is less than the number of visible GPUs on each locale, the rest of the GPUs still get initialised, which results in a memory footprint on idle devices. I do have a patch to only initialise the GPUs with indices less than CHPL_RT_NUM_GPUS_PER_LOCALE, which eliminates the memory usage on the other idle GPUs. The problem that you raise above, that multiple locales on a node will all use the same set of GPUs, makes me wonder why we would ever use this environment variable instead of CUDA_VISIBLE_DEVICES.

e-kayrakli commented 11 months ago

I am curious what the actual meaning of CHPL_RT_NUM_GPUS_PER_LOCALE is. When CHPL_RT_NUM_GPUS_PER_LOCALE is less than the number of visible GPUs on each locale, the rest of the GPUs still get initialised, which results in a memory footprint on idle devices. I do have a patch to only initialise the GPUs with indices less than CHPL_RT_NUM_GPUS_PER_LOCALE, which eliminates the memory usage on the other idle GPUs.

This is correct, and your patch is definitely a step in the right direction.

The initial motivation for CHPL_RT_NUM_GPUS_PER_LOCALE was the cpu-as-device mode. In that mode, we pretend that there are some GPUs on the node, even though that may not be the case, so we essentially need to know how many GPUs to fake. To keep the environment variable meaningful when there are actual GPUs, we limited it to mean the number of GPUs that the runtime exposes to the user. And that's the behavior today. Your patch would fix an obvious oversight.
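
A hypothetical sketch of those two meanings side by side (function name, defaults, and bounds are assumptions for illustration, not the runtime's actual code):

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

// How many GPU sublocales the runtime would expose, following the semantics
// described above.
static int gpus_to_expose(bool cpu_as_device, int n_physical) {
  const char *s = getenv("CHPL_RT_NUM_GPUS_PER_LOCALE");
  int requested = (s != NULL) ? atoi(s) : 0;   // 0: variable unset

  if (cpu_as_device) {
    // No physical GPUs required; the variable says how many to fake.
    // The unset default of 1 here is a placeholder, not the real default.
    return (requested > 0) ? requested : 1;
  }

  // With real GPUs, the variable only caps how many are exposed.
  if (requested > 0 && requested < n_physical) return requested;
  return n_physical;
}

int main(void) {
  printf("would expose %d GPU(s)\n", gpus_to_expose(false, 4 /* example */));
  return 0;
}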

The problem that you raise above, that multiple locales on a node will all use the same set of GPUs, makes me wonder why we would ever use this environment variable instead of CUDA_VISIBLE_DEVICES.

Because we strive to be portable. Considering only NVIDIA and AMD, we do have both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES, but we should provide a more portable alternative. Since we don't expose any of the details of the CUDA and HIP APIs, we cannot ask users to use those environment variables, in my view, at least not in the long term -- I agree that they are good and valid workarounds while we get our story right.

Going back to my previous answer: how can we achieve the same thing in the cpu-as-device mode? Ideally, there should be a single environment variable that results in the same behavior across vendors and the cpu-as-device mode, for the best user experience. Note that I don't know Intel's solution for this, and that we want to support Intel GPUs in the near future, too.

bradcray commented 11 months ago

Because we strive to be portable. Considering only NVIDIA and AMD, we do have both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES, but we should provide a more portable alternative. Since we don't expose any of the details of the CUDA and HIP APIs, we cannot ask users to use those environment variables, in my view, at least not in the long term -- I agree that they are good and valid workarounds while we get our story right.

I just want to reinforce Engin's message here. Chapel's typical practice is to come up with vendor- and technology-neutral ways to say things that we care about that apply to multiple vendors; in this case, a vendor-neutral way of specifying how many devices to use / are visible. When done right, these Chapel analogs to the vendor-specific capabilities can and should take their defaults from the vendor variables if those are set and the Chapel ones are not. So, for example, CHPL_RT_NUM_GPUS_PER_LOCALE, if unset, could take its default from CUDA_VISIBLE_DEVICES on an NVIDIA platform and HIP_VISIBLE_DEVICES on AMD. If it were set, standard practice would be for it to take precedence over the vendor-specific settings, though it could also warn if both were set, and to divergent values.
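
A hedged sketch of that precedence (not existing Chapel runtime behavior; the helper names are made up, and CUDA_VISIBLE_DEVICES stands in for whichever vendor variable applies):

#include <stdio.h>
#include <stdlib.h>

// Count the comma-separated entries in a *_VISIBLE_DEVICES-style list;
// returns -1 when the variable is unset or empty (i.e. no opinion).
static int count_visible(const char *s) {
  if (s == NULL || *s == '\0') return -1;
  int n = 1;
  for (; *s != '\0'; s++)
    if (*s == ',') n++;
  return n;
}

static int num_gpus_per_locale(int n_physical) {
  const char *chpl = getenv("CHPL_RT_NUM_GPUS_PER_LOCALE");
  // Which vendor variable to consult would depend on the target GPU vendor;
  // CUDA_VISIBLE_DEVICES is used here purely as the NVIDIA example.
  int vendor = count_visible(getenv("CUDA_VISIBLE_DEVICES"));

  if (chpl != NULL) {
    int n = atoi(chpl);
    if (vendor > 0 && vendor != n)
      fprintf(stderr,
              "warning: CHPL_RT_NUM_GPUS_PER_LOCALE=%d overrides a divergent "
              "*_VISIBLE_DEVICES setting (%d devices)\n", n, vendor);
    return n;                    // the Chapel variable takes precedence
  }
  if (vendor > 0) return vendor; // unset: default from the vendor variable
  return n_physical;             // neither set: expose all devices
}

int main(void) {
  printf("would expose %d GPU(s)\n", num_gpus_per_locale(4 /* example count */));
  return 0;
}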