NVIDIA-Merlin / core

Core Utilities for NVIDIA Merlin
Apache License 2.0

Run with import without gpu #261

Closed jperez999 closed 1 year ago

jperez999 commented 1 year ago

This PR enables a user to run our GPU docker container without GPUs and still run all tests successfully. Previously we had try/except blocks in a few places in the code for importing GPU-specific packages. This PR removes those try/except blocks because, in an environment where the packages are installed but no GPU is present, you end up with a ccuda.pyx init failure. The packages (i.e. cudf, dask_cudf, rmm) do exist, but when they try to access information about GPUs they fail and throw an error like:

collecting ... Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInitGlobal'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Fatal Python error: Segmentation fault

This PR leverages the compat file, makes it the single point of import for the main GPU packages (cudf, cupy), and adds a guard that ensures those packages are imported only if GPUs are detected. So if you find yourself in a scenario where the packages are installed but no GPUs are detected, you can now still safely use the core package. As a result, we can run our containers both with and without the --gpus=all flag in docker. This was a customer ask, and it helps developers who want to test CPU-only behavior on a machine that has GPUs.
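The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual merlin.core.compat code; the libcuda-based detection is a stand-in for the dask/nvml check the PR uses:

```python
import ctypes

def _detect_gpu():
    """Best-effort GPU detection (stand-in for the dask nvml check in the PR)."""
    try:
        # Loading the driver library raises OSError when no GPU driver is present,
        # which is exactly the situation that used to crash inside ccuda.pyx.
        ctypes.CDLL("libcuda.so")
        return True
    except OSError:
        return False

HAS_GPU = _detect_gpu()

# Single point of import: GPU packages are only touched when a device is visible,
# so "package installed but no GPU" can never reach the failing driver init.
cudf = None
cupy = None
if HAS_GPU:
    try:
        import cudf  # noqa: F401
        import cupy  # noqa: F401
    except ImportError:
        cudf = None
        cupy = None
```

Call sites then check `compat.cudf is not None` rather than wrapping each import in their own try/except.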

oliverholworthy commented 1 year ago

Fatal Python error: Segmentation fault

The segmentation fault was fixed in cuda-python 11.7.1; however, that requires cudf 22.12 or above.

We had a discussion about this before, here: https://github.com/NVIDIA-Merlin/core/pull/244#discussion_r1134085639

oliverholworthy commented 1 year ago

Having the cudf imports wrapped with HAS_GPU means we are coupling the idea of having at least one visible device with the ability to import cudf, for example.

oliverholworthy commented 1 year ago

I've been recently working toward a similar goal in https://github.com/NVIDIA-Merlin/core/pull/244 , https://github.com/NVIDIA-Merlin/core/pull/243 and https://github.com/NVIDIA-Merlin/core/pull/236

oliverholworthy commented 1 year ago

Part of those changes involved adding checks for HAS_GPU where relevant, alongside checks of cudf. At one point HAS_GPU was the variable we were using to include both whether we have a GPU and whether cudf is importable. We split that out in https://github.com/NVIDIA-Merlin/core/pull/211. And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.

One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and the reporting of errors, for example in the Dataset. I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may cause some confusion if compat.cudf also encodes the device check. On the other hand, being able to refer to compat.cudf without also checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability alone (independent of visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.
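The trade-off being discussed can be made concrete with a hypothetical call site. `compat` below is a stand-in namespace, not the real merlin.core.compat module, and both function names are invented for illustration:

```python
from types import SimpleNamespace

# Simulated CPU-only scenario: no visible device, cudf not imported.
compat = SimpleNamespace(HAS_GPU=False, cudf=None)

def load_dataframe_explicit(path, cpu=False):
    """Separate checks: more verbose, but the error says *why* GPU mode failed."""
    if not cpu and not compat.HAS_GPU:
        raise RuntimeError("GPU requested but no visible CUDA devices were detected")
    if not cpu and compat.cudf is None:
        raise RuntimeError("GPU detected but cudf could not be imported")
    return f"loaded {path} on CPU"

def load_dataframe_collapsed(path):
    """Collapsed check: simpler, but can't distinguish 'no GPU' from 'no cudf'."""
    backend = "cudf" if compat.cudf is not None else "pandas"
    return f"loaded {path} with {backend}"
```

The explicit style matches the Dataset error-reporting concern above; the collapsed style is what gating cudf behind HAS_GPU in compat enables.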

jperez999 commented 1 year ago

Having the cudf imports wrapped with HAS_GPU means we are coupling the idea of having at least one visible device with the ability to import cudf, for example.

I mean, we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we can use cudf without a GPU?

jperez999 commented 1 year ago

Part of those changes involved adding checks for HAS_GPU where relevant, alongside checks of cudf. At one point HAS_GPU was the variable we were using to include both whether we have a GPU and whether cudf is importable. We split that out in #211. And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.

One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and the reporting of errors, for example in the Dataset. I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may cause some confusion if compat.cudf also encodes the device check. On the other hand, being able to refer to compat.cudf without also checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability alone (independent of visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.

So if there is ever a time we cannot detect a GPU, we should not import cudf. Chances are it would also fail on internal imports inside cudf anyway. We should make sure we detect GPUs reliably, and the dask nvml helpers should not have problems detecting them; I have not seen that failure mode. We are no longer using pynvml directly to detect GPUs. Since we now use nvml via dask, we are essentially offloading device detection to dask. I think that is okay, but I would like to get consensus. Thoughts?
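The detection described here, delegated to dask's nvml helpers, could look something like the sketch below. The module path and function name are assumptions based on this discussion, and any failure (library missing, nvml init error) is treated as "no GPUs":

```python
def visible_device_count():
    """Count visible GPUs via dask's nvml wrapper; fall back to 0 on any failure."""
    try:
        # Assumption: distributed ships an nvml diagnostics module with a
        # device-count helper, as referenced in this thread.
        from distributed.diagnostics import nvml
        return nvml.device_get_count()
    except Exception:
        # Covers ImportError and nvml initialization errors alike.
        return 0

HAS_GPU = visible_device_count() > 0
```

Treating every detection failure as zero devices is what makes the compat guard safe: worst case, the code falls back to the CPU path rather than crashing in the driver.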

jperez999 commented 1 year ago

@oliverholworthy in terms of the edge case you outlined https://github.com/NVIDIA-Merlin/core/blob/ec9a360485bc48e67bb6fe4b5c5ca092eca2a97c/merlin/io/dataset.py#L249-L261 we would first check whether HAS_GPU was true, and only if it was would we check on cudf. This is exactly what happens in compat: we first decide whether you have GPUs, and only then try to import the package. I don't see how this is an edge case. The only thing we could change is the error message, since we now rely on nvml via dask instead of pynvml.

oliverholworthy commented 1 year ago

we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we can use cudf without a GPU?

Yeah, that's true. Aside from the segfault in earlier versions of cudf, you still can't use cudf without a GPU even if it's importable.

oliverholworthy commented 1 year ago

I'm on board with this change to wrap the imports in compat with HAS_GPU.