Fatal Python error: Segmentation fault
The segmentation fault was fixed in cuda-python 11.7.1; however, that requires cudf 22.12 or above.
We had a discussion about this before here: https://github.com/NVIDIA-Merlin/core/pull/244#discussion_r1134085639
Having the cudf imports wrapped with HAS_GPU means we are encapsulating the idea that there is at least one visible device and that cudf is importable, for example.
I've recently been working toward a similar goal in https://github.com/NVIDIA-Merlin/core/pull/244 , https://github.com/NVIDIA-Merlin/core/pull/243 and https://github.com/NVIDIA-Merlin/core/pull/236 . Part of those changes involved adding checks for HAS_GPU where relevant alongside checks of cudf. At one point HAS_GPU was the variable we were using to include the information of both whether we have a GPU and whether cudf is importable. We split that out in https://github.com/NVIDIA-Merlin/core/pull/211 . And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.
One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and reporting of errors, here for example in the Dataset. And I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may result in some confusion if cudf includes this notion. On the other hand, being able to refer to compat.cudf without checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability (without regard to visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.
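A minimal sketch of the compat pattern being discussed, keeping the two signals separate (the use of distributed.diagnostics.nvml for device detection and the exact module layout are assumptions for illustration, not necessarily what this PR ships):

```python
# compat.py -- illustrative sketch only, not the exact implementation in this PR.
# Assumes dask's distributed.diagnostics.nvml module provides device_get_count(),
# as recent versions of distributed do.

HAS_GPU = False
cudf = None
cupy = None

try:
    from distributed.diagnostics import nvml

    # Treat "at least one visible CUDA device" as having a GPU.
    HAS_GPU = nvml.device_get_count() > 0
except Exception:
    # nvml can fail to initialize (no driver, restricted container, etc.);
    # in that case we fall back to CPU-only behaviour.
    HAS_GPU = False

if HAS_GPU:
    # Only attempt the GPU imports when a device is visible, so an
    # installed-but-unusable cudf doesn't crash at import time.
    try:
        import cudf
        import cupy
    except ImportError:
        cudf = None
        cupy = None
```

With this layout, tests can simply do `from merlin.core.compat import cudf` and branch on `cudf`; the trade-off noted above is that "cudf installed but no device visible" becomes indistinguishable from "cudf not installed".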
> Having the cudf imports wrapped with HAS_GPU means we are encapsulating the idea that there is at least one visible device and that cudf is importable, for example.
I mean, we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we can use cudf without a GPU?
> One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and reporting of errors. [...] I've seen cases where pynvml fails to initialize for some reason but cudf is still available.
So if there is ever a time we cannot detect a GPU, we should not import cudf. Chances are it will also fail on internal imports inside cudf. We should work to make sure that we detect GPUs, and dask's nvml should not have problems detecting GPUs; I have not seen that. We are no longer using pynvml to detect GPUs. Since we are now using nvml from dask, we are essentially offloading that device detection to dask. I think that is ok. I would like to get a consensus. Thoughts?
@oliverholworthy in terms of the edge case you outlined https://github.com/NVIDIA-Merlin/core/blob/ec9a360485bc48e67bb6fe4b5c5ca092eca2a97c/merlin/io/dataset.py#L249-L261 We would first check to see if HAS_GPU was available, and if it was, then we would check on cudf. This is exactly the same as what happens in compat: we first decide if you have GPUs, then we try to import the package. I don't see how this is an edge case. The only thing we could change is the error message, since we now rely on nvml in dask instead of pynvml.
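As a rough sketch of the "device first, library second" ordering described here (the function name and error message below are made up for illustration; see the linked dataset.py lines for the real code):

```python
# Illustrative sketch of checking HAS_GPU before checking cudf, mirroring the
# ordering described above; not the actual code at the linked dataset.py lines.
from merlin.core.compat import HAS_GPU, cudf


def resolve_cpu_mode(cpu=None):
    """Decide whether a Dataset should fall back to CPU, using the two-step check."""
    if cpu:
        return True  # the caller explicitly asked for CPU mode
    if not HAS_GPU:
        return True  # no visible CUDA device, so GPU mode is impossible
    if cudf is None:
        # A device was detected (via dask's nvml) but cudf could not be imported;
        # this is the spot where the error message would now mention nvml
        # rather than pynvml.
        raise RuntimeError("A GPU was detected but cudf could not be imported.")
    return False
```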
> We can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we can use cudf without a GPU?
Yeah, that's true, aside from the segfault in earlier versions of cudf. You still can't use cudf without a GPU even if it's importable.
I'm on board with this change to wrap the imports in compat with HAS_GPU.
This PR enables a user to run our GPU docker container without GPUs and still successfully run all tests. Previously we had try/except blocks in a few places in the code for importing GPU-specific packages. This PR removes those try/excepts because, in an environment where you have the package but no GPU, you end up getting a ccuda.pyx init failure. This is because the packages (i.e. cudf, dask_cudf, rmm) do exist, but when they try to access information about GPUs they fail and throw an error something like: Fatal Python error: Segmentation fault
This PR leverages the compat file and makes it the single point of import for the main packages (cudf, cupy), and it adds a safeguard around it that ensures you can only import those packages if GPUs are detected. So if you find yourself in a scenario where the package is installed but no GPUs are detected, you can now still safely use the core package. Therefore, we can now run our containers with and without the --gpus=all flag in Docker. This was a customer ask, and it helps developers when trying to test a CPU-only environment on a resource that has GPUs.
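As a usage sketch of what that buys downstream code (the module path and names follow the description above; the fallback helper itself is hypothetical):

```python
# Hypothetical consumer code: behaves the same whether the container was started
# with or without --gpus=all, because compat decides what is importable.
import pandas as pd

from merlin.core.compat import HAS_GPU, cudf


def make_dataframe(data):
    """Return a cudf DataFrame when a GPU and cudf are available, else pandas."""
    if HAS_GPU and cudf is not None:
        return cudf.DataFrame(data)
    return pd.DataFrame(data)


df = make_dataframe({"a": [1, 2, 3]})
print(type(df))  # a cudf DataFrame on a GPU machine, a pandas DataFrame otherwise
```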