Open rjzamora opened 11 months ago
cc @jperez999 @karlhigley
@jperez999 - Do you think this line is actually necessary? If we already have HAS_GPU
from the line above, maybe we can just do:

```python
if not HAS_GPU:
    cuda = None
```
Besides the above, I was looking at the code in more detail, and I see the following block: https://github.com/NVIDIA-Merlin/core/blob/6e52b48140615708b59926b5f9c3601f8feeab93/merlin/core/compat/__init__.py#L102-L105
This creates a new context on a GPU just to query the memory size, and a CUDA context should never be created before Dask initializes the cluster. Also note that there is an equivalent code block in `pynvml_mem_size`: https://github.com/NVIDIA-Merlin/core/blob/6e52b48140615708b59926b5f9c3601f8feeab93/merlin/core/compat/__init__.py#L57-L60
The PyNVML code will NOT create a CUDA context and is safe to run before Dask. Is there a reason why you're using the Numba code block to query GPU memory instead of always using PyNVML?
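For reference, the PyNVML path can be sketched roughly as follows. This is a minimal sketch mirroring the intent of the linked `pynvml_mem_size`, not the exact Merlin code; the graceful-degradation guards are my own assumption so the snippet also runs on machines without a GPU:

```python
# Sketch (assumption): query GPU memory via PyNVML, which talks to the NVIDIA
# driver directly and does NOT create a CUDA context (unlike Numba's
# cuda.current_context()), so it is safe to call before Dask starts workers.
def pynvml_mem_size(kind="total", index=0):
    """Return GPU memory in bytes, or None when PyNVML/driver is unavailable."""
    try:
        import pynvml
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        size = getattr(info, kind)  # "total", "free", or "used"
        pynvml.nvmlShutdown()
        return int(size)
    except Exception:
        return None
```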
As pointed out by @oliverholworthy in https://github.com/NVIDIA-Merlin/core/pull/274#discussion_r1160187901, `cuda.is_available()` is used in `merlin.core.compat` to check for CUDA support. Unfortunately, this is a known problem for dask-cuda. This most likely means that Merlin/NVTabular has not worked properly with Dask-CUDA for more than six months now. For example, the following code will produce an OOM error for 32GB V100s:
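(The original snippet did not survive extraction. A hypothetical reconstruction of the failure mode is sketched below; the use of `LocalCUDACluster` and the `run_repro` wrapper are my assumptions, not the snippet from the issue.)

```python
# Hypothetical repro sketch (NOT the original code from the issue).
# Importing merlin.core.compat triggers cuda.is_available(), which creates a
# CUDA context in the parent process; dask-cuda workers then conflict with
# that pre-existing context, which can surface as an OOM error on the GPU.
def run_repro():
    try:
        import merlin.core.compat  # noqa: F401  (side effect: CUDA context)
        from dask_cuda import LocalCUDACluster
        from distributed import Client
    except ImportError:
        return None  # dependencies unavailable; nothing to demonstrate
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        # Any work dispatched to the workers can now hit the OOM described above.
        return client.run(lambda: "worker ok")
```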
You will also see an error if you don't import any merlin/nvt code, but use the offending `cuda.is_available()` command. Meanwhile, the code works fine if you don't use the offending command or import code that also imports `merlin.core.compat`:
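(That snippet was also lost in extraction. For completeness, a context-free availability check along the lines suggested above might look like this; `gpu_available` is a hypothetical helper name, not a Merlin API:)

```python
# Sketch: detect GPUs via the NVIDIA driver (PyNVML) without touching the
# CUDA runtime, so no context is created before dask-cuda starts its workers.
def gpu_available():
    try:
        import pynvml
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return count > 0
    except Exception:
        return False  # no driver / no PyNVML -> treat as CPU-only
```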