getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

CUDA_ERROR_INVALID_SOURCE error when running on different GPUs #259

Open jeanfeydy opened 2 years ago

jeanfeydy commented 2 years ago

(This issue is transferred from https://github.com/jeanfeydy/geomloss/issues/66, opened by @ismedina)

I am trying to use geomloss on the computer cluster at my institution. I can choose between several computing nodes with different GPUs. geomloss seems to work seamlessly on some GPUs (GTX980, GTX1080), but on others (RTX500, V100) running the sample code from the geomloss webpage fails.
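The script is essentially the standard SamplesLoss example (a minimal sketch on my side; the exact point counts and blur value are assumptions, not copied from the webpage):

import torch
from geomloss import SamplesLoss

# Two random point clouds on the current GPU
x = torch.randn(1000, 3, device="cuda")
y = torch.randn(2000, 3, device="cuda")

# Sinkhorn divergence with default uniform weights
loss = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)
L = loss(x, y)  # the call that fails on the RTX500/V100 nodes

On the RTX500 and V100 nodes, this call fails with the following error: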

[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_SOURCE

SamplesLoss()
Traceback (most recent call last):
  File "run-geomloss-samples.py", line 11, in <module>
    L = loss(x, y)  # By default, use constant weights = 1/number of samples
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/geomloss/samples_loss.py", line 265, in forward
    values = routines[self.loss][backend](
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/geomloss/sinkhorn_samples.py", line 656, in sinkhorn_multiscale
    f_aa, g_bb, g_ab, f_ba = sinkhorn_loop(
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/geomloss/sinkhorn_divergence.py", line 462, in sinkhorn_loop
    g_ab = damping * softmin(eps, C_yx, a_log)  # a -> b
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/geomloss/sinkhorn_samples.py", line 450, in softmin_multiscale
    return -eps * log_conv(
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 624, in __call__
    out = GenredAutograd.apply(
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 78, in forward
    myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/keopscore/utils/Cache.py", line 66, in __call__
    self.library[str_id] = self.cls(params, fast_init=True)
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
    self.init_phase2()
  File "/usr/users/medinasuarez/.local/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 23, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.

I am running the code on a Linux machine with Python 3.8, the latest version of geomloss and CUDA 11.5. Do you have any tips?
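In case it helps, here is how the relevant versions can be queried from Python on a given node (generic checks, not actual output from my cluster):

import torch
import pykeops

print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA toolkit it was built against
print(torch.cuda.get_device_name(0))           # GPU model exposed on this node
print(torch.cuda.get_device_capability(0))     # compute capability, e.g. (7, 0) for a V100
print(pykeops.__version__)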

Thanks a lot in advance :)

jeanfeydy commented 2 years ago

Hi @ismedina,

Thanks again for your report. Here are some hypotheses about your problem:

  1. KeOps stores compiled binaries in your home folder, in ~/.cache/pykeops2.1/.... What may happen is that your cache folder currently contains binaries that were compiled for the GTX980/1080 GPUs but are not suited to the RTX500 and the V100, which confuses the system. I don't know if we currently handle heterogeneous configurations as cleanly as we should (@bcharlier, @joanglaunes?). To check whether this is indeed the root cause of your issue, log on to a node with the RTX500 or the V100 and run the snippet below (a per-GPU cache workaround is also sketched after this list):
import pykeops
# Clear ~/.cache/pykeops2.1/...
pykeops.clean_pykeops()
# Rebuild from scratch the required binaries
pykeops.test_torch_bindings()
  2. There may be a strange issue with CUDA 11.5, or conflicting versions of CUDA installed on your system. I have experience with CUDA 10.2, 11.3 and 11.6, so this is pretty unlikely... but you never know, as CUDA updates sometimes introduce strange regressions. Does your institutional cluster allow you to use containers with Docker and/or Singularity? If yes, you may read our new installation instructions with containers: these should allow you to test your hardware configuration with a 100% clean and reproducible software stack.
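As a possible stop-gap on heterogeneous clusters (a rough sketch, assuming a pykeops 2.x release where pykeops.set_build_folder() is available), you could also point each GPU model to its own cache folder at the start of your jobs, so that binaries compiled for one architecture are never loaded on another:

import os
import torch
import pykeops

# One cache folder per GPU model, e.g. ~/.cache/keops_Tesla_V100-SXM2-16GB
gpu_name = torch.cuda.get_device_name(0).replace(" ", "_")
pykeops.set_build_folder(os.path.join(os.path.expanduser("~"), ".cache", "keops_" + gpu_name))

# Recompile and test the bindings in the GPU-specific folder
pykeops.test_torch_bindings()

This avoids wiping the cache every time you switch nodes, at the cost of one compilation pass per GPU model.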

Depending on your answers, we will try to investigate further :-)

Best regards,
Jean

ismedina commented 2 years ago

Hi Jean,

It was the first thing :) Cleaning the compiled binaries when changing GPUs solved the issue. Thanks a lot!

Best, Ismael

jeanfeydy commented 2 years ago

Hi @ismedina,

Great, thank you! I assume that @joanglaunes or @bcharlier will know how to fix this cleanly after the summer holidays :-)

Best regards,
Jean