RuntimeError: CUDA error: no kernel image is available for execution on the device

PaulineJac commented 2 years ago

I got this error when trying to train my model:

(Topenv) [jacsont@gpu020.baobab GT_test]$  ketos train -f alto -d cuda Top_test_bin_alto/*xml
WARNING:root:Torch version 1.10.0+cu102 has not been tested with coremltools. You may run into unexpected errors. Torch 1.9.1 is the most recent version that has been tested.
Building training set  [####################################]  184/184          
Building validation set  [####################################]  34/34          [143.7313] alphabet mismatch: chars in training set only: {'*', 'J', 'T', '5', 'σ', 'Q', '?', '8', 'ò', 'v', '›', 'œ', ']', '6', 'Η', '⸤', 'Τ', '̆', '0', '(', ';', '∧', 'ŭ', '1', 'ë', '῀', 'ρ', 'U', 'ὸ', '‹', 'ἀ', ')', '7', '4', 'π', 'Ꝝ', '[', 'ā', '3', 'ς', 'ↄ', '̀', 'Υ', 'ᾶ', 'Z', 'Κ', 'Æ'} (not included in accuracy test during training) 
Initializing model ✓
stage 1/∞  [------------------------------------]  0/184
Traceback (most recent call last):
  File "/home/users/j/jacsont/Topenv/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/kraken/ketos.py", line 603, in train
    trainer.run(_print_eval, _draw_progressbar)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/kraken/lib/train.py", line 457, in run
    self.model.to(self.device)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/kraken/lib/vgsl.py", line 174, in to
    self.nn = self.nn.to(device)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 189, in _apply
    self.flatten_parameters()
  File "/home/users/j/jacsont/Topenv/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 175, in flatten_parameters
    torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

pkzli commented 2 years ago

Hi, I suspect the installed torch module does not include compiled kernels for A100 GPUs. Could you try either running the task on Yggdrasil or adding the parameter --exclude=gpu[020,022] when requesting resources, so that the job avoids the A100 GPUs? If this solves the problem, we'll have to see whether we can compile this module manually.

Edit: you can check the GPU types of the compute nodes here: https://doc.eresearch.unige.ch/hpc/hpc_clusters#compute_nodes
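
To confirm this suspicion on a given node, one option is to compare the GPU's compute capability with the architectures the installed torch build was compiled for. The sketch below is only an assumption about how such a check could look: `is_supported` and `main` are hypothetical helper names, while `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing torch calls. It treats an `sm_XY` entry as usable when the major version matches and the minor version is not higher than the device's, and a `compute_XY` (PTX) entry as usable on any equal or newer device.

```python
import re

# Matches entries such as "sm_70", "sm_86" or "compute_37".
_ARCH_RE = re.compile(r"^(sm|compute)_(\d+)(\d)[a-z]?$")


def is_supported(capability, arch_list):
    """Return True if a (major, minor) capability is covered by arch_list.

    Hypothetical helper: a simplified reading of CUDA compatibility rules,
    not an official torch API.
    """
    major, minor = capability
    for entry in arch_list:
        match = _ARCH_RE.match(entry)
        if match is None:
            continue
        kind = match.group(1)
        arch = (int(match.group(2)), int(match.group(3)))
        # Binary kernels run on devices with the same major version
        # and an equal or higher minor version.
        if kind == "sm" and arch[0] == major and arch[1] <= minor:
            return True
        # PTX can be JIT-compiled for any equal or newer device.
        if kind == "compute" and arch <= (major, minor):
            return True
    return False


def main():
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("no CUDA device is visible")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0), "capability:", capability)
    print("torch built for:", arch_list)
    print("supported:", is_supported(capability, arch_list))


if __name__ == "__main__":
    main()
```

An A100 reports compute capability (8, 0), so a build whose architecture list stops at `sm_70` and contains no `compute_` entry would report `supported: False`, which is consistent with the "no kernel image is available" error above.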

gabays commented 2 years ago

Should we add this parameter to the submission script, then? It would start with:

#!/usr/bin/env bash
#SBATCH --partition=shared-gpu
#SBATCH --time=01:00:00
#SBATCH --gpus=1
#SBATCH --output=kraken-%j.out
#SBATCH --mem=0
#SBATCH --exclude=gpu[020,022]

pkzli commented 2 years ago

Yes, or use Yggdrasil, since torch seems to work on all of Yggdrasil's GPUs.

PaulineJac commented 2 years ago

Thank you very much, everything works with Yggdrasil.
I just trained my first model 🥳!

FoNDUE-HTR / Documentation
