ai-forever / ru-dalle

Generate images from text. In Russian.
https://rudalle.ru/
Apache License 2.0

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx #11

HetagKoroev opened this issue 2 years ago

HetagKoroev commented 2 years ago

I have a GTX 1660 SUPER with 6 GB of vRAM, Ubuntu 20.04, Python 3.9, Driver Version: 460.91.03, CUDA Version: 11.2

At this stage of generation (3%|███▏ | 27/1024) I get an error:

.../ru-dalle/main.py", line 31, in <module>
    _pil_images, _scores = generate_images(text, tokenizer, dalle, vae, top_k=top_k, images_num=images_num, top_p=top_p)
..........
.../ru-dalle/venv/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

The video memory used at the time of the error: 4893MiB / 5936MiB
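
For reference, the failing call in the traceback is an fp16 linear layer on CUDA; a minimal standalone sketch (plain PyTorch, arbitrary tensor sizes) that exercises the same cublasGemmEx path:

import torch
import torch.nn.functional as F

# Minimal sketch of the failing code path: an fp16 F.linear on CUDA goes
# through the same cublasGemmEx call shown in the traceback. The sizes are
# arbitrary; only the fp16 dtype + CUDA device combination matters.
x = torch.randn(32, 1024, device='cuda', dtype=torch.float16)
w = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)  # (out_features, in_features)
b = torch.zeros(1024, device='cuda', dtype=torch.float16)
out = F.linear(x, w, b)  # raises CUBLAS_STATUS_EXECUTION_FAILED on affected setups
print(out.float().abs().sum().item())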

Also, at the very beginning of generation, I get a warning: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
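
The warning itself is harmless but easy to silence; a sketch of the fix it asks for, assuming the floor division happens in the row-index arithmetic around rudalle/dalle/model.py:77 (the divisor here is a hypothetical value for illustration):

import torch

past_length, seq_len, divisor = 0, 16, 4  # hypothetical values
row_ids = torch.arange(past_length, seq_len + past_length)

old = row_ids // divisor                                    # emits the deprecation warning on integer tensors
trunc = torch.div(row_ids, divisor, rounding_mode='trunc')  # keeps the current `//` behavior (rounds toward 0)
floor = torch.div(row_ids, divisor, rounding_mode='floor')  # true floor division

assert torch.equal(old, trunc)  # identical for these non-negative indices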

piratetm commented 2 years ago

I have the same error

ghost commented 2 years ago

I get the same error when PyTorch is installed with pip, but it works with Anaconda.

It's not exactly the same issue as this one, but it seems related: https://github.com/pytorch/pytorch/issues/56747#issuecomment-825559343

shonenkov commented 2 years ago

@HetagKoroev you can try the examples with inference on a GPU with 3.5 GB of vRAM: https://github.com/sberbank-ai/ru-dalle/pull/51

Let me know if it helps, @HetagKoroev.

muhammadyusuf-kurbonov commented 2 years ago

> @HetagKoroev you can try the examples with inference on a GPU with 3.5 GB of vRAM: #51
>
> Let me know if it helps, @HetagKoroev.

Same error.

NVIDIA GTX 1060 Ti 4 GB, Zorin OS 16 (Ubuntu 20.04 base), Driver Version: 470.82.00, CUDA Version: 11.4

muhammadyusuf-kurbonov commented 2 years ago
ruDALL-E batch size: 1
Total GPU RAM: 3.82 Gb
CPU: 8
RAM GB: 7.6
PyTorch version: 1.10.0+cu102
CUDA version: 10.2
cuDNN version: 7605
Allowed GPU RAM: 3.5 Gb
GPU part 0.9162
◼️ Malevich is 1.3 billion params model from the family GPT3-like, that uses Russian language and text+image multi-modality.
tokenizer --> ready
Working with z of shape (1, 256, 32, 32) = 262144 dimensions.
vae --> ready
ruclip --> ready
  0%|          | 1/1024 [00:04<1:23:09,  4.88s/it]/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py:77: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  row_ids = torch.arange(past_length, input_shape[-1] + past_length,
  3%|▎         | 27/1024 [00:07<04:46,  3.48it/s]
Traceback (most recent call last):
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 113, in <module>
    codebooks += generate_codebooks(text, tokenizer, dalle, top_k=top_k, images_num=images_num, top_p=top_p,
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 87, in generate_codebooks
    logits, has_cache = dalle(out, attention_mask,
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 51, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py", line 122, in forward
    logits = self.to_logits(transformer_output)
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/mnt/d_drive/Projects/AI/ru-dalle/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
shonenkov commented 2 years ago

@muhammadyusuf-kurbonov Could you try reinstalling torch as version 1.7.1 with CUDA 10.2?

# CUDA 10.2
pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
muhammadyusuf-kurbonov commented 2 years ago

> @muhammadyusuf-kurbonov Could you try reinstalling torch as version 1.7.1 with CUDA 10.2?
>
> # CUDA 10.2
> pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2

That didn't help! :disappointed:

ghost commented 2 years ago

Install torch with Anaconda. That solved the problem for me.

muhammadyusuf-kurbonov commented 2 years ago

> Install torch with Anaconda. That solved the problem for me.

That didn't help either :disappointed:

ghost commented 2 years ago

@muhammadyusuf-kurbonov Do you get the same error message CUBLAS_STATUS_EXECUTION_FAILED with Anaconda? Or a different "out of memory" message?

muhammadyusuf-kurbonov commented 2 years ago
/mnt/d_drive/Projects/AI/Anaconda/bin/python /mnt/d_drive/Projects/AI/ru-dalle/main.py
ruDALL-E batch size: 1
Total GPU RAM: 3.82 Gb
CPU: 8
RAM GB: 7.6
PyTorch version: 1.10.0+cu102
CUDA version: 10.2
cuDNN version: 7605
Allowed GPU RAM: 3.5 Gb
GPU part 0.9162
◼️ Malevich is 1.3 billion params model from the family GPT3-like, that uses Russian language and text+image multi-modality.
tokenizer --> ready
Working with z of shape (1, 256, 32, 32) = 262144 dimensions.
vae --> ready
ruclip --> ready
  0%|          | 1/1024 [00:03<1:05:06,  3.82s/it]/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py:77: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  row_ids = torch.arange(past_length, input_shape[-1] + past_length,
  3%|▎         | 27/1024 [00:06<04:08,  4.01it/s]
Traceback (most recent call last):
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 113, in <module>
    codebooks += generate_codebooks(text, tokenizer, dalle, top_k=top_k, images_num=images_num, top_p=top_p,
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 87, in generate_codebooks
    logits, has_cache = dalle(out, attention_mask,
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 51, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py", line 122, in forward
    logits = self.to_logits(transformer_output)
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/muhammadyusuf/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

With Anaconda I tried both cudatoolkit versions (10.2 and 11.3).

ghost commented 2 years ago

@muhammadyusuf-kurbonov

Use Anaconda to set up your environment like this:

conda create --name rudalle
conda activate rudalle
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c conda-forge transformers youtokentome omegaconf einops matplotlib psutil
pip install taming-transformers more_itertools PyWavelets
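
After creating the environment, a quick sanity check (standard PyTorch introspection calls only, nothing rudalle-specific) confirms which build got installed and whether it sees the GPU:

import torch

# Environment sanity check: verify the installed wheel matches expectations.
print(torch.__version__)            # wheel version, e.g. 1.10.0
print(torch.version.cuda)           # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())    # must be True for GPU inference
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.backends.cudnn.version())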
muhammadyusuf-kurbonov commented 2 years ago

> @muhammadyusuf-kurbonov
>
> Use Anaconda to set up your environment like this:
>
> conda create --name rudalle
> conda activate rudalle
> conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
> conda install -c conda-forge transformers youtokentome omegaconf einops matplotlib psutil
> pip install taming-transformers more_itertools PyWavelets

(rudalle) muhammadyusuf@muhammadyusuf-IdeaPad-Gaming-3-15IMH05:/mnt/d_drive/Projects/AI/ru-dalle$ python main.py
ruDALL-E batch size: 1
Total GPU RAM: 3.82 Gb
CPU: 8
RAM GB: 7.6
PyTorch version: 1.10.0
CUDA version: 11.3
cuDNN version: 8200
Allowed GPU RAM: 3.5 Gb
GPU part 0.9162
◼️ Malevich is 1.3 billion params model from the family GPT3-like, that uses Russian language and text+image multi-modality.
tokenizer --> ready
Working with z of shape (1, 256, 32, 32) = 262144 dimensions.
vae --> ready
ruclip --> ready
  0%|▏         | 1/1024 [00:11<3:21:52, 11.84s/it]/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py:77: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  row_ids = torch.arange(past_length, input_shape[-1] + past_length,
  4%|████████▏ | 43/1024 [00:20<07:41,  2.13it/s]
Traceback (most recent call last):
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 113, in <module>
    codebooks += generate_codebooks(text, tokenizer, dalle, top_k=top_k, images_num=images_num, top_p=top_p,
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 87, in generate_codebooks
    logits, has_cache = dalle(out, attention_mask,
  File "/mnt/d_drive/Projects/AI/Anaconda/envs/rudalle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 51, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/mnt/d_drive/Projects/AI/Anaconda/envs/rudalle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py", line 119, in forward
    transformer_output, present_has_cache = self.transformer(
  File "/mnt/d_drive/Projects/AI/Anaconda/envs/rudalle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/transformer.py", line 94, in forward
    hidden_states, present_has_cache = layer(hidden_states, mask, has_cache=has_cache, use_cache=use_cache)
  File "/mnt/d_drive/Projects/AI/Anaconda/envs/rudalle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/transformer.py", line 187, in forward
    output = layernorm_input + mlp_output
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 3.82 GiB total capacity; 2.54 GiB already allocated; 6.12 MiB free; 3.50 GiB allowed; 2.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF



The error has changed :smile:
ghost commented 2 years ago

@muhammadyusuf-kurbonov You can try setting use_cache=False in the generate_codebooks step, as suggested here: https://github.com/sberbank-ai/ru-dalle/issues/18#issuecomment-967176880

However, you should also try running it with fp16=False, use_cache=True on device='cpu'. For me, that generates images 4-5x faster than device='cuda' without the cache.
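
Put together, the two workarounds look roughly like this, a sketch assuming the published rudalle API from the README (get_rudalle_model, get_tokenizer, get_vae, generate_images); whether generate_images accepts a use_cache keyword may vary between package versions, so treat that argument as an assumption:

from rudalle import get_rudalle_model, get_tokenizer, get_vae
from rudalle.pipelines import generate_images

# Workaround 2: full fp32 on CPU avoids the fp16 cuBLAS path entirely.
# (Workaround 1, for a low-vRAM GPU: keep device='cuda' and pass use_cache=False.)
device = 'cpu'
dalle = get_rudalle_model('Malevich', pretrained=True, fp16=False, device=device)
tokenizer = get_tokenizer()
vae = get_vae().to(device)

pil_images, scores = generate_images(
    'авокадо в форме кресла',  # example prompt ("an avocado-shaped armchair")
    tokenizer, dalle, vae,
    top_k=512, top_p=0.99, images_num=1,
    use_cache=True,  # assumption: forwarded to the model; set False for workaround 1
)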

muhammadyusuf-kurbonov commented 2 years ago

Without the cache it runs up to 41% :clap: :clap: :clap:

(rudalle) muhammadyusuf@muhammadyusuf-IdeaPad-Gaming-3-15IMH05:/mnt/d_drive/Projects/AI/ru-dalle$ python main.py
ruDALL-E batch size: 1
super-resolution: False
Total GPU RAM: 3.82 Gb
CPU: 8
RAM GB: 7.6
PyTorch version: 1.10.0
CUDA version: 11.3
cuDNN version: 8200
Allowed GPU RAM: 3.5 Gb
GPU part 0.9162
◼️ Malevich is 1.3 billion params model from the family GPT3-like, that uses Russian language and text+image multi-modality.
tokenizer --> ready
Working with z of shape (1, 256, 32, 32) = 262144 dimensions.
vae --> ready
ruclip --> ready
  0%|                                        | 1/1024 [00:12<3:26:18, 12.10s/it]/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/model.py:77: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  row_ids = torch.arange(past_length, input_shape[-1] + past_length,
 41%|████████████████▌                       | 423/1024 [17:46<25:14,  2.52s/it]
Traceback (most recent call last):
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 115, in <module>
    codebooks += generate_codebooks(text, tokenizer, dalle, top_k=top_k, images_num=images_num, top_p=top_p, bs=DALLE_BS)
  File "/mnt/d_drive/Projects/AI/ru-dalle/main.py", line 90, in generate_codebooks
    logits, has_cache = dalle(out, attention_mask,
  File "/mnt/d_drive/Projects/AI/Anaconda/envs/rudalle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 51, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 42, in fp16_to_fp32
    return conversion_helper(val, float_conversion)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 15, in conversion_helper
    rtn = [conversion_helper(v, conversion) for v in val]
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 15, in <listcomp>
    rtn = [conversion_helper(v, conversion) for v in val]
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 14, in conversion_helper
    return conversion(val)
  File "/mnt/d_drive/Projects/AI/ru-dalle/rudalle/dalle/fp16.py", line 40, in float_conversion
    val = val.float()
RuntimeError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 3.82 GiB total capacity; 2.53 GiB already allocated; 40.12 MiB free; 3.50 GiB allowed; 2.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory
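
The OOM message itself points at allocator fragmentation via max_split_size_mb; PYTORCH_CUDA_ALLOC_CONF is the standard PyTorch knob for this, though the 128 MB value below is only a guess to tune:

import os

# Apply the max_split_size_mb hint from the OOM message. The variable must be
# set before the first CUDA allocation, so set it before importing torch.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch
torch.cuda.empty_cache()  # between runs, also releases reserved-but-unused blocks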
muhammadyusuf-kurbonov commented 2 years ago

It works on the CPU. But the difference in performance is not that big (the CPU is an Intel i5-10300H).

gwyanCN commented 2 years ago

I guess your GPU memory is not enough. You could try running your code on a GPU with more memory, or use a smaller model.