OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

CUDA DeviceAllocate segfault #1709

Open drzraf opened 1 month ago

drzraf commented 1 month ago
#0  0x00007bc0622c6554 in std::_Rb_tree_increment(std::_Rb_tree_node_base const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#1  0x00007bc05573e59a in cub::CachingDeviceAllocator::DeviceAllocate(int, void**, unsigned long, CUstream_st*) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#2  0x00007bc05573ea99 in ctranslate2::cuda::CubCachingAllocator::allocate(unsigned long, int) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#3  0x00007bc055712796 in ctranslate2::StorageView::reserve(long) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#4  0x00007bc0557127f8 in ctranslate2::StorageView::resize(std::vector<long, std::allocator<long> >) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#5  0x00007bc0556f59f2 in void ctranslate2::ops::MatMul::compute<(ctranslate2::Device)1, float>(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView&) const ()
   from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#6  0x00007bc055660d24 in ctranslate2::layers::dot_product_attention(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const*, ctranslate2::StorageView const*, ctranslate2::StorageView const*, ctranslate2::StorageView const*, long, ctranslate2::StorageView&, ctranslate2::StorageView*, bool, float, bool, bool, long, ctranslate2::layers::Alibi*, ctranslate2::StorageView*) () from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4
No symbol table info available.
#7  0x00007bc05566208d in ctranslate2::layers::MultiHeadAttention::operator()(ctranslate2::StorageView const&, ctranslate2::StorageView const&, ctranslate2::StorageView const*, ctranslate2::StorageView&, ctranslate2::StorageView*, ctranslate2::StorageView*, ctranslate2::StorageView*, ctranslate2::Padder const*, ctranslate2::Padder const*, bool, ctranslate2::StorageView*, long) const ()
   from /home/.local/lib/python3.10/site-packages/ctranslate2.libs/libctranslate2.so.4

CT2_VERBOSE=3 LD_LIBRARY_PATH=/home/.local/lib/python3.10/site-packages/ctranslate2.libs whisper-ctranslate2 --language=en --verbose=true --model small -f srt --output_dir /tmp/ foo.mp4

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce 940MX           Off |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8             N/A /  200W |    1988MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
minhthuc2502 commented 1 month ago

Hello, do you use quantization for the small model? Which compute type do you use? This looks like an out-of-memory problem: you don't have enough VRAM. nvidia-smi only shows the memory used up to the moment the program crashes; when the program tries to allocate more, it exceeds the 2 GB available.
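To see why 2 GB is tight, a rough back-of-the-envelope sketch (the ~244M parameter count for Whisper "small" is an approximation, and activation/workspace memory comes on top of the weights):

```python
# Rough VRAM footprint of just the model weights, ignoring activations,
# KV cache, and allocator overhead, which all come on top.
PARAMS_SMALL = 244_000_000  # approximate parameter count of Whisper "small"

def weight_gib(params: int, bytes_per_param: int) -> float:
    """Return the weight footprint in GiB for a given storage width."""
    return params * bytes_per_param / 1024**3

fp32 = weight_gib(PARAMS_SMALL, 4)  # float32: 4 bytes per parameter
int8 = weight_gib(PARAMS_SMALL, 1)  # int8 quantized: 1 byte per parameter
print(f"float32 weights: {fp32:.2f} GiB, int8 weights: {int8:.2f} GiB")
```

In float32 the weights alone take close to 1 GiB, so with the runtime, decoding buffers, and anything else already resident on a 2 GiB card, an allocation failure mid-inference is plausible; int8 cuts the weight footprint to roughly a quarter.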

drzraf commented 1 month ago

Both auto and default select float32:

[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info]  - Allow INT8: false
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info]  - Allow FP16: false (with Tensor Cores: false)
[2024-05-27 08:57:18.106] [ctranslate2] [thread 3417167] [info]  - Allow BF16: false
[2024-05-27 08:57:19.253] [ctranslate2] [thread 3417167] [info] Using CUDA allocator: cub_caching
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info]  - Binary version: 6
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info]  - Model specification revision: 3
[2024-05-27 08:57:19.995] [ctranslate2] [thread 3417167] [info]  - Selected compute type: float32

minhthuc2502 commented 1 month ago

Try int8 or float16 quantization. Your GPU is too small to run the medium model in float32, so this is expected. bfloat16 only works on GPUs with compute capability 8.x or newer (your GPU may be older than that).
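A minimal sketch of picking a memory-frugal compute type, assuming the preference order below (the `pick_compute_type` helper is hypothetical; in practice CTranslate2 exposes `ctranslate2.get_supported_compute_types("cuda")` to query what the device supports, and falls back automatically when a requested type is unavailable):

```python
def pick_compute_type(supported, preferred=("int8_float16", "int8", "float16", "float32")):
    """Return the first type from `preferred` that the device supports.

    `supported` would typically come from
    ctranslate2.get_supported_compute_types("cuda"); here it is just a set
    of strings so the logic can be shown without a GPU.
    """
    for compute_type in preferred:
        if compute_type in supported:
            return compute_type
    return "float32"  # safe default: always supported

# Example: a card without FP16 Tensor Cores may still support int8.
print(pick_compute_type({"int8", "float32"}))
```

The chosen string would then be passed as the compute type when loading the model (for the CLI in this thread, via its compute-type option, if available in your version).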