ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: On a 3 GPU System [A-C] not using CUDA_VISIBLE_DEVICES but using tensor split [1,1,0] should not allocate ANY memory on GPU C #8827

Closed ExtReMLapin closed 1 month ago

ExtReMLapin commented 3 months ago

What happened?

Hello. In some configurations, using CUDA_VISIBLE_DEVICES is not an option. Even if you set the environment variable right before starting llama.cpp (through the Python bindings), if CUDA has already been initialized in the main process, it stays stuck on the previous CUDA_VISIBLE_DEVICES value, so CUDA_VISIBLE_DEVICES is not the solution here.
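
For illustration, here is a minimal C++ sketch (not from the original report) of why setting the variable late has no effect: the CUDA runtime reads CUDA_VISIBLE_DEVICES once, at first initialization, and ignores later changes to the process environment.

```cpp
// Hypothetical repro: CUDA_VISIBLE_DEVICES is only read when the CUDA
// runtime initializes; changing it afterwards does nothing.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);              // first CUDA call initializes the runtime
    std::printf("devices before setenv: %d\n", count);

    // Too late: the runtime already enumerated the devices above.
    setenv("CUDA_VISIBLE_DEVICES", "0,1", 1);
    cudaGetDeviceCount(&count);
    std::printf("devices after setenv:  %d\n", count);  // unchanged
    return 0;
}
```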

If, through the tensor-split arguments/params, we decide to entirely ignore a GPU (for example -ts 1,1,0), llama.cpp will still allocate a few hundred MB on it.
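
For reference, the C-API equivalent of `-ts 1,1,0` looks roughly like this (a sketch assuming the llama_model_params field names in llama.h around this version; not code from the report):

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    const float split[3] = {1.0f, 1.0f, 0.0f};   // GPU 2 gets a zero share

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                   // offload all layers
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER;
    mparams.tensor_split = split;

    // Even with a zero share, nvidia-smi still reports a few hundred MiB
    // used by this process on GPU 2 (see the output below).
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```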

Example:

Before starting llama.cpp:


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
| 30%   33C    P8             27W /  450W |     383MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:03:00.0 Off |                  Off |
| 30%   34C    P8             31W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:68:00.0 Off |                  Off |
| 30%   33C    P8             26W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2407      G   /usr/lib/xorg/Xorg                            163MiB |
|    0   N/A  N/A      2551      G   /usr/bin/gnome-shell                           45MiB |
|    0   N/A  N/A      4088      G   ...irefox/4650/usr/lib/firefox/firefox         91MiB |
|    0   N/A  N/A      4696      G   ...erProcess --variations-seed-version         55MiB |
|    1   N/A  N/A      2407      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      2407      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

After starting llama.cpp:


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
| 30%   40C    P2            212W /  450W |    7837MiB /  24564MiB |     50%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:03:00.0 Off |                  Off |
| 30%   40C    P2            209W /  450W |    7023MiB /  24564MiB |     46%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:68:00.0 Off |                  Off |
| 30%   34C    P2             71W /  450W |     403MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2407      G   /usr/lib/xorg/Xorg                            163MiB |
|    0   N/A  N/A      2551      G   /usr/bin/gnome-shell                           45MiB |
|    0   N/A  N/A      4088      G   ...irefox/4650/usr/lib/firefox/firefox         91MiB |
|    0   N/A  N/A      4696      G   ...erProcess --variations-seed-version         55MiB |
|    0   N/A  N/A     74112      C   ./llama-cli                                  7448MiB |
|    1   N/A  N/A      2407      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A     74112      C   ./llama-cli                                  7006MiB |
|    2   N/A  N/A      2407      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A     74112      C   ./llama-cli                                   386MiB |
+-----------------------------------------------------------------------------------------+

Name and Version

~/llama.cpp$ ./llama-cli --version
version: 3504 (e09a800f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

slaren commented 3 months ago

I don't think the CUDA backend explicitly allocates any memory on the unused devices; it is more likely memory allocated by the CUDA runtime itself for things like kernels and globals.
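
A minimal sketch (an assumption for illustration, not from the thread) showing that merely creating a CUDA context on a device accounts for this kind of usage:

```cpp
// Creating a CUDA context (here via cudaSetDevice + the classic cudaFree(0)
// idiom) is enough for nvidia-smi to report a few hundred MiB on a device,
// even though this program never calls cudaMalloc.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaSetDevice(2);       // select the "unused" GPU
    cudaFree(0);            // forces context creation; loads kernels/globals
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    std::printf("free: %zu MiB / total: %zu MiB\n",
                free_b >> 20, total_b >> 20);
    std::getchar();         // keep the process alive to inspect with nvidia-smi
    return 0;
}
```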

Bzzzzzzz commented 2 months ago

Same here on Windows. If I set LLAMA_SPLIT_MODE_NONE, it allocates memory on my second GPU, and I assume it does this for all visible GPUs. If I set LLAMA_SPLIT_MODE_ROW with zeros for the ignored GPUs, it doesn't allocate any memory on them; instead, llama.cpp allocates a buffer in RAM the size of the KV buffer. In the output I see both CUDA_Host and CUDA1 buffers each time, whereas I would expect only the Host buffer.
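
For context, the expectation being described is roughly this (a hypothetical sketch assuming the llama.h names, not the reporter's code):

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // With LLAMA_SPLIT_MODE_NONE, only main_gpu should receive buffers,
    // yet the report above observes allocations on the second GPU too.
    llama_model_params mparams = llama_model_default_params();
    mparams.split_mode = LLAMA_SPLIT_MODE_NONE;  // single-GPU mode
    mparams.main_gpu   = 0;                      // everything should land here

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```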

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.