ExtReMLapin closed this issue 1 month ago.
I don't think the CUDA backend explicitly allocates any memory for the unused devices, it may be memory allocated by the CUDA runtime for things like kernels and globals.
Same here on Windows. If I set LLAMA_SPLIT_MODE_NONE, it allocates memory on my second GPU, and I assume it does this for all visible GPUs. If I set LLAMA_SPLIT_MODE_ROW with zeros for the ignored GPUs, it doesn't allocate any memory on them; instead, llama.cpp allocates a buffer in RAM whose size is the KV buffer size. In the output I see CUDA_Host and CUDA1 buffers each time, whereas I assume there should only be a Host buffer.
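For reference, a minimal sketch of that configuration, assuming the llama-cpp-python bindings and an illustrative two-GPU setup (the model path is hypothetical):

```python
# Minimal sketch, assuming llama-cpp-python; model path and GPU count are
# illustrative. With LLAMA_SPLIT_MODE_ROW and a zero share for the second
# GPU, the report above says no VRAM is allocated on it, but a CUDA_Host
# (RAM) buffer the size of the KV cache shows up instead.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="model.gguf",                    # hypothetical path
    n_gpu_layers=-1,                            # offload all layers
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_ROW,  # row-wise split
    tensor_split=[1.0, 0.0],                    # zero share for the ignored GPU
    main_gpu=0,
)
```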
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
Hello. In some configurations, using CUDA_VISIBLE_DEVICES is not something we can consider. Even if you set a temporary environment variable right before starting llama.cpp (through the Python bindings), if CUDA has already been used once in the main process, it stays stuck on the previous CUDA_VISIBLE_DEVICES value, so CUDA_VISIBLE_DEVICES is not the solution here.
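To illustrate the constraint, a hedged sketch assuming the llama-cpp-python bindings: the variable only takes effect if it is set before the CUDA runtime is first initialized in the process, which is exactly why it cannot be relied on here.

```python
import os

# CUDA_VISIBLE_DEVICES must be set before the CUDA runtime is initialized
# in this process; once CUDA has been used, later changes are ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama  # CUDA may be initialized from this point on

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # hypothetical path

# Changing os.environ["CUDA_VISIBLE_DEVICES"] here has no effect, because
# the runtime has already enumerated the devices.
```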
If, through the tensor-split arguments/params, we decide to entirely ignore a GPU (for example -ts 1,1,0), it will still allocate a few hundred MB on it. Example:
Before starting llama.cpp:
After:
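As a rough sketch of the reported setup through the Python bindings mentioned above (llama-cpp-python is assumed; the model path and three-GPU layout are illustrative):

```python
# Rough sketch of the reported configuration, assuming llama-cpp-python;
# the path is hypothetical. The third GPU gets a zero share of the tensor
# split (equivalent to -ts 1,1,0), yet a few hundred MB are reportedly
# still allocated on it.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="model.gguf",       # hypothetical path
    n_gpu_layers=-1,               # offload all layers
    tensor_split=[1.0, 1.0, 0.0],  # same as -ts 1,1,0 on the CLI
)
```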
Name and Version
What operating system are you seeing the problem on?
Linux
Relevant log output
No response