TabbyML / tabby

Self-hosted AI coding assistant
https://tabbyml.com

Idle timeout for releasing GPU memory #1736

Open MalikKillian opened 7 months ago

MalikKillian commented 7 months ago

Please describe the feature you want

I use my computer to run Stable Diffusion (ComfyUI) as well as Tabby. I eventually noticed that ComfyUI runs very slowly when Tabby is also running. It appears that Tabby reserves GPU memory and only releases it when the [Docker] container is stopped (not merely paused). It's inconvenient to have to stop Tabby completely in order to use my GPU for other things.

Tabby should implement a configurable idle timeout: after X seconds without requests, Tabby releases its GPU memory and reserves it again only when a new request comes in. From simple observation, it doesn't take more than a few seconds to start a TabbyML container and receive a response, so I wonder whether there's any purpose at all to keeping the memory reserved.
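The requested behavior could be sketched roughly like this (a minimal, hypothetical Python illustration, not Tabby's actual implementation — Tabby itself is written in Rust, and the `load`/`unload` callables stand in for whatever actually allocates and frees the model's GPU memory): a watchdog thread releases the resource after a configurable idle period, and the request path lazily reloads it on demand.

```python
import threading
import time


class IdleUnloader:
    """Release a resource (e.g. a loaded model) after `timeout` idle
    seconds, reloading it lazily on the next request.

    `load` is a callable returning the resource (itself a callable that
    serves a request); `unload` releases it. Both are hypothetical
    stand-ins for real model load/free routines.
    """

    def __init__(self, load, unload, timeout):
        self._load = load
        self._unload = unload
        self._timeout = timeout
        self._resource = None
        self._last_used = time.monotonic()
        self._lock = threading.Lock()
        # Daemon watchdog polls a few times per timeout period.
        threading.Thread(target=self._watch, daemon=True).start()

    def handle(self, request):
        with self._lock:
            if self._resource is None:
                # Lazy (re)load: memory is only reserved while needed.
                self._resource = self._load()
            self._last_used = time.monotonic()
            return self._resource(request)

    def _watch(self):
        while True:
            time.sleep(self._timeout / 4)
            with self._lock:
                idle = time.monotonic() - self._last_used
                if self._resource is not None and idle >= self._timeout:
                    self._unload()
                    self._resource = None
```

The trade-off the issue points out is exactly the one this sketch exposes: the cost of an idle timeout is one reload delay on the first request after an idle period, which per the observation above is only a few seconds.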

Additional context

Tabby version: 0.9.1

Model: StarCoder-3B

Docker Desktop (Windows) version 4.19.0 (106363)

Output from nvidia-smi while Tabby container is idle (no requests for >60 seconds):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.40.06              Driver Version: 551.23         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    On  |   00000000:2B:00.0  On |                  N/A |
| 29%   31C    P8             17W /  175W |    4395MiB /   8192MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1      C   /tabby                                      N/A      |
|    0   N/A  N/A        32      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        35      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        38      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

Output from nvidia-smi while Tabby container is stopped:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.40.06              Driver Version: 551.23         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    On  |   00000000:2B:00.0  On |                  N/A |
| 29%   30C    P8             12W /  175W |     960MiB /   8192MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        32      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        35      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        38      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

Please reply with a 👍 if you want this feature.

wsxiaoys commented 7 months ago

See a prototype implementation here: https://github.com/TabbyML/tabby/issues/624#issuecomment-1778283444

MalikKillian commented 5 months ago

@wsxiaoys I'm not sure how I'm supposed to leverage this implementation. It doesn't seem to be the default behavior of the Docker container; as far as I can tell it's not even part of the image. 😞