Open yinjiaoyuan opened 1 year ago
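It seems that after I close the SecureCRT terminal, GPU memory usage climbs sharply until the card is full, and the service can then no longer complete requests: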
Mon May 29 22:12:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P0    26W /  70W |  13057MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     22580      C   python                          13053MiB |
+-----------------------------------------------------------------------------+
GPU memory usage when the SecureCRT terminal is left open:
Mon May 29 22:16:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    32W /  70W |   8343MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     23416      C   python                           8339MiB |
+-----------------------------------------------------------------------------+
My impression is that, once the SecureCRT terminal has been closed, running image understanding a few times leaks GPU memory.
Log at the time of the failure:
2023-05-29 23:47:01,781 - /opt/VisualGLM-6B/web_demo.py[line:54] - INFO: error: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.61 GiB total capacity; 13.24 GiB already allocated; 43.81 MiB free; 13.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
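As a rough check of the figures in this log line, the snippet below extracts the allocated and reserved sizes and computes the gap between them. It is an illustrative sketch only, not part of VisualGLM-6B, and the helper name reserved_minus_allocated_mib is hypothetical. The gap comes out at roughly 0.29 GiB, so reserved memory is not much larger than allocated memory here, which suggests the max_split_size_mb fragmentation hint may not be the main factor.

```python
import re

# The out-of-memory message copied from the log line above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.61 GiB total "
    "capacity; 13.24 GiB already allocated; 43.81 MiB free; 13.53 GiB reserved "
    "in total by PyTorch)"
)


def reserved_minus_allocated_mib(message):
    """Return (reserved - allocated) in MiB, parsed from a PyTorch OOM message.

    Hypothetical helper: it assumes both sizes are reported in GiB,
    as they are in the message above.
    """
    allocated = re.search(r"([\d.]+) GiB already allocated", message)
    reserved = re.search(r"([\d.]+) GiB reserved", message)
    if allocated is None or reserved is None:
        raise ValueError("message does not contain the expected GiB figures")
    return (float(reserved.group(1)) - float(allocated.group(1))) * 1024


gap_mib = reserved_minus_allocated_mib(OOM_MESSAGE)
print(f"reserved but not allocated: about {gap_mib:.0f} MiB")
```

A small gap like this is consistent with the card genuinely being full rather than with fragmentation, although the log line alone does not show why usage grew.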
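I log in to my GPU server over SSH from a SecureCRT terminal, switch to the VisualGLM-6B conda environment with conda activate VisualGLM-6B, and then start the demo in the background with python web_demo.py --quant 4 &. While the SecureCRT terminal stays open, recognizing image content from the browser works normally. As soon as I close the SecureCRT terminal and try to recognize an image again, the browser shows this message in the top-right corner: Something went wrong Expecting value: line 1 column 1 (char 0)

This is hard to trace, because once the SecureCRT terminal is closed I can no longer see the logs. What causes this problem? Are there any logs that could help diagnose it? Thanks.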
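One way to keep the logs available after the terminal closes is to start the demo in its own session with its output redirected to a file. The following is a minimal sketch, assuming a POSIX system. The helper name launch_detached and the log file name web_demo.log are hypothetical, and it is not confirmed that the terminal closing is the actual cause of the failure.

```python
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start cmd in a new session, appending stdout and stderr to log_path."""
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # do not read from the terminal
            stdout=log_file,            # write output to the file, not the terminal
            stderr=subprocess.STDOUT,   # merge stderr into the same file
            start_new_session=True,     # POSIX: run the child in its own session
        )
    finally:
        # The child holds its own copy of the file descriptor.
        log_file.close()
    return proc


if __name__ == "__main__":
    # Assumed invocation, taken from the command described in this report.
    demo = launch_detached(
        [sys.executable, "web_demo.py", "--quant", "4"], "web_demo.log"
    )
    print(f"started PID {demo.pid}; output is appended to web_demo.log")
```

With the output in a file, the traceback behind the browser error can be read after reconnecting, for example with tail -f web_demo.log.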