ceruleandeep / ComfyUI-LLaVA-Captioner

A ComfyUI extension for chatting with your images with LLaVA. Runs locally, no external services, no filter.

[Bug] LLaVA Captioner appears to leak VRAM #11

Open curiousjp opened 5 months ago

curiousjp commented 5 months ago

Hi - thank you for this excellent node. I use it as part of a two-stage SDXL workflow: an image prompt is generated using a grammar, the resulting image is analysed by LLaVA to extract a derived prompt, and that derived prompt is then used for a second round of generation.

I've noticed that when running long batches of queued prompts, this node appears to leak VRAM. I enclose a small sample workflow that generates a random 16x16 tile and interrogates it.

After freshly booting ComfyUI, nvidia-smi shows about 1.4 GB of VRAM in use:

Fri Mar 15 19:34:42 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.23                 Driver Version: 551.23         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060      WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   48C    P8             25W /  170W |    1405MiB /  12288MiB |     21%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

After queuing the above workflow 30 times, VRAM usage rises to about 8 GB:

Fri Mar 15 19:37:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.23                 Driver Version: 551.23         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060      WDDM  |   00000000:01:00.0  On |                  N/A |
| 30%   45C    P8             25W /  170W |    7897MiB /  12288MiB |     27%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

As you can guess, in real use this eventually starts to crowd out my SDXL checkpoints, which are forced into low-VRAM mode. Restarting ComfyUI releases the memory as expected.

It seems likely to me that the problem is actually in llama-cpp-python; there have been a number of threads online about potential VRAM leaks there, but I'm not able to conclusively work out whether this is one of them. One suggestion online has been to offload the inference task into a separate process (using the multiprocessing module or a separate Python script, I guess), but this looked non-trivial and I wanted to get your thoughts on it before attempting to build a solution and PR it.
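For illustration, a minimal sketch of what the multiprocessing route might look like, assuming llama-cpp-python's Llava15ChatHandler API; caption_in_subprocess and the parameter names are placeholders rather than anything from this node:

```python
# Sketch only: run one LLaVA inference in a short-lived child process so that
# every CUDA allocation made by llama-cpp-python dies with that process.
import multiprocessing as mp


def _worker(model_path, clip_path, prompt, image_b64, queue):
    # Import inside the child so the CUDA context only ever exists here.
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava15ChatHandler

    llm = Llama(
        model_path=model_path,
        chat_handler=Llava15ChatHandler(clip_model_path=clip_path),
        n_ctx=2048,
        n_gpu_layers=-1,
        logits_all=True,  # older llama-cpp-python builds need this for LLaVA
        verbose=False,
    )
    result = llm.create_chat_completion(messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + image_b64}},
            {"type": "text", "text": prompt},
        ],
    }])
    queue.put(result["choices"][0]["message"]["content"])


def caption_in_subprocess(model_path, clip_path, prompt, image_b64):
    ctx = mp.get_context("spawn")  # don't inherit the parent's CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker,
                       args=(model_path, clip_path, prompt, image_b64, queue))
    proc.start()
    caption = queue.get()  # blocks until the child finishes inference
    proc.join()            # child exit hands its VRAM back to the driver
    return caption
```

The obvious downside is a full model reload per caption, which is presumably part of why this looked non-trivial to adopt wholesale.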

In the meantime, is there some way to force llama-cpp-python to run this on CPU only without rebuilding the library, and to make that a toggle option on the node itself?
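For reference, llama-cpp-python does accept an n_gpu_layers argument at load time, so a CPU-only toggle could in principle be wired up without rebuilding the library. A rough sketch, where the use_gpu input name is hypothetical:

```python
# Sketch only: map a hypothetical use_gpu toggle on the node onto
# llama-cpp-python's n_gpu_layers parameter. 0 keeps all LLM layers on the CPU;
# -1 offloads as many layers as possible to the GPU.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def load_llava(model_path, clip_path, use_gpu=True):
    return Llama(
        model_path=model_path,
        chat_handler=Llava15ChatHandler(clip_model_path=clip_path),
        n_ctx=2048,
        n_gpu_layers=-1 if use_gpu else 0,  # 0 = CPU-only for the language model
    )
```

Note that the CLIP/mmproj model loaded by the chat handler may still use the GPU on a CUDA build, so a CPU toggle might not remove all VRAM usage.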

RangerFrank commented 5 months ago

Running into the same VRAM leak. I'm using multiple LLaVA nodes to interrogate an image three ways, and processing images in batch or in sequence eventually results in a CUDA out-of-memory error.

2024-03-26 13:25:14.9261455 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Conv node. Name:'StatefulPartitionedCall/ConvNextBV2/block0_cell0_conv2d_02/BiasAdd' Status Message: D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:114 onnxruntime::CudaCall CUDA failure 2: out of memory ; GPU=0 ; hostname=DESKTOP-F006S5I ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);

curiousjp commented 5 months ago

I did eventually write a free-standing Python script that reads a prompt from stdin and writes the results to stdout, and then modified the captioner to run it via subprocess - the system reclaims the VRAM when that process exits. It's not especially clean or neat, but I can gist it later if it would be useful.
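For anyone who wants the general shape before the gist lands: a minimal sketch of such a worker script, assuming llama-cpp-python's LLaVA chat handler and a made-up single-JSON-request protocol on stdin/stdout (the field names are illustrative, not the gist's):

```python
# Sketch only: read one JSON request from stdin, run LLaVA via llama-cpp-python,
# write the caption to stdout, then exit so the OS reclaims all VRAM.
import json
import sys

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def main():
    req = json.loads(sys.stdin.read())
    llm = Llama(
        model_path=req["model"],
        chat_handler=Llava15ChatHandler(clip_model_path=req["mmproj"]),
        n_ctx=req.get("n_ctx", 2048),
        n_gpu_layers=-1,
        logits_all=True,  # older llama-cpp-python builds need this for LLaVA
        verbose=False,
    )
    out = llm.create_chat_completion(messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + req["image_b64"]}},
            {"type": "text", "text": req["prompt"]},
        ],
    }])
    sys.stdout.write(out["choices"][0]["message"]["content"])


if __name__ == "__main__":
    main()
```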

RangerFrank commented 5 months ago

That would be amazing. I'm attempting to use the node to interrogate a large dataset of images, and the memory issue is a big hindrance as I have to babysit the process.

curiousjp commented 5 months ago

Here's the gist - you may need to tweak some paths, but I'm sure you can figure it out. I have the invoke_llava script sitting in the same folder I execute run_nvidia_gpu.bat from. In cerulean's original code in llava.py, I have marked my changes with "#spf" (for reasons I can no longer recall, but I think it was originally short for "subprocess fork").

https://gist.github.com/curiousjp/99d9bf94f55748a5381dad560e9ed0a6
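The caller side in llava.py then reduces to something like the following sketch (again illustrative, not the gist's exact code; the invoke_llava.py path and JSON fields match the hypothetical worker above):

```python
# Sketch only: hand the request to the worker script over stdin and read the
# caption back from stdout. Because the worker is a separate process, its VRAM
# is released when it exits.
import json
import subprocess
import sys


def caption_via_worker(model_path, mmproj_path, prompt, image_b64):
    request = json.dumps({
        "model": model_path,
        "mmproj": mmproj_path,
        "prompt": prompt,
        "image_b64": image_b64,
    })
    proc = subprocess.run(
        [sys.executable, "invoke_llava.py"],  # path to the worker script; adjust as needed
        input=request,
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout
```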

Smuzzies commented 1 month ago

Confirmed it's leaking like a busted faucet when called in a ComfyUI node, after a few renders or during idle time after a single render. (4090, 24 GB VRAM; 64 GB system RAM)

ByblosHex commented 3 weeks ago

Same issue here with the memory leak. Eventually it stops functioning until I restart ComfyUI completely.