Kosinkadink / ComfyUI-VideoHelperSuite

Nodes related to video workflows
GNU General Public License v3.0

Memory Allocation Error When Loading Large Videos in ComfyUI-VideoHelperSuite #163

Open denyshubh opened 7 months ago

denyshubh commented 7 months ago

Describe the bug
A memory allocation error occurs when attempting to load large videos using the ComfyUI-VideoHelperSuite's VHS_LoadVideoPath function on Windows OS. The error is encountered in the load_video_cv function when trying to create a PyTorch tensor from a numpy array of video frames.

To Reproduce
Steps to reproduce the behavior:

  1. Use ComfyUI-VideoHelperSuite to load a large video file.
  2. Encounter the memory allocation error during the loading process.
  3. The video used had dimensions of 720×1366.

Expected behavior
The video should load successfully without causing a memory allocation error, regardless of the video size.

Error Message

Error occurred when executing VHS_LoadVideoPath:
cannot allocate array memory

File "C:\code\ComfyUI_windows_portable\ComfyUI\execution.py", line 152, in recursive_execute
...
File "C:\code\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-VideoHelperSuite\videohelpersuite\load_video_nodes.py", line 131, in load_video_cv
images = torch.from_numpy(np.fromiter(gen, np.dtype((np.float32, (height, width, 3)))))

Environment:

Additional context
The issue seems related to attempting to load the entire video into memory, which is not feasible for large videos.

Suggested Solution: Batch Processing of Images
To mitigate this issue, I suggest implementing a batch processing mechanism for video frames. This approach involves loading and processing a set number of frames at a time, thereby significantly reducing the memory footprint. Here's a conceptual approach to integrate batch processing:

  1. Modify the cv_frame_generator to yield frames in batches rather than attempting to load all frames into memory at once.
  2. Adjust the consumer of cv_frame_generator to handle these batches accordingly, processing each batch of frames before moving on to the next.

This method would allow ComfyUI-VideoHelperSuite to handle videos of any size by limiting the number of frames in memory at any given time.
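The two steps above can be sketched with a small, framework-agnostic helper. This is a hypothetical illustration, not actual VideoHelperSuite code: `batch_frames` would be fed by a decoder such as `cv_frame_generator`, and the consumer converts each batch to a tensor as needed.

```python
import numpy as np

def batch_frames(frame_iter, batch_size=16):
    """Group an iterator of (H, W, 3) float32 frames into stacked batches.

    Only up to batch_size decoded frames are held in memory at once;
    each yielded array has shape (n, H, W, 3) with n <= batch_size.
    """
    batch = []
    for frame in frame_iter:
        batch.append(frame)
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:  # flush the final, possibly short batch
        yield np.stack(batch)
```

The consumer would then process each yielded batch before requesting the next, capping peak memory at roughly one batch of decoded frames.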

AustinMroz commented 7 months ago

I may not be following correctly. Is this somehow different from the already implemented Batch Manager node? That already exists to handle the task of splitting a large video into batches and re-combining the batches afterwards.

Kosinkadink commented 7 months ago

@AustinMroz Hey, I've not had a lot of time to look into Batch Manager, but my impression of it is that it does not quite fix the RAM problem in the intended/intuitive way. I apologize for not being able to provide feedback on this for a long time.

For stuff like AnimateDiff or other techniques that need the images to be part of the same tensor, the issue is that Batch Manager produces lists of tensors (of a particular batch size), which Comfy interprets by running each batched entry separately. For example, if you were to load 128 frames with Batch Manager at a batch length of 16, it would end up making 8 separate runs that get stitched together at the end, rather than letting the KSampler work with knowledge of the whole intended tensor (all the images/latents as one tensor rather than a list).

The big point of comparison is Load Images vs. Load Video for the same number of frames. Load Images does not have the same RAM issues, which means the number of images itself isn't the problem, but rather whatever goes on in the loop that loads and concatenates the images with opencv/numpy.

I'm not too sure how Batch Manager works just from looking at the code, but it would be best if we could figure out what in Load Video's code makes it use up so much RAM when Load Images doesn't have the same issue. Maybe the batch logic in Batch Manager can be repurposed to create a single tensor from the list immediately, since Batch Manager usage does appear to fix the memory issue (and the fact that all images get combined at the end in Video Combine does not cause a memory issue either). Assuming there's no problem with that, we could hardcode a sensible internal batch value so that we avoid the memory issue altogether, make Batch Manager usage obsolete, and get the desired single-tensor behavior.
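One way to get the single-tensor behavior described above is to preallocate the full array once the frame count is known and fill it in place, rather than building many intermediate arrays with `np.fromiter` or repeated concatenation. A minimal sketch, with an illustrative function name and signature (not the real node's API):

```python
import numpy as np

def load_into_single_array(frame_iter, total_frames, height, width):
    """Fill a preallocated array frame-by-frame instead of np.fromiter.

    Preallocating avoids the transient copies that building and merging
    many per-frame arrays can create; peak memory is roughly the final
    array plus one decoded frame. total_frames, height, and width would
    come from the opened video's metadata in a real loader.
    """
    out = np.empty((total_frames, height, width, 3), dtype=np.float32)
    n = 0
    for frame in frame_iter:
        if n >= total_frames:
            break
        out[n] = frame
        n += 1
    return out[:n]  # trim if the video was shorter than expected
```

Downstream nodes would then receive one contiguous tensor, with the batching kept entirely internal to the loader.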

My biggest gripe with custom Comfy nodes, ever since I first got into Comfy dev work, has been the confusing interchange of lists and batched tensors, and if possible I'd like VideoHelperSuite to not have to play on that level. In cases where it must, it should be as clear as possible to the user that the tensors are (functionally) a list instead of a single tensor batch.

AustinMroz commented 7 months ago

Always glad to hear your insight on things and willing to cede to your vision. Projects I've started don't get much buzz.

As a short description of the Batch Manager: It is entirely removed from the ComfyUI latent batches dimension. It just automates re-queuing the entire workflow with a capped number of frames to process a single video input into a single video output.

The whole workflow batching was born of a reasonable request. As others were going so far as to test and open pull requests into the development branch for it, I figured releasing it was in-line with your recommendation to not hold back updates overlong.

I recognize that its "frames_per_batch" produces strict windows that break temporal cohesion, but figured it a best-effort implementation automating what was otherwise already being suggested to users. Communicating functionality and limitations to users could probably be improved, though. Of course, I'd certainly prefer a full solution (mem-mapping latents?), but that seems like it'd break established interfaces too much for my taste.

I, until now, had no reason to believe that load video had worse performance than load images. I had assumed that the greater number of issues opened was merely a consequence of the ease with which users could process large amounts of data, but I did some testing just in case this was hubris getting the better of me. I loaded the first 500 frames of a 1920 x 1080 video with each of the load nodes piped into a quick dummy image output node and measured the total memory used by ComfyUI.

load video, single run:
12,906,324
12,897,020
12,896,940
500 frames on disk: 848,684
load images, single run:
24,944,952
24,944,536
24,944,852
load video, 3x repeat:
12,910,108
load images, 3x repeat:
24,896,008

After accounting for imports, this matches my expectations of Load Video using half the memory of Load Images. If there's some underlying issue causing Load Video to use significantly more memory, it's not something I've been able to reproduce.
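The thread doesn't say how these totals were collected; for anyone wanting to repeat the comparison, one stdlib-only way to read a Python process's peak memory (the figures above look like kilobytes) is:

```python
import resource

def peak_rss_kb():
    """Peak resident set size of the current process.

    On Linux, ru_maxrss is reported in kilobytes (on macOS it is bytes),
    which roughly matches the kind of totals quoted above.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Note that the `resource` module is Unix-only; on Windows (where the original bug was reported) a cross-platform tool such as psutil's `Process.memory_info()` would be needed instead.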

EDIT: It should be quite doable to

However, the caching of output means this would need to be approximately divided by the number of nodes in the workflow that process image data, and that seems like it'd be tricky to get right.

Kosinkadink commented 7 months ago

Gotcha, thanks for looking into it! My impression about the mem issue was due to previous reports about manually exporting frames for a video and then loading them manually, but that was before your memory optimizations, so that likely just isn't reproducible anymore.

The workflow requeueing approach is interesting - does the whole workflow get requeued, or does it do some clever things to only redo certain parts of the workflow? There are definitely a lot of cases where this approach could be useful where there is no AD involved/frames can be handled separately. Thanks for explaining!

I think the best way forward is to create a video that would normally cause a RAM OOM when loaded (for a particular amount of RAM), and see if exporting the images separately and then using Load Images changes the behavior; my hunch is that some intermediate part of the process causes the bottleneck. Assuming we can confirm this is still an issue, we can then experiment with alternate approaches to loading video frames, like your method from a few months ago of using ffmpeg to export frames and then loading them in. If both Load Video and Load Images hit the same OOM limit, then the only real suggestion we can make besides Batch Manager would be to use the "force_size" input to resize the images as they load, so that they take up less space.
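To see why "force_size" helps, a back-of-the-envelope estimate of the decoded-frame footprint (float32 RGB, as in the `load_video_cv` traceback) is enough; the helper name here is just for illustration:

```python
def frame_memory_bytes(num_frames, height, width, channels=3, bytes_per_value=4):
    """Bytes needed to hold decoded frames as one float32 RGB array."""
    return num_frames * height * width * channels * bytes_per_value

# 500 frames at 1920x1080 need roughly 12.4 GB as a single float32 array;
# halving each dimension with force_size cuts that to a quarter.
full = frame_memory_bytes(500, 1080, 1920)   # 12,441,600,000 bytes
resized = frame_memory_bytes(500, 540, 960)  #  3,110,400,000 bytes
```

This lines up with the scale of the measurements above, and makes clear that resizing on load shrinks the tensor itself rather than just deferring the allocation.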

Just to make sure my understanding is correct, with the Batch Manager, does Video Combine produce a single video? And if so, does it do this by concatenating the images into a single tensor, or some other method of combining the videos together? If it does combine them into one tensor before making the video, that would point toward an intermediate bottleneck in Load Video, and the solution would be to try to replicate how those images are combined but in Load Video instead.

AustinMroz commented 7 months ago

It applies the same state caching as normal, but with the Batch Manager node (and everything dependent upon it) marked stale. Load Video and Video Combine both have special handling where the input/output handler is held open between executions. For Load Video, this means there's no extra latency from seeking past the frames that are skipped.

Please let me know if you find anything. I have a recently updated load_video_alpha branch in which this same functionality is introduced to the old ffmpeg loading code, but I haven't done any memory comparisons yet and I'm not currently intending for the branch to be merged.

To confirm: with the Batch Manager, Video Combine produces a single output file, but frames are no longer held in memory after a workflow completes. Instead, after each workflow sub-execution the output folder contains a partial video file with the encoded frames up to that point. It should currently be possible (with no code changes, but a beefy computer) to have Load Video (Path) take a URL to a video stream as its input file, begin workflow execution, and have a separate program stream from the partial output file to create a never-ending output stream.

EDIT: Quick testing shows the ffmpeg load code taking a negligibly smaller amount of memory, but it took significantly longer to execute (30 s for ffmpeg vs. 6 s for opencv). Performance has been fixed

single Load Video (ffmpeg)
12782368
12819692
12779780
x3 Load Video (ffmpeg)
12821056
H-Bole commented 7 months ago

#136 shows a case of my batch management.