Kosinkadink / ComfyUI-VideoHelperSuite

Nodes related to video workflows
GNU General Public License v3.0
543 stars 95 forks

"Video Combine" raises error because of incorrect list handling #176

Closed jokelbaf closed 6 months ago

jokelbaf commented 6 months ago

I guess line 326 in nodes.py should be something like

images = [tensor_to_bytes(i) for i in images]

Instead of

images = tensor_to_bytes(images)

However, there are a few more issues caused by incorrect handling of the images list elsewhere in the code, so fixing that line alone won't resolve the issue.
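To illustrate the failure mode, here is a minimal sketch of the two helpers from nodes.py, using NumPy arrays as stand-ins for torch tensors (the scaling logic is assumed to be the same; names match the traceback below):

```python
import numpy as np

# Minimal stand-ins for the helpers in nodes.py (NumPy in place of torch).
def tensor_to_int(tensor, bits):
    return tensor * (2**bits - 1)

def tensor_to_bytes(tensor):
    return tensor_to_int(tensor, 8).astype(np.uint8)

batch = np.random.rand(4, 8, 8, 3)                    # batched frames
frames = [np.random.rand(8, 8, 3) for _ in range(4)]  # list of frames

out = tensor_to_bytes(batch)                       # fine: one vectorized call
per_frame = [tensor_to_bytes(f) for f in frames]   # the suggested per-frame fix

# Passing the raw list would fail: a Python list has no .astype (and a list
# of torch tensors has no .cpu), hence the AttributeError in the traceback.
```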

Full Traceback:

File "ComfyUI\execution.py", line 152, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ComfyUI\execution.py", line 82, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ComfyUI\execution.py", line 75, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ComfyUI\custom_nodes\ComfyUI-VideoHelperSuite\videohelpersuite\nodes.py", line 326, in combine_video
    images = tensor_to_bytes(images)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "ComfyUI\custom_nodes\ComfyUI-VideoHelperSuite\videohelpersuite\nodes.py", line 93, in tensor_to_bytes
    return tensor_to_int(tensor, 8).astype(np.uint8)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "ComfyUI\custom_nodes\ComfyUI-VideoHelperSuite\videohelpersuite\nodes.py", line 88, in tensor_to_int
    tensor = tensor.cpu().numpy() * (2**bits-1)
             ^^^^^^^^^^
AttributeError: 'list' object has no attribute 'cpu'
Kosinkadink commented 6 months ago

In ComfyUI, the typical convention is for images, masks, and latents to be stored as singular batched tensors (in the case of latents, the tensor is in a dictionary, but just a single tensor for all the latents). The vanilla nodes do it this way, and so do the majority of all extensions. We usually call this a "batch". 99% of nodes in the ecosystem use batch type returns for images, masks, and latents.

The images you plugged in must have come from a node that uses lists instead of a batch, but there are likely nodes alongside it that will convert from a list to a batch. It's unfortunate that the conventional batch and unconventional list both use the same return type, making it an absolute mess to mix nodes that use one with the other.

We could add code to convert lists to tensors automatically, since Video Combine is the endpoint for images, so we don't have to open the can of worms about whether a node's output should match the storage type of its input versus always returning the batch type. @AustinMroz would you mind adding a little piece of conversion code into Video Combine for the case where the input images are not a tensor?
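The requested guard could look something like the following sketch (a hypothetical `ensure_batch` helper; NumPy's `stack` stands in for `torch.stack`):

```python
import numpy as np

def ensure_batch(images):
    """Hypothetical guard for Video Combine: if `images` arrives as a list
    of per-frame arrays (torch tensors in the real node), stack it into a
    single batched array; batched input passes through unchanged."""
    if isinstance(images, (list, tuple)):
        return np.stack(images, axis=0)  # torch.stack(images) in the real node
    return images
```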

jokelbaf commented 6 months ago

We could add code so that we convert lists to tensors automatically

This would be a wonderful feature.

AustinMroz commented 6 months ago

Actually, I had been working in the opposite direction. There have been a lot of complaints lately about memory usage, and this is the one place in the current output code where a non-negligible amount of memory is allocated. To that end, I had instead moved over to using iterators and map so that the conversion occurs lazily and the converted image data never needs to exist in its entirety. As a side effect of that memory-reduction work, the new code functions equally well on both tensor and list inputs, since both are iterables over frames.
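The lazy approach described above can be sketched like this (NumPy stands in for torch; `to_bytes` and `iter_frames` are illustrative names, not the branch's actual code):

```python
import numpy as np

def to_bytes(frame):
    # Per-frame 8-bit conversion (NumPy stand-in for the torch helper).
    return (frame * 255).astype(np.uint8)

def iter_frames(images):
    # Both a batched array and a list of frames iterate frame-by-frame,
    # so the same lazy pipeline handles either input. Frames are only
    # converted as they are consumed (e.g. piped to the encoder), so the
    # full uint8 video never exists in memory at once.
    return map(to_bytes, images)

batch = np.random.rand(3, 4, 4, 3)                        # batched tensor
frame_list = [np.random.rand(4, 4, 3) for _ in range(3)]  # list of frames
```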

These changes were made as part of a proof-of-concept branch (load_latent) that streams image data directly in and out of latent space for a ~50x reduction in memory usage. While it functions splendidly in the ideal case, I consider that work a failure overall: it increases the burden of knowledge on end users, degrades performance heavily when multiple output videos are produced, and seems fundamentally incompatible with controlnet, which requires full image tensors.

I'll put some time tomorrow into cleaning up and merging just the code for converting image data since it'll resolve this issue and reduce memory overhead without any of the downsides the rest of that branch brings.

Kosinkadink commented 6 months ago

For any stuff that happens internally in nodes, I'd say that's fine, but I would want to avoid straying away from singular tensors for images/latents/masks for any outputs of our nodes.

The list-to-tensor conversion should be fine, since anyone who wants to use the list-output nodes with the 99% of the rest of the ComfyUI ecosystem would have to perform that conversion anyway. The exception is if the internals of Video Combine do something else with that input, which would also be fine, as long as it stays internal.

Kosinkadink commented 6 months ago

Since a lot of memory issues are really centered around loading videos and combining the frames into a video, one approach we could explore on that front is taking advantage of ffmpeg's image capabilities by extracting/saving images in a temporary location, and then (in the case of loading) loading the images or (in the case of saving) using an ffmpeg call to collect the frames in the temp location. Comparing memory usage and performance side by side with the current approach might be handy in determining if it would be worth ever exposing it as a choice for end users.
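The saving side of this idea amounts to handing ffmpeg a numbered image sequence instead of piping frames through Python. A minimal sketch of the command construction (paths and frame rate are illustrative assumptions; the real node would fill these in):

```python
import os
import tempfile

# Sketch of the temp-image approach: frames are assumed to already be
# saved as numbered PNGs (00001.png, 00002.png, ...) in a temp directory.
# ffmpeg collects them via an image-sequence input pattern, so no frame
# data has to pass through Python at combine time.
frames_dir = tempfile.mkdtemp()
cmd = [
    "ffmpeg", "-y",
    "-framerate", "24",                            # illustrative frame rate
    "-i", os.path.join(frames_dir, "%05d.png"),    # numbered frame pattern
    "-pix_fmt", "yuv420p",
    "output.mp4",
]
# subprocess.run(cmd, check=True)  # uncomment where ffmpeg is installed
```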

One case where this would likely be worth it is gifski. I'm not sure if we've changed our current gifski implementation, but it does not work unless it's a specific, community-compiled build that allows a video to be passed in instead of a sequence of images. I'm honestly a bit confused about why the widely distributed gifski binaries don't have that built in, but I guess that's just something we'll need to account for.

AustinMroz commented 6 months ago

I strongly agree with keeping our outputs as tensors. That branch adds new Load Video/Video Combine nodes that take a vae as input and input/output bog standard latents. All the optimizations are kept internal.

I agree with the correctness of your suggestion for converting to a tensor, but I have created an implementation of my suggestion in the combine_memory_optimizations branch as an example. It provides substantial memory optimizations both under normal operation and in this edge case.

Either I'm misunderstanding your suggestion for loading/saving images, or I'm making a serious mistake in my memory-usage measurements. With the latest round of memory optimizations, I see the following memory usage for 500 frames of 1080p video: [memory-usage screenshot]. I don't see how temporarily saving individual frames to disk could either improve the comparatively minor overhead of our io nodes (with memory optimizations) or reduce the size of the tensors being passed between nodes.

In regards to gifski, I agree. I'm not sure why the default binaries don't include video support (licensing? binary size?), but general usability will require saving temporary images instead and dealing with the associated consequences (by my napkin math, there's a ~400-frame cap on Windows due to the maximum argument length).
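The napkin math referenced above can be reproduced roughly as follows (the 32767-character limit is Windows' CreateProcess command-line cap; the sample path length is an illustrative assumption, not a measurement, so the exact cap varies with temp-directory depth):

```python
# Rough check of the frame cap when every frame is passed to gifski as
# its own command-line argument on Windows.
MAX_CMDLINE = 32767  # CreateProcess command-line limit in characters

# Illustrative temp-file path; real paths may be longer or shorter.
sample_path = r"C:\Users\me\AppData\Local\Temp\vhs\frames\frame_00001.png"
chars_per_frame = len(sample_path) + 1  # +1 for the separating space

frame_cap = MAX_CMDLINE // chars_per_frame
# With path lengths in the 60-80 character range this lands in the low
# hundreds of frames, consistent with the ~400-frame estimate above.
```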

Kosinkadink commented 6 months ago

Gotcha. In terms of memory usage, I guess we're at the limits of what we can do, short of trying out float16 to halve everything, but that would require pretty much all of comfy to account for it in some way, which is unfortunate. In that case, you can ignore my suggestion for the non-gifski stuff.

The combine_memory_optimizations branch looks good (and fixes the list handling), so if it has no effect on the produced video compared to before, I'm good with you merging it in!

AustinMroz commented 6 months ago

I'll go ahead and merge it. I did some further testing hoping to measure the execution-time cost of the changes, but instead found them to be a measurable speedup, and found that the original Video Combine implementation was far less memory efficient than I expected.

I actually had the exact same thought about reducing precision, and found that torch will automatically upcast operations between a float16 and a float32 to float32. I suspect there would be no downside to having Load Video provide float16 image tensors, but I would absolutely want to test more.
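The upcasting behavior mentioned above is easy to demonstrate. NumPy follows the same promotion rule as torch for this case, so it serves as a stand-in here:

```python
import numpy as np

# Mixed-precision ops promote to the wider dtype, so float16 frames from
# Load Video would silently become float32 on the first op that mixes
# them with a float32 operand, and the memory savings would end there.
half = np.ones((2, 2), dtype=np.float16)
single = np.ones((2, 2), dtype=np.float32)
result = half * single  # promoted to float32
```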