[Bug]: IP-Adapter error (calling ImageEmbed) when using multiple images w/AnimateDiff

tyDiffusion commented 1 month ago

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits of both this extension and the webui

What happened?

I am testing AnimateDiff together with IP-Adapter and experimenting with multiple IP-Adapter inputs: I add X images to IP-Adapter, where X is the total number of AnimateDiff output frames. All of this is done through the API only.

For the command line API, I am adding images to my IP-Adapter ControlNet like this:

"batch_images": "{path_to_images}",
"input_mode": "batch",
"animatediff_batch": true,

If I add multiple IP-Adapter images, I satisfy the following condition in controlnet.py (line 948):

elif unit.is_animate_diff_batch or control_model_type in [ControlModelType.SparseCtrl]:

and then

elif unit.accepts_multiple_inputs:

That's fine and good - it tells me the code flow is correct as my inputs are being recognized as an AnimateDiff batch process for a ControlNet that accepts multiple inputs (IP-Adapter).

However, the following line in controlnet.py generates an error:

c = ImageEmbed(c_full, ip_adapter_emb.uncond_emb, True)

Looking into ipadapter_model.py, I think I see why the error occurs: the ImageEmbed class has no field for the boolean value, so its constructor does not accept the third argument. However, when I modify the above line like this:

c = ImageEmbed(c_full, ip_adapter_emb.uncond_emb)
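For anyone following along, here is a minimal sketch of why the three-argument call fails and the two-argument call gets further. It assumes ImageEmbed behaves like a plain two-field class; ImageEmbedStandIn, its cond_emb field name, and the try_construct helper are hypothetical stand-ins rather than the extension's actual code, and the report does not quote the real error message.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ImageEmbedStandIn:
    """Hypothetical stand-in for ImageEmbed: two fields, no bypass flag."""
    cond_emb: Any    # assumed field name, not taken from the report
    uncond_emb: Any  # this field name does appear in the report


def try_construct(*args):
    """Return the constructed object, or the TypeError raised while constructing it."""
    try:
        return ImageEmbedStandIn(*args)
    except TypeError as exc:
        return exc


# Old call shape: a surplus positional argument is rejected by the constructor.
old_call = try_construct("cond", "uncond", True)
# New call shape: matches the two declared fields.
new_call = try_construct("cond", "uncond")

print(type(old_call).__name__)
print(type(new_call).__name__)
```

In plain Python a surplus positional argument surfaces as a TypeError, which is consistent with the call failing before any tensor work starts.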

The code execution progresses further, but shortly after I get another error:

IndexError: too many indices for tensor of dimension 3 (line 181 in ipadapter_model.py)

So something is clearly not working...however, maybe I don't understand the types of inputs required for a multi-image IP-Adapter setup w/AnimateDiff?

The goal here is to have one IP-Adapter input image per frame, so that each frame of the AnimateDiff result is tuned to the corresponding IP-Adapter input image. The same setup works with other ControlNets (for example, if I provide identical batch image input to a Depth ControlNet, the AnimateDiff result uses each depth input for the corresponding animation frame). It's just IP-Adapter that fails.

Steps to reproduce the problem

See above

What should have happened?

See above

Commit where the problem happens

N/A

What browsers do you use to access the UI ?

No response

Command Line Arguments

N/A

List of enabled extensions

N/A

Console logs

N/A

Additional information

No response

tyDiffusion commented 1 month ago

Alright, I've been diving into the codebase to track down the cause of these problems and there are a few issues at play:

1) The improper call to the ImageEmbed constructor is due to this past commit: https://github.com/Mikubill/sd-webui-controlnet/pull/2725

ImageEmbed.bypass_average was removed, but not all calls to ImageEmbed were updated - so the fix there is as simple as removing the unnecessary bool argument in the problematic function calls.

2) In plugable_ipadapter.py, the preprocessor_outputs argument of the hook function has types that are not accounted for in the function logic.

In my experimentation, depending on my API inputs and whether or not I'm using AnimateDiff and/or multi-image input, preprocessor_outputs can have at least 3 different types: ImageEmbed, dict and tuple. However, the current function logic does not account for preprocessor_outputs having type ImageEmbed, resulting in an error when the average_of function is called on it.

I'm not sure if the core issue here is the conversion conditional in this function, or something upstream (why is preprocessor_outputs already an ImageEmbed object when passed to that function?)...I will continue to dig further and report my findings, assuming @Mikubill doesn't chime in sooner.

tyDiffusion commented 1 month ago

Ah...in line 958 of controlnet.py, in the ad_process_control function, the value returned by that function (which ends up as "preprocessor_outputs" in the hook function) is being assigned as:

c = ImageEmbed(c_cond, c.uncond_emb)  # trailing ", True" removed: the ImageEmbed constructor no longer accepts the bool argument

But that only happens if AnimateDiff is enabled and the ControlNet is IP-Adapter, which explains why it's not being passed in as a dict/tuple in the case where it causes a downstream error in the hook function.

Same goes for further down in the ad_process_control function, if keyframes are found. So that function is the main source of the type mismatch.

tyDiffusion commented 1 month ago

Ok, with the fixes to the ImageEmbed calls listed above, and the following change to the conditional logic of the hook function in plugable_ipadapter.py, IP-Adapter/AnimateDiff keyframe-based prompt travel is working, as well as single/multi-image inputs in regular image generation:

if isinstance(preprocessor_outputs, ImageEmbed):
    # Already an ImageEmbed (built in ad_process_control): use it as-is.
    self.image_emb = preprocessor_outputs
elif isinstance(preprocessor_outputs, dict):
    # A dict is converted to an embedding directly.
    self.image_emb = self.ipadapter.get_image_emb(preprocessor_outputs)
elif isinstance(preprocessor_outputs, tuple):
    # A tuple is converted element by element, then averaged.
    self.image_emb = ImageEmbed.average_of(*[self.ipadapter.get_image_emb(o) for o in preprocessor_outputs])
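To make the three-way dispatch above easier to try outside the webui, here is a rough, self-contained sketch. Everything in it is a stand-in: Embed replaces ImageEmbed, embeddings are flat lists of floats instead of tensors, get_image_emb is a hypothetical substitute for self.ipadapter.get_image_emb, and the final TypeError branch is my own addition, not part of the fix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Embed:
    """Hypothetical stand-in for ImageEmbed; embeddings are flat float lists."""
    cond: List[float]
    uncond: List[float]

    @staticmethod
    def average_of(*embeds: "Embed") -> "Embed":
        # Element-wise mean over all supplied embeddings.
        n = len(embeds)
        cond = [sum(vals) / n for vals in zip(*(e.cond for e in embeds))]
        uncond = [sum(vals) / n for vals in zip(*(e.uncond for e in embeds))]
        return Embed(cond, uncond)


def get_image_emb(output: dict) -> Embed:
    """Hypothetical stand-in for self.ipadapter.get_image_emb."""
    return Embed(list(output["cond"]), list(output["uncond"]))


def resolve_embedding(preprocessor_outputs) -> Embed:
    """Mirror the dispatch: pass through, convert one, or convert and average many."""
    if isinstance(preprocessor_outputs, Embed):
        return preprocessor_outputs
    if isinstance(preprocessor_outputs, dict):
        return get_image_emb(preprocessor_outputs)
    if isinstance(preprocessor_outputs, tuple):
        return Embed.average_of(*[get_image_emb(o) for o in preprocessor_outputs])
    # Assumption: fail loudly on any other type instead of leaving the result unset.
    raise TypeError(f"unsupported preprocessor_outputs type: {type(preprocessor_outputs).__name__}")


a = {"cond": [1.0, 3.0], "uncond": [0.0, 0.0]}
b = {"cond": [3.0, 5.0], "uncond": [2.0, 4.0]}
ready = Embed([9.0], [9.0])

print(resolve_embedding(ready))
print(resolve_embedding(a))
print(resolve_embedding((a, b)))
```

The point of the pass-through branch is that an object that is already an embedding must not be fed through the conversion or averaging steps again, which is what triggered the downstream error in the hook.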
