It's an interesting idea. Honestly, unlike Pix2Pix, which was integrated quickly, ControlNet's potential is vast enough that you'd need to rebuild the GUI around it. Current infrastructure limitations restrict its most disruptive uses.
I see people keep opening issues related to this because everyone wants to do it. Myself included :Þ
I've been having great fun with prompt travel with ControlNet: https://www.youtube.com/watch?v=LXqG_lG1B20
... barring the issues here: https://github.com/Mikubill/sd-webui-controlnet/issues/417#issuecomment-1448233009
But it'd be so much better if ControlNet had a list or directory of images rather than a single one, and would increment through them once per frame (someone in another issue requested random images: https://github.com/Mikubill/sd-webui-controlnet/issues/362 - that'd just be a checkbox). You don't have to implement the functionality to run txt2img or img2img more than once - there's already ample functionality for that in AUTOMATIC1111 (various scripts or built-in batch options). You just need to have it swap images every time generation starts.
FYI, I've just posted code in #444 that could easily be used (and in fact, I'm using it this way) to grab images and insert them into one or more CN units as part of a standard txt2img or img2img flow, using a custom script. I can iterate through a directory of poses, and it'll insert each in turn, generate an image with the prompt, and then loop to the next, all controlled by one script (a rough sketch of the approach is below).
So rather than add every desired feature under the sun to CN, supporting a way for custom scripts to more easily control CN should be encouraged.
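To make that concrete, here is a minimal sketch (not the actual #444 code) of a custom script that walks a directory of pose images and runs one generation per image. The external_code import path and helper names are assumptions about the extension's scripting interface and may differ from what #444 exposes:

import glob
import importlib

import cv2
import modules.scripts as scripts
from modules.processing import process_images


class ControlNetDirectoryLoop(scripts.Script):
    def title(self):
        return "ControlNet directory loop (sketch)"

    def run(self, p, pose_dir="poses"):
        # Assumed import path for the extension's scripting interface.
        external_code = importlib.import_module(
            "extensions.sd-webui-controlnet.scripts.external_code", "external_code"
        )
        units = external_code.get_all_units_in_processing(p)
        result = None
        for path in sorted(glob.glob(f"{pose_dir}/*.png")):
            # Feed the next pose to the first ControlNet unit, then run one generation.
            units[0].image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
            external_code.update_cn_script_in_processing(p, units)
            result = process_images(p)
        return result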
By the way, if you want a web API client that works with ControlNet, there is one: https://github.com/mix1009/sdwebuiapi
External code support will be more useful if you want to write a custom extension that interfaces with ControlNet; a quick usage sketch of the client is below.
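A quick sketch of how that client can drive ControlNet (the exact ControlNetUnit argument names have changed between client versions, so treat this as an approximation; the pose file and model name are placeholders):

from PIL import Image
import webuiapi

api = webuiapi.WebUIApi(host="127.0.0.1", port=7860)

pose = Image.open("pose_0001.png")  # placeholder control image
unit = webuiapi.ControlNetUnit(
    input_image=pose,               # newer client versions may call this "image"
    module="openpose",
    model="control_sd15_openpose",  # must match a ControlNet model installed in the webui
    weight=1.0,
)

result = api.txt2img(
    prompt="a dancer on a stage",
    steps=20,
    controlnet_units=[unit],
)
result.image.save("out.png")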
@scruffynerf So every single script would need to be modified? This sounds like an awful idea.
It should be pretty obvious by now that anything in AUTOMATIC1111 that takes an image should be able to take batches of images. Every single image input has gotten batch input options. Because, for obvious reasons, people want batch inputs.
ControlNet should not go off and "do its own thing" and omit batch image inputs.
@ljleb And how would that help with prompt travel? Seed travel? Shift attention? Literally anything that does multiple generations on a given set of inputs? Also, telling people "go write some separate plugin that uses the API" every single time someone wants to do something different with ControlNet is not a solution. Nor is it consistent with AUTOMATIC1111's design philosophy.
ControlNet should - like literally all other image inputs - have a batch image input. And if ControlNet is enabled, and there's a batch of images, it should start at the top of the list when "Generate" is clicked, and iterate over the image list once per generation.
@enn-nafnlaus My comment above was related to external code support, not to this issue particularly.
I think batch image input is a pretty useful feature; a lot of people have requested it. What I'm saying is that very narrow features may not be added right away, if they are added at all. We should start with basic batch support, for example batching multiple dirs, one for each ControlNet unit, in sync (a toy sketch of that is below), while keeping in mind to leave space for other, more specific features like determining precisely which control unit is active on which frame.
Maybe I just don't see it well, idk. IMO we should split this feature into orthogonal parts and then implement a solid base on top of which we can add details by priority.
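As a toy illustration of the "multiple dirs, in sync" idea (directory names are made up): directory k feeds ControlNet unit k, and generation i takes the i-th file from every directory.

import glob

unit_dirs = ["unit0_openpose", "unit1_depth"]  # one directory per ControlNet unit
frame_sets = zip(*(sorted(glob.glob(f"{d}/*.png")) for d in unit_dirs))

for frame_index, per_unit_images in enumerate(frame_sets):
    # per_unit_images[k] would become the control image of unit k for this
    # generation; zip() stops at the shortest directory, keeping the units in sync.
    print(frame_index, per_unit_images)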
@scruffynerf So every single script would need to be modified? This sounds like an awful idea
That's not what I said. Right now, any existing script that wants to control CN has a lot of work to do to get there, whether it's a trivial custom script or a bigger, more elaborate one.
Expecting CN to add every feature isn't feasible. Making it easier for other scripts to use CN is.
Yes, selecting a "batch directory" seems trivial. Did you know there is no Gradio interface to do so? Even something simple like "do random" or "do all" involves multiple logic points, each of which isn't trivial once you handle all the edge cases. To name just one: how do we handle non-images, and what even counts as a non-image (a naive filter is sketched below)? What about an image mask? What about detecting only certain types of images, named a certain way? The Openpose editor supports a JSON save format; if someone has a folder of those and wants to use them as a source for a CN layer, I don't expect CN to support that... but it's totally reasonable to want it.
BTW, I'm working on a script to do all of this. What are you doing about it?
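Just to illustrate the "what counts as a control image" question, a naive filter might look like the sketch below. The extension list and the idea of also accepting Openpose editor JSON poses are hypothetical, not something CN supports today.

from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def collect_control_inputs(directory):
    images, poses, skipped = [], [], []
    for path in sorted(Path(directory).iterdir()):
        suffix = path.suffix.lower()
        if suffix in IMAGE_EXTS:
            images.append(path)   # plain control images
        elif suffix == ".json":
            poses.append(path)    # e.g. a saved Openpose editor pose
        else:
            skipped.append(path)  # masks, sidecar files, anything else
    return images, poses, skipped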
Just a kind reminder to keep a friendly tone. I know everyone wants this extension to be the best, this is why we're talking about these features right now. I don't know if it's possible to satisfy everyone, but we can close that gap if we take the time this needs to take.
Is it possible to use the ControlNet m2m module to achieve this? Just concatenate the pose images into a video and generate another video.
import base64
import json
import pickle

import cv2
import numpy as np
import requests

# HED_MODEL, TEMPORALNET_MODEL, OPENPOSE_MODEL and args (prompt, negative_prompt,
# width, height) are defined elsewhere in the full script.

def send_request(last_image_path, optical_flow_path, current_image_path):
    url = "http://localhost:7860/sdapi/v1/img2img"

    with open(last_image_path, "rb") as b:
        last_image_encoded = base64.b64encode(b.read()).decode("utf-8")  # not used below

    # Load and process the last image
    last_image = cv2.imread(last_image_path)
    last_image = cv2.cvtColor(last_image, cv2.COLOR_BGR2RGB)

    # Load and process the optical flow image
    flow_image = cv2.imread(optical_flow_path)
    flow_image = cv2.cvtColor(flow_image, cv2.COLOR_BGR2RGB)

    # Load and encode the current image
    with open(current_image_path, "rb") as b:
        current_image = base64.b64encode(b.read()).decode("utf-8")

    # Stacking the last frame and the optical flow image into a 6-channel image
    six_channel_image = np.dstack((last_image, flow_image))

    # Serializing the 6-channel image
    serialized_image = pickle.dumps(six_channel_image)

    # Encoding the serialized image
    encoded_image = base64.b64encode(serialized_image).decode("utf-8")

    data = {
        "init_images": [current_image],
        "inpainting_fill": 0,
        "inpaint_full_res": True,
        "inpaint_full_res_padding": 1,
        "inpainting_mask_invert": 1,
        "resize_mode": 0,
        "denoising_strength": 0.4,
        "prompt": args.prompt,
        "negative_prompt": args.negative_prompt,
        "alwayson_scripts": {
            "ControlNet": {
                "args": [
                    {
                        "input_image": current_image,
                        "module": "hed",  # Apply this preprocessor before passing to the model
                        "model": HED_MODEL,  # Apply this model
                        "weight": 0.7,
                        "guidance": 1,
                        "pixel_perfect": True,  # Apply the pixel_perfect preprocessor
                        "resize_mode": 0,  # Resize accordingly for the target size
                    },
                    {
                        "input_image": encoded_image,
                        "model": TEMPORALNET_MODEL,
                        "module": "none",  # No preprocessing required before passing to TEMPORALNET_MODEL
                        "weight": 0.6,
                        "guidance": 1,
                        # "processor_res": 512,
                        "threshold_a": 64,
                        "threshold_b": 64,
                        "resize_mode": 0,
                    },
                    {
                        "input_image": current_image,
                        "model": OPENPOSE_MODEL,
                        "module": "openpose_full",
                        "weight": 0.7,
                        "guidance": 1,
                        "pixel_perfect": True,
                        "resize_mode": 0,
                    },
                ]
            }
        },
        "seed": 4123457655,
        "subseed": -1,
        "subseed_strength": -1,
        "sampler_index": "Euler a",
        "batch_size": 1,
        "n_iter": 1,
        "steps": 20,
        "cfg_scale": 6,
        "width": args.width,
        "height": args.height,
        "restore_faces": True,
        "include_init_images": True,
        "override_settings": {},
        "override_settings_restore_afterwards": True,
    }

    response = requests.post(url, json=data)
    if response.status_code == 200:
        return response.content
    else:
        try:
            error_data = response.json()
            print("Error:")
            print(str(error_data))
        except json.JSONDecodeError:
            print("Error: Unable to parse JSON error data.")
        return None
I am trying to reproduce this multi-ControlNet API call in Python code. I want to ask how exactly alwayson_scripts is handled, since def process_image(p) doesn't seem to do anything with p.script_args?
If I interpret the code correctly, we add the model indices mentioned under "args" in the init_script_args function, which is called from the img2imgapi function?
Does this mean that the image is first passed to HED_MODEL, then its output is passed to TEMPORALNET_MODEL, and then that output is passed to OPENPOSE_MODEL?
My aim is to not use the API but rather to get this working by loading the pre-trained models and running inference on them directly. How should I replace the functionality of alwayson_scripts and implement it in plain Python code?
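As far as I can tell, alwayson_scripts only exists so the API can hand those arguments to the ControlNet extension's script inside the webui, so outside the webui there is nothing to replace: you pass the conditioning images straight to the model. Here is a rough sketch of the HED + OpenPose part using the diffusers multi-ControlNet img2img pipeline (the TemporalNet unit is omitted because its 6-channel input needs a custom model, and the control maps are assumed to be precomputed; model ids and parameters are illustrative, not a faithful reproduction of the request above):

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-hed", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

init = Image.open("current_frame.png")           # frame being re-rendered
hed_map = Image.open("current_frame_hed.png")    # precomputed HED edge map
pose_map = Image.open("current_frame_pose.png")  # precomputed OpenPose map

result = pipe(
    prompt="your prompt here",
    image=init,
    control_image=[hed_map, pose_map],         # one control image per ControlNet
    controlnet_conditioning_scale=[0.7, 0.7],  # roughly the "weight" field above
    strength=0.4,                              # analogue of denoising_strength
    num_inference_steps=20,
    guidance_scale=6.0,
).images[0]
result.save("out.png")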
Hi, I want to humbly request, or at least put forward the idea of, allowing ControlNets to be combined in some combined pass or some multi-pass process.
Why? Because it would then be possible to use different ControlNets to influence different areas of an animation. For example:
OpenPose could control and influence the general pose of the character in a video, which we could make from a video or a 3D render animation with the proper colors. Something like an animated skirt could then be composited from a video or reference using a sketch ControlNet (from a sketch animation made in something like Pencil2D) or taken from a video using the HED ControlNet. Similarly, the hair could be properly animated from a quick 2D sketch animation made in a tool like Pencil2D.
Other objects could be extracted from the video using a color segmentation mask, which could be generated from a Blender animation render. Finally, all these different video masks and references could be combined into different input regions, each controlled by a different ControlNet (OpenPose for limbs, sketch for things like facial expressions and 2D animation of hair strands, and segmentation for things like separating different clothing elements).
Every ControlNet is powerful and would be perfect for a different element of the animation in a final composite picture.
Not sure if my idea makes sense, but I am simply an amateur artist who can use 2D and 3D animation tools. Sorry if my English is poor and there are mistakes. :)