Acly / krita-ai-diffusion

Streamlined interface for generating images with AI in Krita. Inpaint and outpaint with optional text prompt, no tweaking required.

Masking: Optimal Krita workflow #225

Status: Closed (BinaryQuantumSoul closed this issue 7 months ago)

BinaryQuantumSoul commented 9 months ago

I wanted to discuss what would be the most important improvement with this layer-based sd UI. In my opinion, it is mask layers.

The optimal workflow would be to make multiple mask layers on the image. Each mask layer could be associated with its own prompt, loras and embeddings and could be linked to adapter layers like controlnet or IP. When generating an image, it would call the normal txt2img or img2img workflow but with those regional controls.

Example cases would be generating different people with different prompt/lora masks, generating a character with specific clothes from different ipadapter masks, and all the actual use cases but with this unified approach.

All those are doable inside comfyui, but krita would be far better for that.
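
Very roughly, the kind of per-mask data the plugin would need to collect for this, just to make the idea concrete (all names here are made up, it's only a sketch):

```python
# Hypothetical data model for the proposal: every mask layer carries its own
# generation settings, and one job combines them all into a single pass.
from dataclasses import dataclass, field

@dataclass
class ControlRef:
    mode: str                  # e.g. "openpose", "canny", "ip_adapter"
    layer: str                 # Krita layer holding the control / reference image
    strength: float = 1.0

@dataclass
class MaskRegion:
    mask_layer: str            # black/white Krita layer used as the region mask
    prompt: str = ""
    loras: list[str] = field(default_factory=list)
    controls: list[ControlRef] = field(default_factory=list)

@dataclass
class RegionalJob:
    base_layers: list[str]     # normal paint layers -> img2img base latent
    regions: list[MaskRegion]  # each region's settings apply only inside its mask
```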

AvidSpline commented 9 months ago

This could be done with IP-Adapter masked regions, right? I'm not sure what the UI for this might look like, though.

BinaryQuantumSoul commented 9 months ago

Indeed. Masked regions for IP-Adapter, masked conditioning for prompts/embeddings and ControlNet. I don't know how to do LoRAs, though.

I'm not familiar with Krita's UI, but I know it's the best fit for this use case. What would be ideal is a list of settings (prompt, embeddings, LoRA, IP-Adapter images, ControlNet + optional control image) that appears for each layer of type "mask". Normal layers would be used for the img2img base latent.
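
For the prompt/embedding part this maps pretty directly onto stock ComfyUI nodes, roughly like this fragment in API (prompt JSON) format. Node and input names are the stock ones as I remember them, so double-check them; the checkpoint/CLIP loader ("1"), the two region masks ("20", "21") and the sampler are not shown:

```python
# Two regional prompts: each is encoded, restricted to its mask, then combined.
regional_conditioning = {
    "10": {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["1", 1], "text": "a knight in silver armor"}},
    "11": {"class_type": "ConditioningSetMask",   # prompt 1 only inside mask 1
           "inputs": {"conditioning": ["10", 0], "mask": ["20", 0],
                      "strength": 1.0, "set_cond_area": "default"}},
    "12": {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["1", 1], "text": "a red dragon"}},
    "13": {"class_type": "ConditioningSetMask",   # prompt 2 only inside mask 2
           "inputs": {"conditioning": ["12", 0], "mask": ["21", 0],
                      "strength": 1.0, "set_cond_area": "default"}},
    "14": {"class_type": "ConditioningCombine",   # combined result goes to the sampler
           "inputs": {"conditioning_1": ["11", 0], "conditioning_2": ["13", 0]}},
}
```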

Acly commented 9 months ago

Not sure I got it right, but it sounds like a kind of batch workflow, where you set up a lot of things in advance by assigning generation settings to various regions of the image, and then execute them all at once?

It could be done, but it's kind of the opposite direction of where this plugin is going. The focus so far is on an interactive, iterative workflow, where you work on one part of the image exactly until you're happy with the partial result, then proceed to the next. Given how unpredictable the results often are, I find it far more efficient to iterate on e.g. composition, then individual parts, until satisfied, and then continue to build on top of that.

BinaryQuantumSoul commented 9 months ago

Well, you can indeed set lots of things up in advance. But it can also be used as an iterative workflow: for example, generate some pants and some t-shirts, then use IP-Adapter on the composition of both together with a ControlNet pose.

And if you want to make an interesting composition, masking will provide a more coherent output than inpainting.

Grant-CP commented 8 months ago

I wanted to add images/resources related to what Binary is saying. Here's an example workflow from the developer of the ipadapter_plus extension: [workflow screenshot] (from the video: https://www.youtube.com/watch?v=vqG1VXKteQg)

In this workflow, he is masking one of the images to perform IP-Adapter conditioning on the left, and the other to do the same on the right. He ends up with an image like this: [result image]

Note that the background is consistent, the texture of the lips is consistent, etc. This is because it was all generated at the same time. This is also useful in other cases, such as when you want to use a full controlnet pose guide, but you want to mask the attention so that the face is applied at half the strength of the body. You could regenerate the face a bunch of times after an initial pass with openpose, but that would tend to take a big hit in terms of consistency.
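
As far as I can tell, the core of that workflow boils down to something like the fragment below, in ComfyUI API format. The node names are from the ipadapter_plus extension as it existed at the time (it has since been rewritten), so treat them and their inputs as assumptions; the model/IP-Adapter/CLIP-Vision loaders, the two reference images and the two masks are referenced but not shown:

```python
# Two IP-Adapter applications chained on the same model, each limited to one
# half of the canvas via an attention mask; one sampling pass uses the result.
masked_ipadapter = {
    "30": {"class_type": "IPAdapterApply",   # left reference, left-half attn_mask
           "inputs": {"model": ["2", 0], "ipadapter": ["3", 0], "clip_vision": ["4", 0],
                      "image": ["40", 0], "attn_mask": ["50", 0], "weight": 0.8}},
    "31": {"class_type": "IPAdapterApply",   # right reference, chained, right-half mask
           "inputs": {"model": ["30", 0], "ipadapter": ["3", 0], "clip_vision": ["4", 0],
                      "image": ["41", 0], "attn_mask": ["51", 0], "weight": 0.8}},
    # The sampler takes the model output of "31"; because both references are
    # sampled together, the background and shared details stay consistent.
}
```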

Imagine using one IP adapter for a particular set of pants and another IP adapter for a set of shoes, and wanting the buckles on the belt to match the metal on the shoes. That would be pretty hard to do iteratively compared to doing it manually. Here's a reddit post of someone doing three-part masked attention: https://www.reddit.com/r/comfyui/comments/189sfmw/after_the_split_animal_post_from_earlier_i_went/

My personal opinion is that this sort of thing gets complex very fast, and it's probably worth just implementing custom workflows that can send any number of layers to ComfyUI instead, letting the technical user figure it out. I do agree with Binary that masked attention for ControlNets and prompts is very powerful and eventually belongs, in an integrated way, in programs like Krita. It also seems reasonable that it would be beyond the scope of this extension.

BinaryQuantumSoul commented 8 months ago

Yes, that's exactly what I was talking about. I got this workflow idea by watching that YouTube channel (Latent Vision). Although what I'm proposing is even more general, since it would allow any number of masks for ControlNet, IP-Adapter, prompt, LoRA and embedding.

AvidSpline commented 8 months ago

My issue without this is that when I get one part of the image exactly how I want it, then move out to do a lower-denoise regeneration of the whole image to make things consistent, it changes the other parts that I'd already dialed in. So I erase those parts to show the image below and keep regenerating, but it tends to do things like turn the hair color of all the characters the same, or change the face structures in ways I don't want. I can use the inverse-mask, change-to-mask-layer trick to say "don't change this part", but it still doesn't blend seamlessly with the rest of the image and seems really inconsistent.

Maybe there's a better way to do this and I'm just not used to the tools? I usually just pull the image back to comfy, but I really like how live generations help me iteratively refine parts of an image. It's just pulling that back into the composition without changing it too much that's a bear of a problem.

One way I thought of to do this is to expand the ControlNet line to two lines: add a checkbox "use mask layer" that lets you select a simple black-and-white layer with a mask drawn on it, and inject it as a ComfyUI mask node attached to that ControlNet.

You could also add a new ControlNet-style dropdown option "Mask Prompt" that lets you enter a prompt as well as a mask layer.
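
Roughly what I imagine the "use mask layer" checkbox injecting on the ComfyUI side, just as a sketch. Stock node names as far as I know, so double-check them; the base positive/negative conditioning ("10"/"11"), the ControlNet loader ("5") and the control image ("60") are assumed to exist elsewhere in the graph:

```python
# The exported black/white layer becomes a MASK that confines the ControlNet'd
# conditioning to one region; the plain prompt covers the rest of the image.
masked_controlnet = {
    "70": {"class_type": "LoadImage",                # the exported mask layer
           "inputs": {"image": "mask_layer.png"}},
    "71": {"class_type": "ImageToMask",              # black/white image -> MASK
           "inputs": {"image": ["70", 0], "channel": "red"}},
    "72": {"class_type": "ControlNetApplyAdvanced",  # pose/canny etc. applied as usual
           "inputs": {"positive": ["10", 0], "negative": ["11", 0],
                      "control_net": ["5", 0], "image": ["60", 0],
                      "strength": 1.0, "start_percent": 0.0, "end_percent": 1.0}},
    "73": {"class_type": "ConditioningSetMask",      # ...but confined to the mask
           "inputs": {"conditioning": ["72", 0], "mask": ["71", 0],
                      "strength": 1.0, "set_cond_area": "default"}},
    "74": {"class_type": "InvertMask",
           "inputs": {"mask": ["71", 0]}},
    "75": {"class_type": "ConditioningSetMask",      # plain prompt everywhere else
           "inputs": {"conditioning": ["10", 0], "mask": ["74", 0],
                      "strength": 1.0, "set_cond_area": "default"}},
    "76": {"class_type": "ConditioningCombine",      # this feeds the sampler's positive
           "inputs": {"conditioning_1": ["73", 0], "conditioning_2": ["75", 0]}},
}
```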

Also, IPAdapter would be useful to have, if that's possible. It's very helpful for e.g. character consistency, and UI-wise it could work roughly the same way as ControlNet.

Grant-CP commented 8 months ago

@BinaryQuantumSoul Do you think you could help out and find an example of someone using attention masking for a LoRA? I've got some working examples for ControlNet, prompt (which includes embeddings), and IP-Adapter, but it's not obvious to me how a LoRA could easily be applied to only some regions in ComfyUI. Maybe there's a way to do it with the RegionalSampler stuff in the Impact Pack? Let me know if you find anything on that.

@AvidSpline I’m not sure there’s any silver bullet solution to perfectly changing just part of an image, ultimately things like regional prompting and attention masking for controlnets still affect all of the image at least a little. Controlnet Inpaint is definitely good but not enough on its own I think. I’ve heard some tools like fooocus have their own inpainting algorithm for consistency which we can look into as time goes on.

I haven't looked too much into the code for the real-time workflow for the Krita extension, but I would expect that it is designed with speed in mind rather than quality. Are your consistency issues the same in live-paint vs. regular generation modes? I'm working on putting together some custom workflow code for the extension, with the first plan being to try out attention masking in live painting. My expectation is that the development cycle would be:

  1. Establish custom workflow to demonstrate value of attention masking
  2. Figure out best practices both for comfyui workflow and for Krita document setup
  3. Implement it in the actual Krita extension by default.

My thought on applying masks is also that they would be regular layers in Krita. However, my thought is that it would make more sense for them to be automatically applied. So for instance, if you had a layer group with [Control] Openpose, [Control] Canny, and [Mask] Mask1, using either of those control layers would apply the conditioning for the controlnet [Control] {Layer} using the mask [Mask]. My limited experience with masked conditioning is that the same mask tends to be reused many times. For example, you might have IP-adapter-face, openpose, and a prompt all pointing to one masked area where you are trying to generate a specific character. I think that would be nicely managed by putting all those layers in the same layer group.
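
To make that concrete, here's a purely illustrative sketch of how such a naming convention could be picked up from one layer group (the [Control]/[Mask] prefixes are the proposal above; the code and layer-name scheme are made up):

```python
# Within one layer group, every "[Control] <mode>" layer is conditioned through
# the group's "[Mask]" layer. Hypothetical parsing code for illustration only.
import re
from dataclasses import dataclass, field

TAG = re.compile(r"^\[(Control|Mask)\]\s*(.*)$")

@dataclass
class RegionGroup:
    mask_layer: str | None = None
    control_layers: dict[str, str] = field(default_factory=dict)  # mode -> layer name

def parse_group(layer_names: list[str]) -> RegionGroup:
    """Collect region settings from the layer names of one Krita layer group."""
    region = RegionGroup()
    for name in layer_names:
        m = TAG.match(name)
        if not m:
            continue  # ordinary paint layers are left alone
        kind, rest = m.group(1), m.group(2).strip()
        if kind == "Mask":
            region.mask_layer = name
        else:  # "Control"
            region.control_layers[rest.lower()] = name  # e.g. "openpose" -> layer
    return region

# The group described above:
group = parse_group(["[Control] Openpose", "[Control] Canny", "[Mask] Mask1"])
assert group.mask_layer == "[Mask] Mask1"
assert set(group.control_layers) == {"openpose", "canny"}
```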

No ideas yet for the best way to represent prompts in layers. Maybe your idea of them acting like controlnets would be good? I think having three separate fields for layer name, positive prompt, and negative prompt will start to get messy pretty fast.

I was thinking maybe having a layer called "[Prompt] prompt1" in a layer group with a [Mask] layer might be a good design? The [Prompt] layer could have text boxes in it for the positive and negative prompts? That way you could also put the text over your composition to visualize where you are prompting what.

The last open question is where the conditioning strength lives. I think for controlnets this is already straightforward, but for prompts I feel it is a pretty important option. Maybe [Prompt:1.2] syntax would make sense? Or maybe that's a good argument for having each regional prompt also be a line item in the extension docker like the controlnets. I'm also not sure that the conditioning strength here works exactly the same as the controlnet strength parameter (I think these strengths are normalized when you combine conditionings?).
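
If something like [Prompt:1.2] syntax were adopted, pulling the strength out of the layer name is simple enough; a throwaway sketch (whether that number should then feed a mask-conditioning strength or behave like the ControlNet strength is exactly the open question):

```python
# Hypothetical "[Prompt:<strength>] text" parsing; strength defaults to 1.0.
import re

PROMPT_TAG = re.compile(r"^\[Prompt(?::(?P<strength>\d+(?:\.\d+)?))?\]\s*(?P<text>.*)$")

def parse_prompt_layer(layer_name: str) -> tuple[str, float] | None:
    m = PROMPT_TAG.match(layer_name)
    if not m:
        return None  # not a [Prompt] layer
    strength = float(m.group("strength")) if m.group("strength") else 1.0
    return m.group("text"), strength

assert parse_prompt_layer("[Prompt:1.2] a knight in silver armor") == ("a knight in silver armor", 1.2)
assert parse_prompt_layer("[Prompt] background forest") == ("background forest", 1.0)
```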


BinaryQuantumSoul commented 8 months ago

I don't personally have an example, but I'm pretty sure there are already people who have done it.

I really like your layer group ideas; I hope the main dev will be interested once you share the example workflows you're preparing.

AvidSpline commented 8 months ago

@Grant-CP It happens in either mode (live or quality). Broad changes are actually something I want, since I'm trying to blend the larger composition with the details, but with masked attention in comfy I'm able to at least somewhat control things. Obviously we can't do better than comfy with a tool that uses it, yet ;)

I didn't even think of layer groups! I really like this idea. Maybe we don't have all the options, e.g. negatives, to start.

Kinda fun inventing new interaction paradigms for AI here.