cubiq / ComfyUI_IPAdapter_plus

[Feature Request] Evaluate an image mask to assign weights of parts of the image #81

Closed MoonMoon82 closed 7 months ago

MoonMoon82 commented 8 months ago

I'm trying to mask out specific things from an input image to exclude them from being processed. For example, the background or hair, which I cannot crop out of the image.

I already tried setting up a second Apply node with the masked content and giving it a negative weight, but it only works partially and doesn't give proper results. I also checked the "Encode IPAdapter Image" node to set a negative weight value, but it only accepts values >= 0.

Is there a proper way to mask out parts of an image that I don't want to show in the results? If not, could you please implement a mask input for (each) image which basically assigns weights to parts of the image? (Maybe the alpha channel of an image could be used to provide such a mask.)
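For illustration, a rough sketch (plain PIL/torch, not actual node code) of what I mean by using the alpha channel as a per-pixel weight map, in the 0..1 range that ComfyUI uses for MASK tensors:

```python
# Sketch only: read the alpha channel of an RGBA reference image and turn it
# into a per-pixel weight mask (0.0 = fully excluded, 1.0 = full weight).
import numpy as np
import torch
from PIL import Image

def alpha_to_mask(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGBA")
    alpha = np.asarray(img)[..., 3].astype(np.float32) / 255.0
    # shape (1, H, W), matching how ComfyUI represents MASK tensors
    return torch.from_numpy(alpha).unsqueeze(0)
```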

Thank you in advance! Kind regards

cubiq commented 8 months ago

I don't know if it can be done. I quickly checked the CLIP vision encoder and it doesn't seem to support masking, but I'll check if there's anything we can do.

MoonMoon82 commented 8 months ago

In the case of setting up a specific face with the IPAdapter Plus Face model, masking with black seems to work quite well.

My experimental setup: First test without any IPAdapter: image The results show different faces, backgrounds, hairstyles, and dresses.

Second test with IPAdapter and a basic input image connected: image The results show the same face, background, hairstyle, and dress color.

Third test with IPAdapter plus the masked image connected to a second IPAdapter with negative weight: image The results show the same face, different backgrounds, (more or less) different hairstyles, and different dresses.

But with any other IPAdapter model, this method doesn't work as expected. It would also be nice to store such masked embeddings in a single file instead of two.

ttulttul commented 7 months ago

You can achieve what you want by connecting your mask to the attn_mask input. This will cause the model to pay attention only to the parts of your image that are included by the mask.

MoonMoon82 commented 7 months ago

@ttulttul That's what I initially thought the attn_mask would be used for, but it seems it does not mask the input (reference) image. It only masks the target area where the IPAdapter will be applied during sampling.

"It's possible to add a mask to define the area where the IPAdapter will be applied to. Everything outside the mask will ignore the reference images and will only listen to the text prompt." https://github.com/cubiq/ComfyUI_IPAdapter_plus#attention-masking

ttulttul commented 7 months ago

> @ttulttul That's what I initially thought the attn_mask would be used for, but it seems it does not mask the input (reference) image. It only masks the target area where the IPAdapter will be applied during sampling.
>
> "It's possible to add a mask to define the area where the IPAdapter will be applied to. Everything outside the mask will ignore the reference images and will only listen to the text prompt." https://github.com/cubiq/ComfyUI_IPAdapter_plus#attention-masking

Oh wow, I had no idea. But it does make a good deal of sense. I suppose you could mask out your source image AND use attn_mask to mask out where the IPAdapter is applied, to somewhat achieve what you're hoping for. The trouble with this approach is that, presumably, the IPAdapter will still be feeding CLIP the entire rectangular area of your source image and will consider the area excluded by your mask to be "blackness", which may cause strange results.

What we need to do is to figure out how to get CLIP to attend to only the masked portion. I’m not sure it has that ability.
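To illustrate the "blackness" point (a rough sketch with plain torch, not the node code): black-masking only replaces pixel values, and the vision encoder still receives the full rectangle.

```python
# Sketch of the "blackness" problem: multiplying the source image by the
# subject mask removes the background pixels, but the CLIP vision encoder
# still receives the full rectangle (typically resized to 224x224), so the
# black region is encoded as image content rather than being ignored.
import torch
import torch.nn.functional as F

def black_masked_clip_input(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # image: (1, 3, H, W) in 0..1, mask: (1, 1, H, W) with 1 = keep, 0 = exclude
    masked = image * mask  # excluded pixels become black
    return F.interpolate(masked, size=(224, 224), mode="bilinear")
```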

ttulttul commented 7 months ago

TL;DR: It seems that if your goal is to use IPAdapter to control the look of a subject separately from the look of the background, you should send a masked-out image of the subject and also send the subject mask to the attn_mask input of the IPAdapter. Furthermore, you should apply the inverse of the subject mask to the attn_mask input of another IPAdapter if you wish to use an IPAdapter to fill in the background with its own style.

Here are some experiments to see what happens with various approaches.

First, I mask out Donald Trump from an image of him sitting at his desk, giving me an image of the man with a black background and a mask that I can use with attn_mask if I wish to. The image of Donald Trump, with or without masking the background pixels, and with or without applying the mask to attn_mask, is passed to a first IPAdapter.

Second, I send a background image of a desert to a second IPAdapter. I optionally connect the Donald Trump person mask to attn_mask on this IPAdapter, or I leave attn_mask disconnected to allow the desert image to apply to the whole generation.

There are eight experiments to run based on the following combinations:

  1. a) Full image or b) masked-out image of Donald Trump connected to IPAdapter One.
  2. Donald Trump mask a) connected to attn_mask or b) not connected to attn_mask of IPAdapter One.
  3. Donald Trump mask a) connected to attn_mask or b) not connected to attn_mask of IPAdapter Two.

I use a single CLIP positive prompt "donald trump" and no negative prompt. The image is generated at 512x512 using Epic Realism as the checkpoint, with LCM LoRA to speed things up. CFG is 1.6 and the scheduler is sgm_uniform over 20 steps.
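For reference, the eight combinations can be enumerated like this (just a bookkeeping sketch):

```python
# Enumerate the eight experiment combinations described above.
from itertools import product

subject_image = ["masked", "full"]
subject_attn_mask = ["no", "yes"]
background_attn_mask = ["no", "yes"]

for img, subj, bg in product(subject_image, subject_attn_mask, background_attn_mask):
    print(f"subject image: {img:6}  subject attn_mask: {subj:3}  background attn_mask: {bg}")
```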

Source Images and Masks

  1. Photo of Donald Trump with his desk etc.:

image

  2. Masked-out photo with a black background:

image

  3. Image mask:

image

  4. Background image of a desert:

image

  5. The general workflow; please ignore the extra IPAdapter for the face only, which is not used:

image

Experiments

| Subject Image | Subject attn_mask | Background attn_mask | Output Image |
| --- | --- | --- | --- |
| Masked | No | No | image |
| Masked | No | Yes | image |
| Masked | Yes | No | image |
| Masked | Yes | Yes | image |
| Full | No | No | image |
| Full | No | Yes | image |
| Full | Yes | No | image |
| Full | Yes | Yes | image |

ttulttul commented 7 months ago

I forgot to mention that in the above experiments, I was using a DW Open Pose detector ControlNet to position the subject quite precisely in relation to the source image. If you remove the ControlNet, the output is similar but the model has more creativity. Here is the output with the subject being masked out and attn_mask being set on the subject and background IPAdapter nodes - for fun, I also passed the output through an upscaler:

image

If we remove the attn_mask input from the subject IPAdapter, we see the model tries to fill in the rest of Donald Trump's office even though in the source image for IPAdapter, there is no office, only a black background:

image

I hypothesize this is because so many training images - likely including this image! - would have shown Donald Trump sitting at a desk, signing papers... It's worth testing out the various configurations above using other source imagery to see how it works in other circumstances with less famous subjects.

MoonMoon82 commented 7 months ago

@ttulttul Your workflow and experiments look quite similar to mine, but there's a main difference: you're masking the output, and I don't want to mask the output. Even without masking the output, the result resembles the reference image (in pose, position, etc.). As I already mentioned, the workflow I've shown already works pretty well for masking a face with ip-adapter-plus-face. For every other IPAdapter model (especially ip-adapter-plus), the results are horrible. Black-masking the reference image is just not real masking.

ttulttul commented 7 months ago

In my experience, if you put the strength of the adapter down to 50% and don’t otherwise control the pose of the subject, you can get quite flexible output but with consistent characters. Your mileage may vary, of course.

cubiq commented 7 months ago

I looked into this and it's not currently possible to mask the reference image. I'll review this in the future in case things change.

MoonMoon82 commented 7 months ago

@cubiq Thank you very much in advance!