Gourieff / sd-webui-reactor

Fast and Simple Face Swap Extension for StableDiffusion WebUI (A1111 SD WebUI, SD WebUI Forge, SD.Next, Cagliostro)
GNU Affero General Public License v3.0

[Feature]: Fix the realistic person's tongue #439

Open timmyhk852 opened 3 weeks ago

timmyhk852 commented 3 weeks ago

Feature description

The tongue cannot be generated when using this extension, so the person cannot stick their tongue out.

johndpope commented 3 weeks ago

This is a PR I drafted some time back; the plan was to ignore the bottom half of the face in the mask. Click Face Mask Correction to see the difference. https://github.com/Gourieff/sd-webui-reactor/pull/292

timmyhk852 commented 3 weeks ago

This is a PR I drafted some time back; the plan was to ignore the bottom half of the face in the mask. Click Face Mask Correction to see the difference. #292

Same result after ticking Face Mask Correction.

johndpope commented 3 weeks ago

@timmyhk852 - I tested with the tongue out, and it failed. I have had results with the mouth open, though they're not good enough. I was looking at this again today; maybe we can use MediaPipe to create the mask and cut the bottom of the mask off at the top lips, but the results will probably still disappoint. I'm wondering if the insightface / ONNX mapper backing model is simply inadequate, or if it needs another model to help here. The ReActor / roop stuff works fantastically, but it fails miserably in this use case. I'm wondering if inswapper_128.onnx could be translated back to PyTorch, and the result of the face swap somehow passed back into the pipeline, like making the ONNX model become a LoRA of sorts operating in the latent space.

For the simpler cosmetic approach - @Gourieff - did you use MediaPipe? Not sure how to articulate this, but I'm not clear on how I could pass MediaPipe coordinates to create a different mask:


    import mediapipe as mp

    # Process the image to detect face landmarks.
    # Note: mp.solutions.face_mesh is a module, so a FaceMesh instance is
    # needed before calling .process().
    self.mp_face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True)
    results = self.mp_face_mesh.process(image_rgb)

    img_h, img_w, _ = image.shape
    face_3d = []
    face_2d = []

    if results.multi_face_landmarks:
        for face_landmarks in results.multi_face_landmarks:

https://github.com/johndpope/Emote-hack/blob/main/Net.py#L941

apply_face_mask_with_exclusion(swapped_image=swapped_image, target_image=result, target_face=target_face, entire_mask_image=entire_mask_image, MEDIA_PIPE_LANDMARK_MASK_WITH_HEAD_CUT_TO_TOP_LIPS)

This is advanced lip detection from a project I was reviewing the other week: https://github.com/Zejun-Yang/AniPortrait/blob/cb86caa741d6ab1e119ea7ac2554eb28aabc631b/src/utils/face_landmark.py#L133

It's possible I could have this contained and wired up to do just this augmentation of the mask.
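To make the "cut the mask at the top lips" idea concrete, here is a minimal sketch. This is not ReActor code; the helper name and the choice of upper-lip landmark are my assumptions. It simply zeroes out the face mask below the upper-lip row, so the swap never overwrites the target's mouth region:

```python
import numpy as np

def cut_mask_to_top_lips(face_mask: np.ndarray, top_lip_y: int) -> np.ndarray:
    """Hypothetical helper: zero the mask below the top-lip row so the
    swapped face leaves the target's mouth (tongue, teeth) untouched.

    face_mask: (H, W) mask, 0/1 or 0/255.
    top_lip_y: pixel row of the upper-lip landmark, e.g. the y of a
               MediaPipe face-mesh upper-lip point scaled by image height.
    """
    cut = face_mask.copy()
    cut[top_lip_y:, :] = 0  # everything below the lip line is excluded
    return cut
```

In the MediaPipe snippet above, `top_lip_y` would come from one of the `face_landmarks.landmark` entries (normalized y multiplied by `img_h`); which landmark index best marks the upper lip is something to verify against the face-mesh topology.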

Gourieff commented 3 weeks ago

I'm wondering if inswapper_128.onnx could be translated back to PyTorch, and the result of the face swap somehow passed back into the pipeline, like making the ONNX model become a LoRA of sorts operating in the latent space.

I've been thinking about this as well... We need to do some "reverse engineering" of the inswapper model to improve it and build a new model with a 256 or 512 target input (it would be great for the community to have a truly free-licensed model with HQ output), maybe with an additional masking input, or, as you suggested, in the form of a LoRA.

About masking of parts... There is something like this in Facefusion. I've not tested it with tongues, but it works with lips and teeth. So I plan to implement such segmenting for ReActor in future updates; I just need to find free time for this.
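The Facefusion-style region masking described above boils down to running a face parser and subtracting the unwanted classes from the swap mask. A minimal sketch, assuming a parsing label map is available; the class ids below are placeholders, not real Facefusion or BiSeNet values:

```python
import numpy as np

# Placeholder class ids for a face-parsing model; the real ids depend on
# the parser's label scheme (mouth interior, upper lip, lower lip, ...).
EXCLUDE_CLASSES = (11, 12, 13)

def exclude_parsed_regions(swap_mask: np.ndarray,
                           parsing_map: np.ndarray,
                           exclude=EXCLUDE_CLASSES) -> np.ndarray:
    """Remove segmented regions (lips, teeth, tongue) from the swap mask
    so those pixels are kept from the target image."""
    keep = ~np.isin(parsing_map, exclude)
    return swap_mask * keep.astype(swap_mask.dtype)
```

The idea is that anywhere the parser labels a pixel as mouth interior, the swap mask goes to zero and the original (tongue-out) pixels survive the paste-back.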

Gourieff commented 3 weeks ago

https://github.com/johndpope/Emote-hack/blob/main/Net.py#L941

apply_face_mask_with_exclusion(swapped_image=swapped_image, target_image=result, target_face=target_face, entire_mask_image=entire_mask_image, MEDIA_PIPE_LANDMARK_MASK_WITH_HEAD_CUT_TO_TOP_LIPS)

This is advanced lip detection from a project I was reviewing the other week: https://github.com/Zejun-Yang/AniPortrait/blob/cb86caa741d6ab1e119ea7ac2554eb28aabc631b/src/utils/face_landmark.py#L133

It's possible I could have this contained and wired up to do just this augmentation of the mask.

Hm... Rather interesting... 🧐

johndpope commented 2 weeks ago

somewhat related - https://github.com/AtlantixJJ/PVA-CelebAHQ-IDI

UPDATE https://github.com/JackAILab/ConsistentID

johndpope commented 2 weeks ago

had a play with ConsistentID - IT WORKS!!!! after some faffing around - https://github.com/JackAILab/ConsistentID/issues/18

timmyhk852 commented 2 weeks ago

had a play with ConsistentID - IT WORKS!!!! after some faffing around - JackAILab/ConsistentID#18

I don't understand... so is it possible for the swapped face to stick its tongue out now?

Gourieff commented 2 weeks ago

had a play with ConsistentID - IT WORKS!!!! after some faffing around - JackAILab/ConsistentID#18

Nice! I'll take a look next week. Maybe we can combine your PR with this feature; it would be super-good.

johndpope commented 2 weeks ago

ConsistentID works by introducing a new Stable Diffusion pipeline: https://github.com/JackAILab/ConsistentID/blob/main/infer.py. I need to review other Automatic1111 plugins to get my head around this flow. @Gourieff - does any plugin come to mind?

For my needs, just plugging into infer.py is fine - just select the SD model, and you can add LoRAs.

import os

import torch
# ConsistentIDStableDiffusionPipeline is defined in the ConsistentID repo
# (see infer.py there for the exact import).

# TODO import base SD model and pretrained ConsistentID model
device = "cuda"
base_model_path = "SG161222/Realistic_Vision_V6.0_B1_noVAE"
consistentID_path = "./ConsistentID_model_facemask_pretrain_50w.bin"  # pretrained ConsistentID model
# Alternative base model: "philz1337/epicrealism"
# Get the absolute path of the current script
script_directory = os.path.dirname(os.path.realpath(__file__))

### Load base model
pipe = ConsistentIDStableDiffusionPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    use_safetensors=False
).to(device)

I had initially used Marilyn Monroe, and the results were quite good, but now the jury is out - I'm using different LoRAs and faces and the results are a bit off. They have plans to increase the input images to an array of faces. @timmyhk852 - basically the insightface model can't handle tongues / open mouths; we need to explore some "photoshopping" cut-and-paste work with masks - save the original first, and then merge the two images.
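The "save the original and merge the two images" step can be sketched as a simple mask composite (a hypothetical helper, not part of ReActor): keep the target's mouth pixels and take everything else from the swapped result.

```python
import numpy as np

def paste_back_region(original: np.ndarray,
                      swapped: np.ndarray,
                      region_mask: np.ndarray) -> np.ndarray:
    """Composite: where region_mask is 1, keep the original pixels
    (e.g. the open mouth / tongue); elsewhere use the swapped face.

    original, swapped: (H, W, 3) images of the same size.
    region_mask: (H, W) float mask in [0, 1]; soft edges blend smoothly.
    """
    m = region_mask.astype(np.float32)[..., None]
    out = swapped.astype(np.float32) * (1.0 - m) + original.astype(np.float32) * m
    return out.astype(original.dtype)
```

With a feathered mouth mask (e.g. a Gaussian-blurred version of the cut region), the seam between the swapped face and the preserved mouth would blend rather than show a hard edge.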

TimEyerish commented 2 weeks ago

Bumping the detection threshold up over 0.86 is hit and miss after the 3rd or 4th generation in a batch. Mostly it loses the mask when it does work. There's more consistency over 0.90, but by then there is no mask. Maybe there's something to be tweaked in that?
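For reference, the threshold behaviour described above is just a confidence filter over the detector's candidates. A sketch using plain dicts rather than insightface's face objects, with a hypothetical helper name:

```python
def filter_detections(faces, det_thresh=0.86):
    """Keep only detections whose confidence clears the threshold.
    Raising det_thresh drops borderline faces entirely, which is why a
    high value can leave no face (and therefore no mask) to work with.
    """
    return [f for f in faces if f["det_score"] >= det_thresh]
```

So at 0.90+ a marginal detection that survived at 0.86 is discarded outright; the swap then has nothing to mask, which matches the "no mask at all" behaviour reported above.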