Zhen-Dong / Magic-Me

Code for ID-Specific Video Customized Diffusion
https://magic-me-webpage.github.io/
Apache License 2.0

Face VCD #6

Closed · garychan22 closed 9 months ago

garychan22 commented 9 months ago

Hi, thanks for your excellent work here.

After reading the paper, I have two questions:

1) How does Face VCD derive a natural face by leveraging partial denoising to refine the face, without face keypoint control, given that I2I or V2V pipelines tend to produce inconsistent local results?
2) How is the background preserved: with a mask or something similar, or would the extended ID tokens help?

thanks!

visionMaze commented 9 months ago

1) Achieving a natural face in Face VCD is accomplished through partial denoising. This strategy keeps a substantial amount of pixel data from the original face amidst the noise, which guides the denoising process and ensures spatial consistency, so the refined output stays closely aligned with the original facial structure. Moreover, for animated content, the AnimateDiff module conditions the denoising process on multiple frames, giving smooth transitions between frames and preserving temporal consistency (a sketch combining both points follows below).

2) Preserving the background while refining the face is managed through face segmentation by SAM, which enables the model to target the facial area specifically for refinement while leaving the surrounding background intact.
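
A minimal sketch of how these two pieces can fit together, assuming a standard DDPM schedule; `partial_denoise`, `denoise_step`, and `face_mask` are illustrative names rather than this repo's API, and the real pipeline operates on latent video frames with AnimateDiff's motion module inside the denoiser:

```python
import torch

T = 1000                                   # total diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # standard DDPM noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Forward-noise clean frames x0 to diffusion level t (1 <= t <= T)."""
    noise = torch.randn_like(x0) if noise is None else noise
    a = alphas_cumprod[t - 1]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def partial_denoise(frames, face_mask, denoise_step, strength=0.4):
    """frames: (F, C, H, W) original frames; face_mask: 1 on the face, 0 elsewhere.
    `denoise_step(x, t)` stands in for one ID-conditioned reverse step (t -> t-1)."""
    t_start = int(T * strength)            # inject only partial noise, not pure noise
    x = q_sample(frames, t_start)          # original face pixels still guide denoising
    for t in range(t_start, 0, -1):
        x = denoise_step(x, t)             # refine the noisy frames toward the identity
        # Composite: keep the refined face, restore the background at a matching noise level
        bg = q_sample(frames, t - 1) if t > 1 else frames
        x = face_mask * x + (1 - face_mask) * bg
    return x
```

With a dummy `denoise_step = lambda x, t: x` this runs end to end; the masked composite inside the loop is what rewrites only the face region while the background survives untouched.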

garychan22 commented 9 months ago

@visionMaze thanks for the reply. Maybe I have misunderstood partial denoising as being similar to img2img with a smaller denoising strength (where the facial structures usually change once the strength exceeds 0.4 as well), since the details of this strategy are missing in the paper. I will dive into the code to see what is happening.
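
For what it's worth, the strength comparison comes down to how many reverse steps are actually run; this is generic img2img arithmetic, not code from this repo:

```python
# Illustrative only: with 50 sampler steps, strength 0.4 re-noises to roughly
# step 20 and denoises from there, so coarse structure (including facial
# layout) mostly survives; much higher strengths let the sampler redraw the face.
num_inference_steps = 50
for strength in (0.2, 0.4, 0.6, 0.8):
    steps_run = int(num_inference_steps * strength)  # reverse steps actually taken
    print(f"strength={strength}: run the last {steps_run} of {num_inference_steps} steps")
```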

visionMaze commented 9 months ago

The understanding is mostly correct. The difference between common partial denoising and ours is that we use the extended ID tokens to render the face of the identity. The face does change after partial denoising, but usually in a way that looks more like the identity. And don't forget the AnimateDiff module used here: it conditions the denoising on a series of frames, which is a stronger condition than img2img.
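
To make "extended ID tokens" concrete, here is a minimal textual-inversion-style sketch that registers several learned embeddings for a single subject; the token names, count, and checkpoint are assumptions for illustration, not Magic-Me's actual training code:

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Several new tokens for one identity (a count of 3 is an assumption)
id_tokens = [f"<id-{i}>" for i in range(3)]
tokenizer.add_tokens(id_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))  # add rows to the embedding table

# Only the new embedding rows would be optimized against the subject's images;
# prompts then reference the identity as e.g. "a photo of <id-0> <id-1> <id-2>"
new_ids = tokenizer.convert_tokens_to_ids(id_tokens)
print(text_encoder.get_input_embeddings().weight[new_ids].shape)  # (3, hidden_dim)
```

The point of using several tokens rather than one is extra capacity to encode a single identity, which is presumably what lets the partial denoising pull the face toward the subject.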

garychan22 commented 9 months ago

@visionMaze oh, that makes sense (sorry, I forgot that the AnimateDiff module conditions on multiple frames), thanks a lot!

askerlee commented 9 months ago

Hi, by extended ID tokens, do you mean multiple embeddings for the same subject? Thanks.