berkanz closed this issue 3 months ago
Closing, because I understood that the original inpainting pipeline also doesn't separately compute a latent for the mask; instead, it multiplies the latents of the masked image with a downsampled version of the mask itself.
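For reference, here is a minimal sketch of what the diffusers inpainting pipeline does in its `prepare_mask_latents` step (paraphrased from `StableDiffusionInpaintPipeline`; the function signature, argument names, and shapes below are my own assumptions for illustration). The key point: the mask is only interpolated down to latent resolution, and only the masked image goes through the VAE.

```python
import torch

# Paraphrased sketch of StableDiffusionInpaintPipeline.prepare_mask_latents.
# `vae`, `mask`, and `masked_image` are assumed inputs; shapes are illustrative.
def prepare_mask_latents(vae, mask, masked_image, height, width,
                         vae_scale_factor=8, scaling_factor=0.18215):
    # The mask is NOT encoded by the VAE: it is simply interpolated
    # down to the latent resolution (e.g. 512x512 -> 64x64).
    mask = torch.nn.functional.interpolate(
        mask, size=(height // vae_scale_factor, width // vae_scale_factor)
    )
    # Only the masked image is pushed through the VAE encoder.
    masked_image_latents = vae.encode(masked_image).latent_dist.sample()
    masked_image_latents = scaling_factor * masked_image_latents
    return mask, masked_image_latents
```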
The proposed technique works well, but I have a point of confusion about the pipeline. In the SD2 diff_pipe.py, the map seems to be downsampled by vae_scale_factor:
```python
map = torchvision.transforms.Resize(tuple(s // self.vae_scale_factor for s in image.shape[2:]), antialias=None)(map)
```
and then multiplied directly with the image latents:

```python
masks = map > thresholds
latents = original_with_noise[i] * mask + latents * (1 - mask)
```
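Put together, a minimal self-contained sketch of that per-step blend (the names `map`, `thresholds`, `original_with_noise`, and `latents` follow the snippets above; the concrete tensor shapes and the single `threshold` value are my assumptions):

```python
import torch
import torchvision

# Illustrative shapes: latents (1, 4, 64, 64); change map (1, 1, 512, 512) in [0, 1].
vae_scale_factor = 8
latents = torch.randn(1, 4, 64, 64)
original_with_noise_i = torch.randn(1, 4, 64, 64)  # stands in for original_with_noise[i]
change_map = torch.rand(1, 1, 512, 512)
threshold = 0.5  # stands in for the per-step entry of `thresholds`

# Downsample the change map to the latent resolution (512 // 8 = 64).
change_map = torchvision.transforms.Resize((64, 64), antialias=None)(change_map)

# Threshold into a binary mask and blend directly in latent space:
# keep the noisy original where the map exceeds the threshold,
# keep the denoised latents everywhere else. No VAE encoding of the mask.
mask = (change_map > threshold).to(latents.dtype)
latents = original_with_noise_i * mask + latents * (1 - mask)
```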
Why aren't the mask latents being computed? For example, the diffusers inpainting pipeline has a section where it computes latents for both the mask and the masked_image. Isn't such a step necessary?