berkanz closed this issue 3 months ago
Closing, because I understood that the original inpainting pipeline also doesn't separately compute a latent for the mask; instead, it multiplies the latents of the masked image with a downsampled version of the mask itself.
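For reference, here is a minimal sketch of what the diffusers inpainting pipeline does in its `prepare_mask_latents` step (paraphrased from `StableDiffusionInpaintPipeline`; the function signature, argument names, and shapes below are my own assumptions for illustration). The key point: the mask is only interpolated down to latent resolution, and only the masked image goes through the VAE.

```python
import torch

# Paraphrased sketch of StableDiffusionInpaintPipeline.prepare_mask_latents.
# `vae`, `mask`, and `masked_image` are assumed inputs; shapes are illustrative.
def prepare_mask_latents(vae, mask, masked_image, height, width,
                         vae_scale_factor=8, scaling_factor=0.18215):
    # The mask is NOT encoded by the VAE: it is simply interpolated
    # down to the latent resolution (e.g. 512x512 -> 64x64).
    mask = torch.nn.functional.interpolate(
        mask, size=(height // vae_scale_factor, width // vae_scale_factor)
    )
    # Only the masked image is pushed through the VAE encoder.
    masked_image_latents = vae.encode(masked_image).latent_dist.sample()
    masked_image_latents = scaling_factor * masked_image_latents
    return mask, masked_image_latents
```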
The proposed technique works well, but I have a point of confusion about the pipeline. In the SD2 diff_pipe.py, the map seems to be downsampled by vae_scale_factor:
```python
map = torchvision.transforms.Resize(tuple(s // self.vae_scale_factor for s in image.shape[2:]), antialias=None)(map)
```
and then multiplied directly with the image latents:

```python
masks = map > thresholds
latents = original_with_noise[i] * mask + latents * (1 - mask)
```
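Put together, a minimal self-contained sketch of that per-step blend (the names `map`, `thresholds`, `original_with_noise`, and `latents` follow the snippets above; the concrete tensor shapes and the single `threshold` value are my assumptions):

```python
import torch
import torchvision

# Illustrative shapes: latents (1, 4, 64, 64); change map (1, 1, 512, 512) in [0, 1].
vae_scale_factor = 8
latents = torch.randn(1, 4, 64, 64)
original_with_noise_i = torch.randn(1, 4, 64, 64)  # stands in for original_with_noise[i]
change_map = torch.rand(1, 1, 512, 512)
threshold = 0.5  # stands in for the per-step entry of `thresholds`

# Downsample the change map to the latent resolution (512 // 8 = 64).
change_map = torchvision.transforms.Resize((64, 64), antialias=None)(change_map)

# Threshold into a binary mask and blend directly in latent space:
# keep the noisy original where the map exceeds the threshold,
# keep the denoised latents everywhere else. No VAE encoding of the mask.
mask = (change_map > threshold).to(latents.dtype)
latents = original_with_noise_i * mask + latents * (1 - mask)
```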
Why aren't the mask latents being computed? For example, the diffusers inpainting pipeline has a section where it computes latents for both the mask and the masked_image. Isn't such a step necessary?