I appreciate this excellent work! But I'm confused about the training target and the loss function.
According to the paper, the training target is to recover the masked area of the source image with a diffusion model, whose "responsibility" is to estimate the ground-truth noise. So is the loss function noise prediction combined with masked-area prediction? What exactly is the loss function? Could you share more details about it? I would be grateful!