ali-vilab / AnyDoor

Official implementation of the paper "AnyDoor: Zero-shot Object-level Image Customization"
https://ali-vilab.github.io/AnyDoor-Page/
MIT License

Problems with the Used Loss Function #103

Open AlonzoLeeeooo opened 3 months ago

AlonzoLeeeooo commented 3 months ago

Hi @XavierCHEN34 ,

Thanks for your great work! I have been reading the published paper as well as the code implementation, and I ran into a question about the loss function that is used. It would be highly appreciated if you could explain how this works.

Here is the issue. In the paper, specifically in Eq. (2), the overall training objective of AnyDoor is an MSE loss between the U-Net output and the ground-truth image latents:

[Figure: Eq. (2), the overall training objective, from the paper]

In the code implementation, however, the regression target is controlled by self.parameterization, which is set to "eps" by default and is not overridden in the configuration file (configs/anydoor.yaml).
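As a quick check on my side (a sketch that assumes the usual LDM-style config layout of model -> params; adjust the path if the YAML is organized differently), one can load the config and confirm that no override is present:

    from omegaconf import OmegaConf

    # Sketch: check whether configs/anydoor.yaml overrides `parameterization`.
    # Assumes the usual LDM config layout (model -> params); adjust if needed.
    cfg = OmegaConf.load("configs/anydoor.yaml")
    print(cfg.model.params.get("parameterization", "<absent: 'eps' default applies>"))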

Consequently, in the get_loss() and p_losses() functions of ldm/models/diffusion/ddpm.py (line 367 to line 411), we can see:

    def get_loss(self, pred, target, mean=True):
        if self.loss_type == 'l1':
            loss = (target - pred).abs()
            if mean:
                loss = loss.mean()
        elif self.loss_type == 'l2':
            if mean:
                loss = torch.nn.functional.mse_loss(target, pred)
            else:
                loss = torch.nn.functional.mse_loss(target, pred, reduction='none')
        else:
            raise NotImplementedError("unknown loss type '{loss_type}'")

        return loss

    def p_losses(self, x_start, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
        model_out = self.model(x_noisy, t)

        loss_dict = {}
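        # Pick the regression target for the configured parameterization:
        # "eps" -> the sampled Gaussian noise, "x0" -> the clean latents,
        # "v"   -> the velocity target mixing noise and x_start.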
        if self.parameterization == "eps":
            target = noise
        elif self.parameterization == "x0":
            target = x_start
        elif self.parameterization == "v":
            target = self.get_v(x_start, noise, t)
        else:
            raise NotImplementedError(f"Parameterization {self.parameterization} not yet supported")

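        # Per-sample loss: average over channel and spatial dims only.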
        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3])

        log_prefix = 'train' if self.training else 'val'

        loss_dict.update({f'{log_prefix}/loss_simple': loss.mean()})
        loss_simple = loss.mean() * self.l_simple_weight

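        # Variational-bound term: the same loss reweighted per timestep; note
        # that original_elbo_weight defaults to 0 in stock LDM, so this term
        # is typically only logged and does not affect the gradient.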
        loss_vlb = (self.lvlb_weights[t] * loss).mean()
        loss_dict.update({f'{log_prefix}/loss_vlb': loss_vlb})

        loss = loss_simple + self.original_elbo_weight * loss_vlb

        loss_dict.update({f'{log_prefix}/loss': loss})

        return loss, loss_dict

With self.parameterization == "eps", target becomes the sampled Gaussian noise, so the loss function is the MSE between the U-Net output and random Gaussian noise. This conflicts with the objective shown in the paper manuscript.
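Written out in standard DDPM notation (my paraphrase, with $z_t$ produced by q_sample() as in the code above), the two objectives in question are:

$$
\mathcal{L}_{\text{eps}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2\right],
\qquad
\mathcal{L}_{x_0} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left[\lVert z_0 - \hat{z}_\theta(z_t, t) \rVert_2^2\right],
\qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon .
$$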

According to Eq. (2) in the paper manuscript, I suppose that self.parameterization should be set to "x0", so that target becomes x_start and the code implementation aligns with the formula. Is my understanding correct? Please enlighten me if I have gotten anything wrong. Looking forward to your reply.
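For completeness, here is a minimal self-contained sketch (my own, using standard DDPM algebra rather than the repo's code) of how the two targets relate through the q_sample() reparameterization:

    import torch

    # Standard DDPM algebra (my own sketch, not AnyDoor code). `a_bar` stands
    # for the cumulative alpha-bar at some timestep t. A perfect noise
    # prediction recovers the clean latents exactly, so the "eps" and "x0"
    # targets carry the same information up to a per-timestep weighting.
    a_bar = torch.tensor(0.7)
    x_start = torch.randn(1, 4, 8, 8)          # ground-truth latents z_0
    noise = torch.randn_like(x_start)          # sampled Gaussian noise
    x_noisy = a_bar.sqrt() * x_start + (1 - a_bar).sqrt() * noise  # q_sample()

    eps_hat = noise                            # pretend the U-Net is perfect
    x0_hat = (x_noisy - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
    assert torch.allclose(x0_hat, x_start, atol=1e-5)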

Best regards

mao-code commented 1 month ago

Same question here