ShoufaChen / DiffusionDet

[ICCV2023 Best Paper Finalist] PyTorch implementation of DiffusionDet (https://arxiv.org/abs/2211.09788)

Question about `scale` factor #97

Open lunaryle opened 1 year ago

lunaryle commented 1 year ago

Hi, @ShoufaChen thank you for the great work.

I am a beginner with diffusion models, and I have a question about the scale factor applied in prepare_diffusion_concat() and ddim_sample(). I understand that the signal-to-noise ratio is essential, as you mention in Section 4.4.

In the implementation code for noise sampling, you shift and scale x_start before q_sample() and then shift/scale it back to produce the model's diffused input.

  x_start = (x_start * 2. - 1.) * self.scale

  # noise sample
  x = self.q_sample(x_start=x_start, t=t, noise=noise)

  x = torch.clamp(x, min=-1 * self.scale, max=self.scale)
  x = ((x / self.scale) + 1) / 2.

  diff_boxes = box_cxcywh_to_xyxy(x)
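To check my understanding of this step, here is a minimal sketch of the mapping as I read it, with a toy scalar alpha-bar standing in for the real noise schedule (the scale value and alpha_bar here are made up by me, not taken from the repo):

```python
import torch

scale = 2.0                      # hypothetical SNR scale factor
alpha_bar = torch.tensor(0.5)    # assumed cumulative alpha at some step t

# toy normalized boxes in [0, 1], (cx, cy, w, h)
x_start = torch.tensor([[0.5, 0.5, 0.2, 0.3]])

# shift/scale: [0, 1] -> [-scale, scale]
x_scaled = (x_start * 2. - 1.) * scale

# minimal q_sample: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise
noise = torch.randn_like(x_scaled)
x_t = alpha_bar.sqrt() * x_scaled + (1. - alpha_bar).sqrt() * noise

# clamp and map back to [0, 1] before converting to xyxy for the head
x_t = torch.clamp(x_t, min=-scale, max=scale)
diff_boxes = (x_t / scale + 1.) / 2.
```

So after the inverse map, the diffused boxes are guaranteed to lie in [0, 1] again, which is what the detection head sees during training. Is that the intended reading?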

However, for inference you divide x_boxes by the scale first and then scale back by multiplying. I thought the division by scale was not needed, because the model learns from diffused boxes that have already been scaled back, as in prepare_diffusion_concat(). And even if scaling is needed for predicting the noise, I thought the order should be reversed, just as at the noising step, so that the conditions are identical.

  x_boxes = ((x_boxes / self.scale) + 1) / 2
  x_boxes = box_cxcywh_to_xyxy(x_boxes)
  x_boxes = x_boxes * images_whwh[:, None, :]
  outputs_class, outputs_coord = self.head(backbone_feats, x_boxes, t, None)

  x_start = outputs_coord[-1]  # (batch, num_proposals, 4) predict boxes: absolute coordinates (x1, y1, x2, y2)
  x_start = x_start / images_whwh[:, None, :]
  x_start = box_xyxy_to_cxcywh(x_start)
  x_start = (x_start * 2 - 1.) * self.scale
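If I read the two snippets right, the forward and inverse maps are exact affine inverses, so composing them within one sampling step is a no-op on the coordinates themselves, and only the range the DDIM update operates in changes. A quick check (to_signal / to_boxes are my own names for the two maps):

```python
import torch

scale = 2.0  # hypothetical scale value

def to_signal(x):
    # [0, 1] -> [-scale, scale], as in prepare_diffusion_concat()
    return (x * 2. - 1.) * scale

def to_boxes(x):
    # [-scale, scale] -> [0, 1], as at the start of ddim_sample()
    return (x / scale + 1.) / 2.

x = torch.rand(4, 4)                       # toy normalized boxes in [0, 1]
y = (torch.rand(4, 4) * 2. - 1.) * scale   # toy signal values in [-scale, scale]
assert torch.allclose(to_boxes(to_signal(x)), x, atol=1e-6)
assert torch.allclose(to_signal(to_boxes(y)), y, atol=1e-5)
```

So my question is really about which of the two ranges the DDIM recurrence is meant to run in, not about the maps themselves.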

Could you explain why the scale is still applied at the inference stage, i.e. why the input is divided by the scale?

And one more question: why is self.ddim_sampling_eta set to 1 at initialization? Shouldn't eta be 0 for DDIM?
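For context on the eta question: my understanding is that eta interpolates between deterministic DDIM sampling (eta = 0) and DDPM-like stochastic sampling (eta = 1), via the sigma_t formula from the DDIM paper (Song et al., 2021). A small sketch with made-up alpha-bar values for two adjacent steps:

```python
import math

def ddim_sigma(eta, alpha_bar, alpha_bar_prev):
    # sigma_t from the DDIM paper: eta scales the per-step noise
    return eta * math.sqrt((1. - alpha_bar_prev) / (1. - alpha_bar)) \
               * math.sqrt(1. - alpha_bar / alpha_bar_prev)

alpha_bar, alpha_bar_prev = 0.5, 0.7  # made-up schedule values

sigma_det = ddim_sigma(0.0, alpha_bar, alpha_bar_prev)  # deterministic DDIM
sigma_sto = ddim_sigma(1.0, alpha_bar, alpha_bar_prev)  # stochastic, DDPM-like
```

With eta = 0 the injected noise vanishes entirely, so I was surprised to see the default of 1 in a method that describes itself as using DDIM.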

I would appreciate any feedback.