NVlabs / ODISE

Official PyTorch implementation of ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models [CVPR 2023 Highlight]
https://arxiv.org/abs/2303.04803

Some questions about the code #36

Closed · haohang96 closed this issue 10 months ago

haohang96 commented 10 months ago

Thank you for your outstanding work.

I have thoroughly reviewed the paper and the code. Most of it is clear and understandable. However, I find the following lines rather perplexing, involving `self.alpha_cond` and `self.alpha_cond_time_embed`:

```python
self.alpha_cond = nn.Parameter(torch.zeros_like(self.ldm_extractor.ldm.uncond_inputs))
self.alpha_cond_time_embed = nn.Parameter(torch.zeros(self.ldm_extractor.ldm.unet.time_embed[-1].out_features))
```

It appears that `self.alpha_cond` and `self.alpha_cond_time_embed` are used to interact with the prefix embeddings (as referenced here), which are generated by the Implicit Captioner. The result of this interaction is then fed into the Latent Diffusion Model.

I'm curious about the necessity of the following operation (as mentioned here):

```python
batched_inputs["cond_inputs"] = (
    self.ldm_extractor.ldm.uncond_inputs + torch.tanh(self.alpha_cond) * prefix_embed
)
```

It seems that we could directly feed `prefix_embed` into the Latent Diffusion Model. I would like to understand the purpose and rationale behind introducing `self.alpha_cond` and `self.alpha_cond_time_embed`. Has any previous work employed such an operation?
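For concreteness, here is a minimal, self-contained sketch of the quoted interaction. The tensor shapes are hypothetical, chosen to match the usual 77-token, 768-dim Stable Diffusion text-conditioning tensor; ODISE's actual shapes may differ. One property follows directly from the quoted code: because `alpha_cond` is zero-initialized and `tanh(0) = 0`, `cond_inputs` equals `uncond_inputs` exactly at the start of training.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 77 text tokens, embedding dim 768 (the usual
# Stable Diffusion text-conditioning tensor).
uncond_inputs = torch.randn(1, 77, 768)  # the LDM's "null text" embedding
prefix_embed = torch.randn(1, 77, 768)   # produced by the Implicit Captioner

# Learnable gate, zero-initialized exactly as in the quoted code.
alpha_cond = nn.Parameter(torch.zeros_like(uncond_inputs))

# The interaction in question: a tanh-gated blend of the two embeddings.
cond_inputs = uncond_inputs + torch.tanh(alpha_cond) * prefix_embed

# Since tanh(0) == 0, the blend is an exact no-op at initialization.
assert torch.equal(cond_inputs, uncond_inputs)
```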

I look forward to your response. Thank you very much.

haohang96 commented 10 months ago

Thank you, @shalinidemello and @xvjiarui, for this great work. Could you please advise on the questions above?

xvjiarui commented 10 months ago

Hi @haohang96

Sorry for the late reply.

This operation is not used very often in previous works. We use it to make the diffusion features effectively conditioned on our input images.
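One way to read this design: the zero-initialized, tanh-bounded gate means the frozen diffusion model initially receives exactly its original unconditional inputs, and training can gradually open the gate to blend in the image-derived embedding, rather than handing the frozen UNet a brand-new input distribution from step one. The pattern is reminiscent of the zero-initialized tanh gating in a few prior works (e.g., Flamingo's gated cross-attention layers). Below is a hypothetical sketch of that pattern; the class name and shapes are illustrative, not ODISE's actual module:

```python
import torch
import torch.nn as nn

class TanhGatedConditioning(nn.Module):
    """Hypothetical illustration of the gating pattern discussed above."""

    def __init__(self, uncond_inputs: torch.Tensor):
        super().__init__()
        # Frozen unconditional ("null text") embedding of the diffusion model.
        self.register_buffer("uncond_inputs", uncond_inputs)
        # Learnable per-element gate; closed (all zeros) at initialization.
        self.alpha_cond = nn.Parameter(torch.zeros_like(uncond_inputs))

    def forward(self, prefix_embed: torch.Tensor) -> torch.Tensor:
        # tanh bounds the gate to (-1, 1); zeros at init make this a no-op,
        # so the frozen UNet starts from inputs it was pretrained on.
        return self.uncond_inputs + torch.tanh(self.alpha_cond) * prefix_embed

# Illustrative usage with hypothetical shapes.
gate = TanhGatedConditioning(torch.randn(1, 77, 768))
out = gate(torch.randn(1, 77, 768))  # equals uncond_inputs until the gate opens
```

Under this reading, `self.alpha_cond_time_embed` would play the analogous role for the time-embedding pathway, though the exact wiring there is best checked against the ODISE source.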

haohang96 commented 10 months ago

Thanks for your reply, I got it :)