Stability-AI / generative-models


Question about ADD (Adversarial Diffusion Distillation): the inputs of the teacher nets #246

Open YilanWang opened 7 months ago

YilanWang commented 7 months ago

In ADD, why are the inputs to the teacher nets the denoised results, rather than the same noise inputs as the student nets? Many thanks!

jon-chuang commented 6 months ago

One plausible reason is that the timesteps s and t are sampled independently for the student and teacher models, respectively. However, this is not a hard obstacle.

The simpler explanation is that it inherits from score distillation sampling (SDS), which originates in the 3D-generation domain (see DreamFusion), where the inputs to the teacher (an image model) and the student (a 3D model producing a differentiable 2D rendering) differ vastly.

This raises the open question of whether feeding the noised original image, rather than the noised student output, leads to better or worse results (whether from a final-loss or a loss-curve perspective).
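To make the setup concrete, here is a minimal sketch of one ADD distillation step in which the teacher input is built from the student output. All names here (`student`, `teacher`, `alpha`, `sigma`) are hypothetical stand-ins, not the repo's actual API; `alpha(t)` / `sigma(t)` are assumed to return tensors broadcastable against image batches.

```python
import torch

def add_distillation_step(student, teacher, x0, alpha, sigma):
    """Sketch of one ADD distillation step (hypothetical names throughout)."""
    b = x0.shape[0]

    # Student sees data noised at its own independently sampled timestep s.
    s = torch.randint(0, 1000, (b,), device=x0.device)
    eps_s = torch.randn_like(x0)
    x_s = alpha(s) * x0 + sigma(s) * eps_s
    x_student = student(x_s, s)  # one-step denoised sample

    # Teacher sees the *re-noised student output* (not x_s, not x0),
    # at an independently sampled teacher timestep t.
    t = torch.randint(0, 1000, (b,), device=x0.device)
    eps_t = torch.randn_like(x_student)
    x_t = alpha(t) * x_student + sigma(t) * eps_t
    with torch.no_grad():
        x_teacher = teacher(x_t, t)  # frozen teacher's denoised estimate

    # Stop-gradient on the teacher target: gradient reaches the student
    # only through x_student.
    return torch.mean((x_student - x_teacher) ** 2)
```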

jon-chuang commented 6 months ago

> rather than the same noise inputs as the student nets

Note that your original suggestion is invalid, due to the differing choice of noise sampling: s and t are drawn independently, so the student and teacher cannot share the exact same noised input.

What is valid is the question of which input to noise.
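Concretely, the two valid candidates differ only in which sample gets noised before being fed to the teacher (same hypothetical names as the sketch above):

```python
# What ADD/SDS does: noise the student's generated output.
x_t_from_student = alpha(t) * x_student + sigma(t) * eps_t

# The alternative raised above: noise the original data sample instead.
x_t_from_data = alpha(t) * x0 + sigma(t) * eps_t
```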

jon-chuang commented 6 months ago

@qp-qp I'm not sure if you're at liberty to share, but I think this is an interesting question that is worth shedding light on.

I have a feeling that feeding the original image may lead to degenerate results, as it simply amplifies the original dataset.

If the teacher model is perfectly faithful to the dataset, you would simply reproduce training on the original dataset.

Perhaps what is beneficial about distillation, i.e. feeding the generated output, is that it generates new diversity for the teacher model to provide feedback on.

However, I am uncertain whether any of these thoughts are valid.

jon-chuang commented 6 months ago

Actually, if you look at the definition of SDS, you will see that it is important to use the noised generated output. That's because you can then interpret the loss in a nice way: it simplifies to predicting the noise at timestep t.
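For reference, a reconstruction of the SDS gradient as defined in DreamFusion (notation assumed here: x = g(θ) is the student's generated output, ε̂_φ the frozen teacher's noise prediction, w(t) a timestep weighting):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
  \qquad x_t = \alpha_t\, x + \sigma_t\,\epsilon,\quad x = g(\theta).
```

The residual ε̂_φ(x_t; t) − ε is exactly "predicting the noise at timestep t", and it is only well defined because x_t was built by noising the generated x with a known ε.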

I think the intuition is that we need the generated output to be in the distribution of the teacher model. Using the noised student output also seems to give a better objective, at least mathematically: one has a ground-truth noise to compare against.