VinAIResearch / WaveDiff

Official Pytorch Implementation of the paper: Wavelet Diffusion Models are fast and scalable Image Generators (CVPR'23)
GNU Affero General Public License v3.0

About timesteps setting for train and test #1

Closed: vvictoryuki closed this issue 1 year ago

vvictoryuki commented 1 year ago

Thanks for your excellent work! The sampling speed is amazing and really useful for other researchers to follow.

However, I still have several questions about the timesteps setting:

Q1: Why is the number of timesteps for all datasets so small (2 or 4)? As far as I know, the number of training timesteps for many diffusion models is set to hundreds of steps, and during sampling, various acceleration algorithms (for example, DDIM) are used to achieve sampling within tens of steps.

Q2: Does using fewer steps weaken the advantages of diffusion models over other generative models? For example, in the extreme case where the number of timesteps is 1, the diffusion model looks quite similar to StyleGAN.

Q3: Can the proposed Wavelet Diffusion Models work if I set the number of timesteps to 1000? I tested the sampling speed with the number of sampling timesteps set to 100, and it was still quick enough, so I do not think that fewer sampling steps are the key to the remarkable sampling speed.

The reason I pay so much attention to the number of timesteps is that many algorithms are based on editing the intermediate results generated by diffusion models. The proposed Wavelet Diffusion Models have a clear sampling-speed advantage but are limited by the number of timesteps, so it seems that algorithms based on editing intermediate results cannot be used effectively with them.

hao-pt commented 1 year ago

We appreciate your interest in our paper. Q1. Our method is built upon DDGAN, which requires far fewer sampling steps (e.g., 2 to 4) to generate an image. This is because DDGAN leverages an adversarial training objective to approximate the non-Gaussian denoising distribution that arises with large step sizes (fewer steps induce a larger variance of added noise per step), so the true denoising distribution $q(x_{t-1}|x_t)$ no longer follows a unimodal Gaussian as it does in conventional DPMs with small step sizes. Hence, we keep the same sampling steps as DDGAN for benchmarking on different datasets.
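For intuition, DDGAN-style sampling is just a handful of large denoising jumps, each driven by an adversarially trained generator that predicts $x_0$. The sketch below is purely illustrative: `generator` and `posterior` are hypothetical stand-ins, not this repository's API.

```python
import torch

@torch.no_grad()
def ddgan_sample(generator, posterior, x_T, num_steps=4, latent_dim=100):
    """Minimal sketch of DDGAN-style sampling with very few steps.

    Assumes a `generator(x_t, z, t)` that predicts x_0 from a noisy input
    and a latent code z, and a `posterior(x_0, x_t, t)` that draws a sample
    from q(x_{t-1} | x_t, x_0). Both names are illustrative placeholders.
    """
    x_t = x_T
    for t in reversed(range(num_steps)):
        z = torch.randn(x_t.size(0), latent_dim, device=x_t.device)
        x_0_pred = generator(x_t, z, t)   # adversarially trained denoiser
        x_t = posterior(x_0_pred, x_t, t) # one large denoising jump
    return x_t                            # at t = 0 this is the final image
```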

Q2. Yes, fewer steps increase the complexity of the denoising distribution at each step, so there is a trade-off between the number of steps and model performance; the large number of sampling steps is the main bottleneck of existing methods. If the number is 1, the model reduces to a GAN-like model that directly models the full, complex distribution to generate images in one shot. In contrast, our approach (like DDGAN) decomposes the denoising process into a few steps, each of which is simpler to model than a single GAN mapping. Besides, the choice of sampling steps varies per dataset.

Q3. We are not certain that our method (or DDGAN) will work with larger timesteps. However, the DDGAN authors pointed out that training becomes more challenging when the number of timesteps is increased (> 4). Please check here for more details.

The core reason for our speed gain is the 4-fold reduction of spatial dimensions by the wavelet transform, in addition to the small number of required sampling steps. Given an input image $X$ of shape $C \times H \times W$, it is transformed into a $4C \times H/2 \times W/2$ tensor, which is then projected to the base width $D$ by a linear layer in the denoising network, keeping the network width unchanged compared with DDGAN. Hence, each sampling step is significantly cheaper thanks to the 4x reduction of input spatial dimensions. Last but not least, our method aims to facilitate research and real-time applications of diffusion models, and the speedup is even more beneficial for high-resolution images (e.g., 512x512 and 1024x1024).
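To make the shape bookkeeping concrete, here is a minimal single-level Haar wavelet transform in PyTorch. It is a sketch assuming orthonormal Haar filters; the repository's actual implementation (subband ordering and normalization) may differ.

```python
import torch

def haar_dwt(x: torch.Tensor) -> torch.Tensor:
    """Single-level Haar transform: (B, C, H, W) -> (B, 4C, H/2, W/2).

    Each 2x2 pixel block is mapped to one approximation coefficient and
    three detail coefficients, halving each spatial dimension. Subband
    naming below follows one common convention; others swap LH and HL.
    """
    a = x[:, :, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # approximation (low-low)
    lh = (a + b - c - d) / 2  # detail along one axis
    hl = (a - b + c - d) / 2  # detail along the other axis
    hh = (a - b - c + d) / 2  # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)

x = torch.randn(1, 3, 256, 256)
print(haar_dwt(x).shape)  # torch.Size([1, 12, 128, 128])
```

With the channel count absorbed by the first projection layer, the denoising network then operates entirely at the reduced resolution, which is where the per-step savings come from.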

Editing is beyond the scope of our work, but we look forward to seeing progress in later works. Since the method is GAN-based, it is feasible to apply relevant techniques from GANs, or hybrid solutions drawing on both families, to perform editing.