Thank you for your question. The timestep of the student model is fixed to 399 because the student needs to perform one-step generation, and setting the timestep to 399 produces the best generation results. However, during training we need to compute the difference between the predicted scores of the two diffusion models ($s_\theta$ and $s_\phi$) to update the generator, which requires minimizing the KL divergence between the two diffusion models' distributions across all timesteps (as proven in Theorem 1 of ProlificDreamer). In this case, the noise level is randomly sampled from the range between $\text{min\_step} = 0.02 \times 1000$ and $\text{max\_step} = 0.98 \times 1000$. Therefore, the two timesteps are different.
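For readers following along, here is a minimal sketch of the two timestep choices described above. It is a simplified illustration, not the repo's exact code; the variable names and the inclusive sampling range are assumptions.

```python
import torch

num_train_timesteps = 1000
min_step = int(0.02 * num_train_timesteps)   # 20
max_step = int(0.98 * num_train_timesteps)   # 980

# The one-step student generator always runs at a fixed timestep.
generator_timestep = 399

# The two score networks (s_theta and s_phi) are evaluated at a randomly
# sampled noise level, since the KL divergence is minimized across all timesteps.
batch_size = 4
t = torch.randint(min_step, max_step + 1, (batch_size,))  # one noise level per sample
```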
Got it, thanks for your reply. I also have another question: the settings reported in the paper for distilling SDXL differ from those in the GitHub repo; which settings should I use to reproduce the results? Also, in the degradation stage the generator and the student seem to be initialized with different weights, which confuses me:
```python
generator_path = os.path.join(args.ckpt_only_path, "pytorch_model.bin")
guidance_path = os.path.join(args.ckpt_only_path, "pytorch_model_1.bin")
```
In the code, the generator is initialized from the DMD2 checkpoint, while guidance_model is not initialized (that line is commented out, and self.fsdp is set to False).
I now have my own SD15-based model that I want to distill, so I would like to figure out the right way to initialize it.
I will be very grateful if you can answer these questions!
The distillation setup may vary depending on the training environment, such as different batch sizes. For other models, the degradation steps and distillation settings also differ. If you want to distill your own SD15 model, you may need to adjust these settings to your specific setup to achieve the best results.
During the degradation stage, the goal is to tune the teacher model to fit the student generator's distribution. At this point we only need to update the teacher model with the standard diffusion-model objective (score matching). In degradation.py, for simplicity, we use guidance_model as the teacher model being updated, which lets us reuse the code already set up in train.py: during distillation, guidance_model is updated in exactly the same way as a conventional diffusion model, which matches how the teacher is optimized in the degradation stage.

In lines 72-79 of degradation.py, we only load the parameters of the generator and do not load any parameters into guidance_model. This is because, at initialization, guidance_model uses the SDXL model from Hugging Face, so if no new parameters are loaded it is identical to the teacher model. I'm not sure if this answers your questions. If anything is still unclear or you'd like to discuss further details, please feel free to ask!
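To make the initialization concrete, below is a minimal sketch of the scheme described above, assuming a diffusers-style SDXL UNet and that the checkpoint files are plain UNet state dicts; the model id, checkpoint path, and variable names are illustrative, not the repo's exact code.

```python
import os
import torch
from diffusers import UNet2DConditionModel

sdxl_id = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed base model id
ckpt_only_path = "./checkpoints/dmd2_sdxl"            # hypothetical checkpoint dir

# Both networks start from the Hugging Face SDXL weights.
generator = UNet2DConditionModel.from_pretrained(sdxl_id, subfolder="unet")
guidance_model = UNet2DConditionModel.from_pretrained(sdxl_id, subfolder="unet")

# Only the generator loads the distilled (DMD2-style) weights; guidance_model
# keeps its SDXL initialization, so it is identical to the teacher at the
# start of the degradation stage.
generator_path = os.path.join(ckpt_only_path, "pytorch_model.bin")
generator.load_state_dict(torch.load(generator_path, map_location="cpu"))
```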
Thanks again
Hi, respect for your awesome work! I have a question about the training. In the backtracking stage, the generator's timestep is fixed to 399, while the timesteps of the student and teacher are randomly sampled from min_step (0.02 × 1000) to max_step (0.98 × 1000). Would it be more reasonable to set min_step higher than 399 (i.e., the generator's step)?